mapbox / osm-wayback

Scalable RocksDB index from OSM planet to lookup historic OSM objects.
BSD 3-Clause "New" or "Revised" License

Sharding / Planet Scale #16

Open jenningsanderson opened 7 years ago

jenningsanderson commented 7 years ago

The bad news is that currently this does not scale to the planet. I'm not sure what the major bottleneck or problem is here, but the indexes stop growing at ~60GB.

Solutions:

  1. Shard geographically?

    + Geographic Indexes make sense; we know where the data should be
    - Size cannot be guaranteed; we will run into the same problems.
    - Convoluted geographic boundaries. Even by country, Germany / France would need to be split into multiple regions within the year.
  2. Shard by ID & Type?

    • Will need to maintain open connections to all databases when reading (and maybe writing, if input file is not perfectly sorted)
    • Better control of size / predictability. 1 Billion Entries per Node DB, 500M entries per Way DB
    • Easy to know which DB to look up in based on the object type and ID; simpler keys available (see the sketch below)
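
A minimal sketch of what the option-2 routing could look like; the bucket sizes and the shard_name helper are hypothetical and only meant to show that the target DB is trivially derived from type + ID:

#include <cstdint>
#include <string>

// Hypothetical shard routing: bucket by object type, then by ID range
// (1B node IDs per node DB, 500M way IDs per way DB, as floated above;
// the relation bucketing is just a guess).
enum class OSMType { Node, Way, Relation };

std::string shard_name(OSMType type, std::uint64_t osm_id) {
    switch (type) {
        case OSMType::Node:     return "nodes_"     + std::to_string(osm_id / 1000000000ULL);
        case OSMType::Way:      return "ways_"      + std::to_string(osm_id / 500000000ULL);
        case OSMType::Relation: return "relations_" + std::to_string(osm_id / 500000000ULL);
    }
    return "";  // unreachable for valid enum values
}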

Not a major, major priority, but worth investigating.

/cc @lukasmartinelli

lukasmartinelli commented 7 years ago

Investigating the root problem a bit. I don't really want to hack my way around a bug in the core program.

But if so, then sharding by ID and type probably makes the most sense, because when augmenting we can choose the correct DB to read from.

jenningsanderson commented 7 years ago

Alright, we're making progress -- at least with logging!

There are multiple things happening here. We need to better tune the flushing / compacting; currently it's stalling every 45 seconds or so. However, the database is still refusing to grow past ~62GB and is only creating 1011 .sst files (the same as every run before it).

Also, it's running about 10-15x slower than before. Letting it run to see final output with better logging.

ubuntu@ip-10-0-0-246:/data2/global-tag-history$ tail PLANET_3/LOG
2017/08/11-04:19:13.124102 7f78ce9a5a00 [WARN] [db/column_family.cc:629] [nodes] Stopping writes because we have 23 immutable memtables (waiting for flush), max_write_buffer_number is set to 2
2017/08/11-04:20:06.000575 7f78ce9a5a00 [db/db_impl_write.cc:734] [nodes] New memtable created with log file: #3. Immutable memtables: 23.
2017/08/11-04:20:06.000623 7f78ce9a5a00 [WARN] [db/column_family.cc:629] [nodes] Stopping writes because we have 24 immutable memtables (waiting for flush), max_write_buffer_number is set to 2
2017/08/11-04:20:58.536920 7f78ce9a5a00 [db/db_impl_write.cc:734] [nodes] New memtable created with log file: #3. Immutable memtables: 24.
2017/08/11-04:20:58.536970 7f78ce9a5a00 [WARN] [db/column_family.cc:629] [nodes] Stopping writes because we have 25 immutable memtables (waiting for flush), max_write_buffer_number is set to 2
2017/08/11-04:21:47.940815 7f78ce9a5a00 [db/db_impl_write.cc:734] [nodes] New memtable created with log file: #3. Immutable memtables: 25.
2017/08/11-04:21:47.940854 7f78ce9a5a00 [WARN] [db/column_family.cc:629] [nodes] Stopping writes because we have 26 immutable memtables (waiting for flush), max_write_buffer_number is set to 2
2017/08/11-04:22:19.835942 7f78ce9a5a00 [db/db_impl_write.cc:734] [nodes] New memtable created with log file: #3. Immutable memtables: 26.
2017/08/11-04:22:19.835981 7f78ce9a5a00 [WARN] [db/column_family.cc:629] [nodes] Stopping writes because we have 27 immutable memtables (waiting for flush), max_write_buffer_number is set to 2
2017/08/11-04:22:56.507068 7f78ce9a5a00 [db/db_impl_write.cc:734] [nodes] New memtable created with log file: #3. Immutable memtables: 27.
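
The stalls line up with flushes not keeping up: the [nodes] column family keeps stacking immutable memtables while max_write_buffer_number is still 2. A sketch of options that might relieve the stall (untested guesses, and assuming the column families actually inherit these from db_options):

#include <rocksdb/env.h>
#include <rocksdb/options.h>

// Sketch only: let more memtables queue up before writes stop, and give
// RocksDB more background threads to flush them.
rocksdb::Options make_flush_friendly_options() {
    rocksdb::Options db_options;
    db_options.allow_mmap_writes = false;             // as in db.hpp
    db_options.max_background_flushes = 4;            // as in db.hpp
    db_options.max_write_buffer_number = 6;           // the log shows 2, which stalls every flush cycle
    db_options.min_write_buffer_number_to_merge = 2;  // flush a couple of memtables at a time
    db_options.env->SetBackgroundThreads(4, rocksdb::Env::Priority::HIGH);  // size of the flush thread pool
    return db_options;
}
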
jenningsanderson commented 7 years ago

Aha! This arbitrary limit is a function of our compaction approach (maybe we need to switch to level compaction?). This is because we are currently leaving every .sst file open for write, which is running up against system open file limits, which on my box are 1024. Add a log file and some other processes, and we get our 1011 .sst file limit.

We may need to change the compaction type to better reflect our use case? (We don't know how many files we may need to keep open?)

Further, we need to increase individual file sizes; we are currently hitting this limit (with the current options) at ~100M nodes; the planet is 680M nodes. With ways, etc., we need to be preparing for ~1 TB for the planet index (which makes sense; that's about how big the planet XML is).

Just thinking aloud, new limits to consider:

  • 4096 max files (need to change OS settings to allow)
  • 256MB target file size (I currently have this set, but it's still only creating 64MB .sst files :/ )
  • = ~1 TB of space we can add

I just arbitrarily upped my file limits to 16k, let's see how long it runs.
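
For reference, the same bump can be made from inside the process at startup; raise_nofile_limit below is a hypothetical helper (plain POSIX, not something in osm-wayback), equivalent to running ulimit -n before the import:

#include <sys/resource.h>
#include <cstdio>

// Hypothetical helper: raise this process's open-file soft limit toward the
// hard limit, the in-process equivalent of `ulimit -n 16384`.
bool raise_nofile_limit(rlim_t wanted) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) return false;
    rl.rlim_cur = (wanted < rl.rlim_max) ? wanted : rl.rlim_max;  // soft limit cannot exceed the hard limit
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0) return false;
    std::printf("open file limit now %llu (hard limit %llu)\n",
                (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);
    return true;
}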

**Update: raising ulimits solves the problem of creating the global GeoJSON output file.**

For the global-scale add_tags run, we see 482M history values found in total for the 545M input features. The lookup failure rate is about 4.3%, which isn't great, but it's on par with other regions. This is likely due to the missing versions where all tags were deleted.

New Problems: Creating an mbtile file from this much data with tippecanoe is super inefficient. Geographical Sharding seems like the best idea to create Country / Region histories.

Alternatively, maybe there are better ways to encode this with tippecanoe? If we could disable parts of the metadata table, that might be a good start. Once we get the import / add_tags a bit more ironed out, it might be worth asking Eric Fischer what makes the most sense for tiling.

lukasmartinelli commented 7 years ago

Creating an mbtile file from this much data with tippecanoe is super inefficient. Geographical Sharding seems like the best idea to create Country / Region histories.

Did tippecanoe never complete?

Just thinking aloud, new limits to consider:

4096 max files (need to change OS settings to allow)
256MB target file size (I currently have this set, but it's still only creating 64MB .sst files :/ )

I think setting ulimits is fine on the machine. That's recommended for a lot of databases afaik. We can also set the max open file handles for RocksDB itself.

We may need to change the compaction type to better reflect our use case? (We don't know how many files we may need to keep open?)

Even with compaction always enabled, it hits that limit. If I execute it with compaction disabled, writes start to fail.

Stored ~215815854/320000000 nodes
--- a/db.hpp
+++ b/db.hpp
@@ -80,7 +80,7 @@ public:
         rocksdb::Options db_options;
         db_options.allow_mmap_writes = false;
         db_options.max_background_flushes = 4;
-        db_options.PrepareForBulkLoad();
+        //db_options.PrepareForBulkLoad();
lukasmartinelli commented 7 years ago

This is because we are currently leaving every .sst file open for write, which is running up against system open file limits, which on my box are 1024.

This is the default config https://github.com/facebook/rocksdb/blob/dfa6c23c4b6589479df998701368336f07e8912c/include/rocksdb/options.h#L386-L392

Trying this now with a set limit and NO WRITE BATCH to see whether import perf gets better and the index still grows.

--- a/db.hpp
+++ b/db.hpp
@@ -80,6 +80,7 @@ public:
         rocksdb::Options db_options;
         db_options.allow_mmap_writes = false;
         db_options.max_background_flushes = 4;
+        db_options.max_open_files = 4096;
         db_options.PrepareForBulkLoad();
e-n-f commented 7 years ago

What kind of Tippecanoe problems are you running into? If there's something that works really badly, I'd like a copy of the input so I can fix it.

lukasmartinelli commented 7 years ago

What kind of Tippecanoe problems are you running into? If there's something that works really badly, I'd like a copy of the input so I can fix it.

@ericfischer It is a planet GeoJSON dump of 400-500 GB containing the same features that QA tiles have as input but it also has all historic versions attached to it. @jenningsanderson knows more what the problem is.

I don't have a GeoJSON dump right now. @jenningsanderson does. I'll try to create one and throw it on S3.

"@history": [ 
  <history object version 1>,
  <history object version 2>,
  ...
  <current version of object>
 ]
e-n-f commented 7 years ago

Thanks @lukasmartinelli. Something that size will be difficult to handle well but should be a great test case.

jenningsanderson commented 7 years ago

:wave: @ericfischer @lukasmartinelli, The tileset finished, took ~50 hours with an EC2 c4.4xlarge (16 core, CPU optimized).

The tileset was rendered with:

tippecanoe -PF -pk -pf -Z15 -z15 -B17 -d17 -b0 -l history -o history.z15.mbtiles -t /data3/tmp planet.history.geojsonl

^ Whoops, I just saw that -B17, that should be -B15, but I don't think it matters if it's higher than -z?

Using the schema here: https://github.com/mapbox/osm-tag-history#historical-feature-schema-for-tags, the data in this tileset looks like:

[screenshot: feature data from the rendered history tileset]

Compare to history here: http://www.openstreetmap.org/api/0.6/way/48330746/history Note that version 3 added a node and version 4 added two nodes.

I'm currently uploading the input geojson file to S3, will share when available

lukasmartinelli commented 7 years ago

Trying this now with a set limit and NO WRITE BATCH to see whether import perf gets better and the index still grows.

Without the write batch it imported on an i2.xlarge within 7hrs with no drops; it just requires a high ulimit. Trying with the write batch to see whether it slows down a lot.

lukasmartinelli commented 7 years ago

Without the write batch it imported on an i2.xlarge within 7hrs with no drops; it just requires a high ulimit. Trying with the write batch to see whether it slows down a lot.

It's just as fast. No big difference: 6 hours 25 minutes 59.2 seconds.
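
For reference, the two write paths being compared are roughly the following (a sketch, not the actual import loop in osm-wayback):

#include <rocksdb/db.h>
#include <rocksdb/write_batch.h>
#include <string>
#include <utility>
#include <vector>

// Path 1 ("no write batch"): one Put per object version.
void write_direct(rocksdb::DB* db, const std::string& key, const std::string& value) {
    db->Put(rocksdb::WriteOptions(), key, value);
}

// Path 2 ("write batch"): accumulate a group of versions and commit them together.
void write_batched(rocksdb::DB* db, const std::vector<std::pair<std::string, std::string>>& kvs) {
    rocksdb::WriteBatch batch;
    for (const auto& kv : kvs) {
        batch.Put(kv.first, kv.second);
    }
    db->Write(rocksdb::WriteOptions(), &batch);
}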

jenningsanderson commented 7 years ago

:wave: @lukasmartinelli - I'm working on a full version of North America for SOTMUS. An easy win with this issue seems to be increasing the size of the .sst files; 64MB seems very low. Regardless of the settings I change in db_options for target_file_size_base, the SST files are still ~64MB.
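
One hedged guess at why target_file_size_base has no visible effect: with PrepareForBulkLoad() auto-compaction is essentially off, so the .sst files on disk are flush output, and flush output size tracks the memtable size (write_buffer_size, which defaults to 64MB), while target_file_size_base only shapes compaction output. If that's right, the knob to turn would be something like:

#include <rocksdb/options.h>

// A guess, not a verified fix: grow the memtables so flushes produce larger
// .sst files; target_file_size_base never applies while compaction is disabled.
rocksdb::Options make_big_sst_options() {
    rocksdb::Options db_options;
    db_options.PrepareForBulkLoad();                      // as in db.hpp: no auto-compaction during import
    db_options.write_buffer_size = 256ull * 1024 * 1024;  // ~256MB memtables -> ~256MB flushed .sst files
    return db_options;
}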

I'm currently running on a machine with a hard ulimit of 8192, so I don't have the luxury of just opening more files :/

Excited to get back to this project after SOTMUS...