The second option for the first phase (partitioning the keyspace) would have the benefit of being able to farm out the work across a cluster of volunteer systems, which is a long-term goal.
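A minimal sketch of what partitioning the keyspace by `id%n` could look like, so each shard maps to its own leveldb database or a work unit handed to a volunteer node. The `ShardedBatch` type and its methods are hypothetical, not part of the actual peermaps code:

```rust
// Hypothetical sketch: fan records out to n shards keyed by id % n, so each
// shard can be written to a separate leveldb database (or farmed out to a
// volunteer system) independently.
struct ShardedBatch {
    n: u64,
    shards: Vec<Vec<(u64, Vec<u8>)>>,
}

impl ShardedBatch {
    fn new(n: u64) -> Self {
        Self {
            n,
            shards: (0..n).map(|_| Vec::new()).collect(),
        }
    }

    // Route a record to its shard by id modulo the shard count.
    fn insert(&mut self, id: u64, value: Vec<u8>) {
        let shard = (id % self.n) as usize;
        self.shards[shard].push((id, value));
    }
}
```

Because shards never share keys, each one can be flushed, transferred, or rebuilt without coordinating with the others.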
The new scanning approach makes much better use of multiple cores where it can, and the osm decoding happens in parallel at pre-calculated offsets.
The memory footprint is stable and low, but the ingest only uses about 2 of the 10 cores on the vps during the first phase. It could be that this is as fast as a single leveldb will go. I don't think the machine is reaching IO saturation, since after 23 hours the leveldb dir is only 111GB.
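The parallel-decoding-at-offsets idea can be sketched with scoped threads: given pre-calculated byte offsets into the file, each worker decodes the span between two consecutive offsets independently. `decode_block` here is a hypothetical stand-in for the real osm block decoder:

```rust
use std::thread;

// Hypothetical stand-in for decoding one osm block between two byte offsets.
fn decode_block(data: &[u8], start: usize, end: usize) -> usize {
    data[start..end].iter().map(|b| *b as usize).sum()
}

// Decode the spans between consecutive pre-calculated offsets in parallel,
// one scoped thread per span.
fn parallel_decode(data: &[u8], offsets: &[usize]) -> Vec<usize> {
    thread::scope(|s| {
        let handles: Vec<_> = offsets
            .windows(2)
            .map(|w| s.spawn(move || decode_block(data, w[0], w[1])))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

The key property is that the offsets are known up front, so no worker has to scan past another worker's region to find its starting point.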
Some options to explore for the first phase:
- `flush()` could spawn a new background thread and continue gathering more records for the next batch without waiting for the flush to finish, only obtaining a mutex lock when it needs to `flush()` again.
- partition the keyspace (by `id%n` for example) to write out to multiple leveldb databases

The second phase could use some of the same tricks, and there are many places in eyros where async operations happen serially instead of in parallel.
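The overlapping-flush idea above can be sketched as follows. The writer keeps batching while a background thread performs the flush; a mutex ensures only one flush is in flight at a time, so the writer only blocks when it tries to flush again before the previous flush has finished. The `Writer` type and the flush body are hypothetical stand-ins for the real leveldb batch write:

```rust
use std::mem;
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical sketch: batch writes in memory, flush in a background thread.
struct Writer {
    flush_lock: Arc<Mutex<()>>,          // serializes flushes
    batch: Vec<(u64, Vec<u8>)>,          // records accumulated since last flush
    flushed: Arc<Mutex<usize>>,          // stand-in for the real leveldb write
}

impl Writer {
    fn new() -> Self {
        Self {
            flush_lock: Arc::new(Mutex::new(())),
            batch: Vec::new(),
            flushed: Arc::new(Mutex::new(0)),
        }
    }

    fn write(&mut self, id: u64, value: Vec<u8>) {
        self.batch.push((id, value));
        if self.batch.len() >= 1000 {
            self.flush(); // handle dropped: the flush proceeds in the background
        }
    }

    // Hand the current batch to a background thread and keep going; the
    // caller only blocks inside the spawned thread if a previous flush is
    // still holding the lock.
    fn flush(&mut self) -> thread::JoinHandle<()> {
        let batch = mem::take(&mut self.batch);
        let lock = Arc::clone(&self.flush_lock);
        let flushed = Arc::clone(&self.flushed);
        thread::spawn(move || {
            let _guard = lock.lock().unwrap(); // only one flush at a time
            *flushed.lock().unwrap() += batch.len();
        })
    }
}
```

This keeps the ingest thread producing records while IO happens in the background, at the cost of holding one extra batch in memory during each flush.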
The osm2pgsql page states that it can process planet-osm in about half a day, so we have some room for improvement, although the peermaps ingest already uses far less memory (osm2pgsql requires a minimum of 64GB of ram).