missinglink opened this issue 1 year ago
Digging a little deeper I see that the IDs are stored in an osmium::index::IdSetDense.
The implementation uses a bitmap under the hood, so it seems I assumed wrong there.
I'm not super familiar with it, but Roaring might provide an improvement; I believe it uses three container strategies depending on the total number of members in the set. The gain might be negligible, though, while also introducing a dependency.
I'm not a C++ programmer, but it seems that this line allocates a lot more memory than it really needs.
I might be reading this wrong, but it seems that the size of m_data is such that it can always fit the highest chunk_id, which explains the documentation:
For the simple strategy it will at least be the number of extracts times the highest node ID used divided by 8
The issue with using a vector rather than a map is that there are potentially large sections of the vector which are empty; is this memory actually being consumed?
I've generated a custom PBF file which demonstrates the issue: osmium-oom.pbf (1.1KB)
When I run the following command all the RAM on my system is consumed:
osmium extract -s 'smart' -b '0,0,2,2' -o out.pbf --overwrite osmium-oom.pbf
The input file contains 73 elements, all of which have IDs below the maximum IDs currently in use in OSM.
Are you just playing around with this and had an idea for an improvement or do you actually have some problem using it with OSM data?
It seems the bitmask implementation in osmium::index::IdSetDense consumes a lot more memory than required, particularly for sparse sets.
I wanted to open a ticket to discuss before implementing anything, to see if it was worth putting in effort to implement a different Bitmask algorithm, and whether you would consider merging it.
It's hard to do a definitive comparison on implementation because they all perform differently in terms of memory consumption and CPU utilization depending on how sparse/dense the set is.
The main thing I would consider changing is that the bitmask size correlates with the highest ID in the set rather than with the population of the set.
Is there any reason why you're using a Vector instead of a Hashmap?
I assume this means "just playing around". Feel free to do that and implement something better. The reason the code is the way it is, is because it is simple and because it works for the stuff that people generally do. I am fully aware that it is not perfect and that it will not work for all use cases, but it is hard to write something that works in all cases, uses a minimum amount of memory, is performant etc.
If you come up with something better which is supported by tests, sure I'll consider it. If you want to introduce a new dependency, like RoaringBitmap, there is a good chance that I'll reject it, though. Adding dependencies comes with a lot of cost for users of the library, so it better be worth it. As a minimum it must be available for all platforms we support, which also means somewhat older Ubuntu versions etc. (If it can be optional that would be a possibility also.)
Is there any reason why you're using a Vector instead of a Hashmap?
I assume you are talking about IdSetSmall? Because a sorted vector needs less memory than a hash, and if the set is "small" the access time for a vector might even be better; it is certainly faster to create. (This is especially true for std::unordered_map/set, which is rather slow compared to, for instance, the Abseil hash map; but see "dependencies" above.)
..it is hard to write something that works in all cases, uses a minimum amount of memory, is performant etc.
Totally agree; this is why I would advocate using Roaring, as it's well-studied and battle-tested, rather than hacking something together myself.
Adding dependencies comes with a lot of cost for users of the library
The CRoaring library offers an 'amalgamation' build, would you consider including this in libosmium if it was proven to have significant benefits over the existing bitmap implementation?
My C++ is terrible, but I managed to butcher together a branch for testing here: https://github.com/osmcode/osmium-tool/compare/master...missinglink:osmium-tool:roaring?expand=1
The comment underneath, in the details fold, is no longer relevant for the problem at hand…
I've done some basic benchmarking of master / roaring / IdSetHybrid which provides some insights:
- The IdSetDense implementation in master is optimized for speed but not for memory consumption.
- The roaring implementation uses 20% of the memory consumed by IdSetDense; IdSetHybrid uses ~30%.
- The IdSetHybrid implementation is on par with IdSetDense in terms of speed; roaring is ~20% slower, even with a write buffer.

The reduced memory requirement was very useful for me when cutting the planet file down to smaller extracts, as per the Geofabrik downloads. Previously the machine would go to swap with 128GB RAM; it now uses ~20GB.
I think it would be nice to have a default bitmask implementation for osmium extract which allowed it to run on modest hardware; the IdSetHybrid implementation provides a ~70% reduction in memory usage with no performance penalty.
When I run the following command all the RAM on my system is consumed:
osmium extract -s 'smart' -b '0,0,2,2' -o out.pbf --overwrite osmium-oom.pbf
The reason why your example file takes so much memory is the single node with a negative node id:
<node id="-1" version="1" changeset="1" lat="1" lon="1"/>
If you change that node from e.g. -1 to 2 in the file, the memory consumption is insignificant. IdSetDense works with unsigned integers, so -1 will end up as a fairly large positive value (0xFFFFFFFFFFFFFFFF).
I'd say, this issue is probably a duplicate of #234, or at least a related issue. Also the question comes up if this is a realistic test case in the first place. Planet files don't contain any negative object ids.
Hi
I've read the interesting discussion here.
I have one case where I am investigating osmium-tool extract performance. I am using the https://github.com/treee111/wahooMapsCreator tool to create the files used for navigation on Wahoo bicycle computers, using the latest OpenStreetMap map data.
To generate the files for Norway, it takes about 10200 seconds, and the osmium extract time is about 7800 of these seconds. The tool breaks Norway up into 1176 tiles.
My initial investigation showed that it looked promising to make one JSON input file for osmium-tool extract and have it create the 1176 output files, instead of invoking osmium-tool extract 1176 times, once for each tile, which is what the wahooMapsCreator tool is currently doing. Any thoughts on this?
My initial test, based on 30 tiles, showed that I could decrease the time used by 70% by invoking osmium-tool once instead of 30 times. But then I ran into memory issues when using more input, despite having 64GB RAM on my home computer, and also when testing on a cloud VM with 128GB RAM.
So I looked at this discussion and made some tests. I took the https://github.com/missinglink/osmium-tool fork, the roaring branch, applied the changes for using CRoaring to the latest osmium-tool master branch, and also updated CRoaring to the latest version. I also took away the newly added check that no more than 500 extracts were present, since I just upped the ulimit to 2048 on Linux.
CRoaring also looks interesting since they are working on adding AVX2 and AVX512 support, but I am not sure if that would eventually help the performance, https://github.com/RoaringBitmap/CRoaring/issues/454.
I observe similar things to @missinglink, which is that using CRoaring really reduces the memory usage sharply, but the performance also goes down. But I see missinglink mentions an "IdSetHybrid" implementation. Could you provide more details there, @missinglink? Perhaps push the branch to your GitHub fork?
It would be really interesting to test how the memory usage / performance would be for my case using that hybrid approach. I think I could save 1 of the 2 hours used for the extraction for Norway with such a change.
I also notice that the CPU usage is around 100% on a 16-core machine, i.e. only one core is really used. Is that expected?
The memory usage using CRoaring for Norway was around 6GB as shown by htop (but virtual memory usage was 86GB). When using the normal osmium-tool, it ran out of memory as soon as phase 2 started (the machine has 128GB RAM).
Here is the "/usr/bin/time -v" output on linux for my run for the "filtered_names" input file.
[ 0:00] Additional strategy options:
[ 0:00] - [types] relation types: multipolygon
[ 0:00] - [complete-partial-relations] do not complete partial relations
[ 0:00] - [tags] no tags defined
[ 0:00]
[ 0:00] Running 'smart' strategy in three passes...
[ 0:00] First pass (of three)...
[14:13] Second pass (of three)...
[15:03] Third pass (of three)...
[======================================================================] 100%
[38:50] Peak memory used: 86331 MBytes
[38:50] Done.
Command being timed: "/home/alf/devel/github/osmium-tool-alf/build/osmium extract -v -c /home/alf/wahoo/google_test/norway_split.json /opt/data/alf/wahooMapsDisk/_tiles/norway/filtered_names.o5m.pbf -s smart --overwrite"
User time (seconds): 2387.40
System time (seconds): 4.80
Percent of CPU this job got: 102%
Elapsed (wall clock) time (h:mm:ss or m:ss): 38:50.55
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4023352
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 60
Minor (reclaiming a frame) page faults: 794752
Voluntary context switches: 333693
Involuntary context switches: 11170
Swaps: 0
File system inputs: 3938680
File system outputs: 1500920
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
And here for the filtered input file, with example of how the size of the tiles are.
[ 0:00] [1174] Output: /tmp/alf/norway/_tiles/153/29/split.osm.pbf
[ 0:00] Format: PBF
[ 0:00] Description:
[ 0:00] Envelope: (35.15625,79.6871842,36.5625,79.9359182)
[ 0:00] Type: bbox
[ 0:00] Geometry: BOX(35.15625 79.6871842,36.5625 79.9359182)
[ 0:00] [1175] Output: /tmp/alf/norway/_tiles/153/30/split.osm.pbf
[ 0:00] Format: PBF
[ 0:00] Description:
[ 0:00] Envelope: (35.15625,79.4323708,36.5625,79.6871842)
[ 0:00] Type: bbox
[ 0:00] Geometry: BOX(35.15625 79.4323708,36.5625 79.6871842)
[ 0:00] [1176] Output: /tmp/alf/norway/_tiles/153/31/split.osm.pbf
[ 0:00] Format: PBF
[ 0:00] Description:
[ 0:00] Envelope: (35.15625,79.1713346,36.5625,79.4323708)
[ 0:00] Type: bbox
[ 0:00] Geometry: BOX(35.15625 79.1713346,36.5625 79.4323708)
[ 0:00]
[ 0:00] Additional strategy options:
[ 0:00] - [types] relation types: multipolygon
[ 0:00] - [complete-partial-relations] do not complete partial relations
[ 0:00] - [tags] no tags defined
[ 0:00]
[ 0:00] Running 'smart' strategy in three passes...
[ 0:00] First pass (of three)...
[ 8:03] Second pass (of three)...
[ 8:29] Third pass (of three)...
[======================================================================] 100%
[20:04] Peak memory used: 85388 MBytes
[20:04] Done.
Command being timed: "/home/alf/devel/github/osmium-tool-alf/build/osmium extract -v -c /home/alf/wahoo/google_test/norway.json /opt/data/alf/wahooMapsDisk/_tiles/norway/filtered.o5m.pbf -s smart --overwrite"
User time (seconds): 1237.67
System time (seconds): 3.28
Percent of CPU this job got: 103%
Elapsed (wall clock) time (h:mm:ss or m:ss): 20:04.70
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3007120
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 554233
Voluntary context switches: 182617
Involuntary context switches: 5775
Swaps: 0
File system inputs: 2435536
File system outputs: 859352
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
File sizes for Norway, although I appreciate that it is really the highest node ID value that influences the maximum memory requirement.
filtered.o5m.pbf
filtered_names.o5m.pbf
So the total time is about 3500 seconds, which is a lot better than the 7700 seconds (albeit on a different machine, though the CPUs are quite similar).
I am also doing testing on Germany and France, which are both broken up into fewer than 100 tiles. But there are still memory issues there. I will come back with some results, just taking 50 tiles, which I think will work for both the standard osmium-tool and the CRoaring variant. I also have a more extreme case, Canada, with over 4500 tiles, that I will look into.
I then tried with another cloud machine, with 176 GB RAM. It ran out of memory just after starting phase 2 for Norway.
For Germany, I could complete one test with standard osmium and one with the CRoaring implementation. Standard osmium:
[ 0:00] [58] Output: /tmp/alf/germany/_tiles/138/84/split.osm.pbf
[ 0:00] Format: PBF
[ 0:00] Description:
[ 0:00] Envelope: (14.0625,51.6180165,15.46875,52.4827802)
[ 0:00] Type: bbox
[ 0:00] Geometry: BOX(14.0625 51.6180165,15.46875 52.4827802)
[ 0:00] [59] Output: /tmp/alf/germany/_tiles/138/85/split.osm.pbf
[ 0:00] Format: PBF
[ 0:00] Description:
[ 0:00] Envelope: (14.0625,50.7364551,15.46875,51.6180165)
[ 0:00] Type: bbox
[ 0:00] Geometry: BOX(14.0625 50.7364551,15.46875 51.6180165)
[ 0:00]
[ 0:00] Additional strategy options:
[ 0:00] - [types] relation types: multipolygon
[ 0:00] - [complete-partial-relations] do not complete partial relations
[ 0:00] - [tags] no tags defined
[ 0:00]
[ 0:01] Running 'smart' strategy in three passes...
[ 0:01] First pass (of three)...
[ 2:27] Second pass (of three)...
[ 3:04] Third pass (of three)...
[======================================================================] 100%
[ 4:16] Peak memory used: 169865 MBytes
[ 4:16] Done.
Command being timed: "/home/alf/devel/github/osmium-tool/build/osmium extract -v -c /home/alf/wahoo/google_test/germany.json /opt/data/alf/wahooMapsDisk/_tiles/germany/filtered.o5m.pbf -s smart --overwrite"
User time (seconds): 385.59
System time (seconds): 53.86
Percent of CPU this job got: 169%
Elapsed (wall clock) time (h:mm:ss or m:ss): 4:18.61
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 168580104
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 21447621
Voluntary context switches: 189366
Involuntary context switches: 2665
Swaps: 0
File system inputs: 9972072
File system outputs: 3298808
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Osmium modified to use CRoaring
[ 0:00] Additional strategy options:
[ 0:00] - [types] relation types: multipolygon
[ 0:00] - [complete-partial-relations] do not complete partial relations
[ 0:00] - [tags] no tags defined
[ 0:00]
[ 0:00] Running 'smart' strategy in three passes...
[ 0:00] First pass (of three)...
[12:38] Second pass (of three)...
[13:57] Third pass (of three)...
[======================================================================] 100%
[23:19] Peak memory used: 10100 MBytes
[23:19] Done.
Command being timed: "/home/alf/devel/github/osmium-tool-alf/build/osmium extract -v -c /home/alf/wahoo/google_test/germany.json /opt/data/alf/wahooMapsDisk/_tiles/germany/filtered.o5m.pbf -s smart --overwrite"
User time (seconds): 1575.75
System time (seconds): 6.33
Percent of CPU this job got: 112%
Elapsed (wall clock) time (h:mm:ss or m:ss): 23:20.08
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3234472
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 29
Minor (reclaiming a frame) page faults: 722442
Voluntary context switches: 298579
Involuntary context switches: 7287
Swaps: 0
File system inputs: 9977016
File system outputs: 3304032
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
So for this input file, the CRoaring implementation was actually 6 times slower, but used 95% less memory.
For the Germany filtered_names.o5m.pbf input, I saw this using standard osmium:
[ 0:00] Additional strategy options:
[ 0:00] - [types] relation types: multipolygon
[ 0:00] - [complete-partial-relations] do not complete partial relations
[ 0:00] - [tags] no tags defined
[ 0:00]
[ 0:00] Running 'smart' strategy in three passes...
[ 0:00] First pass (of three)...
[ 0:49] Second pass (of three)...
[ 1:15] Third pass (of three)...
[======================================================================] 100%
[ 1:34] Peak memory used: 170016 MBytes
[ 1:34] Done.
Command being timed: "/home/alf/devel/github/osmium-tool/build/osmium extract -v -c /home/alf/wahoo/google_test/germany_split.json /opt/data/alf/wahooMapsDisk/_tiles/germany/filtered_names.o5m.pbf -s smart --overwrite"
User time (seconds): 81.47
System time (seconds): 50.15
Percent of CPU this job got: 135%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:37.19
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 168493376
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 21182108
Voluntary context switches: 50747
Involuntary context switches: 1024
Swaps: 0
File system inputs: 1726912
File system outputs: 722312
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
And using CRoaring:
[ 0:00] Running 'smart' strategy in three passes...
[ 0:00] First pass (of three)...
[ 1:25] Second pass (of three)...
[ 1:34] Third pass (of three)...
[======================================================================] 100%
[ 3:28] Peak memory used: 9538 MBytes
[ 3:28] Done.
Command being timed: "/home/alf/devel/github/osmium-tool-alf/build/osmium extract -v -c /home/alf/wahoo/google_test/germany_split.json /opt/data/alf/wahooMapsDisk/_tiles/germany/filtered_names.o5m.pbf -s smart --overwrite"
User time (seconds): 241.17
System time (seconds): 1.98
Percent of CPU this job got: 116%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:28.71
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1974180
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 397267
Voluntary context switches: 61810
Involuntary context switches: 1301
Swaps: 0
File system inputs: 1726824
File system outputs: 723240
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
So this was a lot quicker than when using the filtered.o5m.pbf input file. For this case, CRoaring was actually faster in phase 2; it was really phase 3 that was a lot slower for CRoaring.
File size for Germany
1701943059 Oct 5 16:00 filtered.o5m.pbf
294752699 Oct 5 16:01 filtered_names.o5m.pbf
Version info
/home/alf/devel/github/osmium-tool/build/osmium --version
osmium version 1.16.0 (v1.16.0)
libosmium version 2.19.0
Supported PBF compression types: none zlib lz4
Overall, I am surprised that the CPU usage is not higher; it seems like only 1-2 CPU cores are being used.
Could you provide more details there, @missinglink ?
GitHub seems to be hiding that info in this comment under an arrow https://github.com/osmcode/osmium-tool/issues/258#issuecomment-1356450059
@missinglink I saw the performance numbers in your comment, but I wonder where the source code for IdSetHybrid is; I would like to test it on my test cases.
It's linked in that comment
I tried today with the IdSetHybrid, taken from https://github.com/mmd-osm/Overpass-API/blob/test7591/src/overpass_api/data/abstract_processing.h#L38-L199. I actually used up to line 226.
But for the relations, the IdSetHybrid seems to be missing begin and end methods, since I get compilation errors:
/home/alf/devel/github/osmium-tool-hybrid/src/extract/strategy_smart.cpp: In member function ‘virtual void strategy_smart::Strategy::run(osmium::util::VerboseOutput&, bool, const osmium::io::File&)’:
/home/alf/devel/github/osmium-tool-hybrid/src/extract/strategy_smart.cpp:310:63: error: ‘begin’ was not declared in this scope
310 | for (const osmium::unsigned_object_id_type id : e.relation_ids) {
and
/home/alf/devel/github/osmium-tool-hybrid/src/extract/strategy_smart.cpp:310:63: error: ‘end’ was not declared in this scope
310 | for (const osmium::unsigned_object_id_type id : e.relation_ids) {
I then tried using the IdSetHybrid for everything except the relations, where I used the BufferedBitmask. The result was actually even less memory used than when only using the BufferedBitmask, but the performance was slightly worse.
I see that CRoaring has the iterator methods that I assume are needed:
inline RoaringSetBitForwardIterator Roaring::begin() const {
inline RoaringSetBitForwardIterator &Roaring::end() const {
inline Roaring64MapSetBitForwardIterator Roaring64Map::begin() const {
inline Roaring64MapSetBitForwardIterator Roaring64Map::end() const {
but the IdSetHybrid seems to be missing this.
@missinglink Did you have any issues with this? Did you test using the "smart" strategy?
Sorry, it was quite a while ago; I checked my old laptop but I don't have the code any longer.
I don't recall having any issues getting the IdSetHybrid code to compile, and sorry, I don't recall whether I tested the smart strategy.
but the IdSetHybrid seems to be missing this.
Yes, that's intentional; I didn't have any use for it, which is why I didn't implement the iterator in IdSetHybrid. You could use the one from IdSetDense as a starting point. It is likely a non-trivial effort to get this to work.
Thanks for the feedback. I am looking at the CRoaring approach now and not at the IdSetHybrid anymore, since CRoaring seems to have good potential; I seem to be able to speed up phase 3 by using some CRoaring functionality. I will also look into a couple of other ideas I have to speed up the performance.
Also, since the extract command does not seem to be very multithreaded, I see that I can run many extract commands in parallel, so it seems like the Canada case could be done in a couple of hours (with 160GB RAM and 20 vCPUs) instead of a couple of days, if I use about 800 extracts per config file instead of invoking the extract command for each of the 4500 tiles.
For me, CRoaring would have introduced too much of a dependency. All I needed was a mix of IdSetSmall and IdSetDense, which switches from an array to a bitmap container when hitting a break-even point. CRoaring does similar things (see https://arxiv.org/pdf/1603.06549.pdf, page 6), yet adds further performance improvements like run-length encoding and processor-specific optimizations. Also, I only needed to add values to IdSetHybrid and check for their existence later on.
What version of osmium-tool are you using?
What operating system version are you using?
Tell us something about your system
64GB RAM
What did you do exactly?
Cutting many extracts uses a large amount of memory (as documented).
What did you expect to happen?
I expected the internal memory requirement to be lower, assuming that a bitmap structure was being used to track dependent OSM elements.
What did happen instead?
It seems that element IDs are being stored as integers?
What did you do to try analyzing the problem?
I would just like to enquire if something like https://roaringbitmap.org/ could be used to keep track of which elements are required in each extract, in order to greatly reduce the memory requirements?