missinglink opened this issue 1 year ago
Digging a little deeper I see that the IDs are stored in an osmium::index::IdSetDense.
The implementation uses a bitmap under the hood, so it seems I assumed wrong there.
I'm not super familiar with it, but Roaring might provide an improvement; I believe it uses three container strategies depending on the total number of members in the set. The gain might be negligible, though, while also introducing a dependency.
I'm not a C++ programmer, but it seems that this line allocates a lot more memory than it really needs.
I might be reading this wrong, but it seems that the size of m_data is such that it can always fit the highest chunk_id, which explains the documentation:
For the simple strategy it will at least be the number of extracts times the highest node ID used divided by 8
The issue with using a vector rather than a map is that there are potentially large sections of the vector which are empty; is this memory actually being consumed?
I've generated a custom PBF file which demonstrates the issue: osmium-oom.pbf (1.1KB)
When I run the following command all the RAM on my system is consumed:
osmium extract -s 'smart' -b '0,0,2,2' -o out.pbf --overwrite osmium-oom.pbf
The input file contains 73 elements, all of which have IDs below the maximum IDs currently in use in OSM.
Are you just playing around with this and had an idea for an improvement or do you actually have some problem using it with OSM data?
It seems the bitmask implementation in osmium::index::IdSetDense consumes a lot more memory than required, particularly for sparse sets.
I wanted to open a ticket to discuss before implementing anything, to see if it was worth putting in effort to implement a different Bitmask algorithm, and whether you would consider merging it.
It's hard to do a definitive comparison on implementation because they all perform differently in terms of memory consumption and CPU utilization depending on how sparse/dense the set is.
The main thing I would consider changing is that the bitmask size correlates with the highest ID in the set rather than with the population of the set.
Is there any reason why you're using a Vector instead of a Hashmap?
I assume this means "just playing around". Feel free to do that and implement something better. The reason the code is the way it is, is because it is simple and because it works for the stuff that people generally do. I am fully aware that it is not perfect and that it will not work for all use cases, but it is hard to write something that works in all cases, uses a minimum amount of memory, is performant etc.
If you come up with something better which is supported by tests, sure I'll consider it. If you want to introduce a new dependency, like RoaringBitmap, there is a good chance that I'll reject it, though. Adding dependencies comes with a lot of cost for users of the library, so it better be worth it. As a minimum it must be available for all platforms we support, which also means somewhat older Ubuntu versions etc. (If it can be optional that would be a possibility also.)
Is there any reason why you're using a Vector instead of a Hashmap?
I assume you are talking about IdSetSmall? Because a sorted vector needs less memory than a hash, and if the set is "small" the access time for a vector might even be better; it is certainly faster to create. (This is especially true for std::unordered_map/set, which is rather slow compared to, for instance, the Abseil hash map; but see "dependencies" above.)
..it is hard to write something that works in all cases, uses a minimum amount of memory, is performant etc.
Totally agree; this is why I would advocate using Roaring, as it's well-studied and battle-tested, rather than hacking something together myself.
Adding dependencies comes with a lot of cost for users of the library
The CRoaring library offers an 'amalgamation' build, would you consider including this in libosmium if it was proven to have significant benefits over the existing bitmap implementation?
My C++ is terrible, but I managed to butcher together a branch for testing here: https://github.com/osmcode/osmium-tool/compare/master...missinglink:osmium-tool:roaring?expand=1
The comment underneath, in the details fold, is no longer relevant for the problem at hand…
I've done some basic benchmarking of master / roaring / IdSetHybrid which provides some insights:
- The IdSetDense implementation in master is optimized for speed but not for memory consumption.
- The roaring implementation uses 20% of the memory consumed by IdSetDense; IdSetHybrid uses ~30%.
- The IdSetHybrid implementation is on par with IdSetDense in terms of speed; roaring is ~20% slower, even with a write buffer.

The reduced memory requirement was very useful for me when cutting the planet file down to smaller extracts, as per the Geofabrik downloads. Previously the machine would go to swap with 128GB RAM; it now uses ~20GB.
I think it would be nice to have a default bitmask implementation for osmium extract which allowed it to run on modest hardware; the IdSetHybrid implementation provides a ~70% reduction in memory usage with no performance penalty.
When I run the following command all the RAM on my system is consumed:
osmium extract -s 'smart' -b '0,0,2,2' -o out.pbf --overwrite osmium-oom.pbf
The reason why your example file takes so much memory is the single node with a negative node id:
<node id="-1" version="1" changeset="1" lat="1" lon="1"/>
If you change that node from e.g. -1 to 2 in the file, the memory consumption is insignificant. IdSetDense works with unsigned integers, so -1 will end up as a fairly large positive value (0xFFFFFFFFFFFFFFFF).
I'd say, this issue is probably a duplicate of #234, or at least a related issue. Also the question comes up if this is a realistic test case in the first place. Planet files don't contain any negative object ids.
Hi
I've read the interesting discussion here.
I have one case where I am investigating osmium-tool extract performance. I am using the https://github.com/treee111/wahooMapsCreator tool to create the files used for navigation on Wahoo bicycle computers, using the latest OpenStreetMap map data.
To generate the files for Norway, it takes about 10200 seconds, and the osmium extract time is about 7800 of these seconds. The tool breaks Norway up into 1176 tiles.
My initial investigation showed that it looked promising to make one JSON input file for osmium-tool extract and have it create the 1176 output files, instead of invoking osmium-tool extract 1176 times, once for each tile, which is what the wahooMapsCreator tool is currently doing. Any thoughts on this?
My initial test, based on 30 tiles, showed that I could decrease the time used by 70% by invoking osmium-tool once instead of 30 times. But then I ran into memory issues when using more input, despite having 64GB RAM on my home computer, and also when testing on a cloud VM with 128GB RAM.
So I looked at this discussion and made some tests. I took the https://github.com/missinglink/osmium-tool fork, the roaring branch, applied the changes for using CRoaring to the latest osmium-tool master branch, and also updated CRoaring to the latest version. I also took away the newly added check that no more than 500 extracts were present, since I just upped the ulimit to 2048 on Linux.
CRoaring also looks interesting since they are working on adding AVX2 and AVX512 support, but I am not sure if that would eventually help the performance, https://github.com/RoaringBitmap/CRoaring/issues/454.
I observe similar things to @missinglink, which is that using CRoaring really reduces the memory usage sharply, but the performance also goes down. But I see missinglink mentions an "IdSetHybrid" implementation. Could you provide more details there, @missinglink? Perhaps push the branch to your GitHub fork?
It would be really interesting to test how the memory usage / performance would be for my case using that hybrid approach. I think I could save 1 of the 2 hours used for the extraction for Norway with such a change.
I also notice that the CPU usage is around 100% on a 16-core machine, i.e. only one core is really used. Is that expected?
The memory usage using CRoaring for Norway was around 6GB as shown by htop (but virtual memory usage was 86GB). When using the normal osmium-tool, it ran out of memory as soon as phase 2 started (the machine has 128GB RAM).
Here is the "/usr/bin/time -v" output on linux for my run for the "filtered_names" input file.
[ 0:00] Additional strategy options:
[ 0:00] - [types] relation types: multipolygon
[ 0:00] - [complete-partial-relations] do not complete partial relations
[ 0:00] - [tags] no tags defined
[ 0:00]
[ 0:00] Running 'smart' strategy in three passes...
[ 0:00] First pass (of three)...
[14:13] Second pass (of three)...
[15:03] Third pass (of three)...
[======================================================================] 100%
[38:50] Peak memory used: 86331 MBytes
[38:50] Done.
Command being timed: "/home/alf/devel/github/osmium-tool-alf/build/osmium extract -v -c /home/alf/wahoo/google_test/norway_split.json /opt/data/alf/wahooMapsDisk/_tiles/norway/filtered_names.o5m.pbf -s smart --overwrite"
User time (seconds): 2387.40
System time (seconds): 4.80
Percent of CPU this job got: 102%
Elapsed (wall clock) time (h:mm:ss or m:ss): 38:50.55
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 4023352
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 60
Minor (reclaiming a frame) page faults: 794752
Voluntary context switches: 333693
Involuntary context switches: 11170
Swaps: 0
File system inputs: 3938680
File system outputs: 1500920
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
And here for the filtered input file, with example of how the size of the tiles are.
[ 0:00] [1174] Output: /tmp/alf/norway/_tiles/153/29/split.osm.pbf
[ 0:00] Format: PBF
[ 0:00] Description:
[ 0:00] Envelope: (35.15625,79.6871842,36.5625,79.9359182)
[ 0:00] Type: bbox
[ 0:00] Geometry: BOX(35.15625 79.6871842,36.5625 79.9359182)
[ 0:00] [1175] Output: /tmp/alf/norway/_tiles/153/30/split.osm.pbf
[ 0:00] Format: PBF
[ 0:00] Description:
[ 0:00] Envelope: (35.15625,79.4323708,36.5625,79.6871842)
[ 0:00] Type: bbox
[ 0:00] Geometry: BOX(35.15625 79.4323708,36.5625 79.6871842)
[ 0:00] [1176] Output: /tmp/alf/norway/_tiles/153/31/split.osm.pbf
[ 0:00] Format: PBF
[ 0:00] Description:
[ 0:00] Envelope: (35.15625,79.1713346,36.5625,79.4323708)
[ 0:00] Type: bbox
[ 0:00] Geometry: BOX(35.15625 79.1713346,36.5625 79.4323708)
[ 0:00]
[ 0:00] Additional strategy options:
[ 0:00] - [types] relation types: multipolygon
[ 0:00] - [complete-partial-relations] do not complete partial relations
[ 0:00] - [tags] no tags defined
[ 0:00]
[ 0:00] Running 'smart' strategy in three passes...
[ 0:00] First pass (of three)...
[ 8:03] Second pass (of three)...
[ 8:29] Third pass (of three)...
[======================================================================] 100%
[20:04] Peak memory used: 85388 MBytes
[20:04] Done.
Command being timed: "/home/alf/devel/github/osmium-tool-alf/build/osmium extract -v -c /home/alf/wahoo/google_test/norway.json /opt/data/alf/wahooMapsDisk/_tiles/norway/filtered.o5m.pbf -s smart --overwrite"
User time (seconds): 1237.67
System time (seconds): 3.28
Percent of CPU this job got: 103%
Elapsed (wall clock) time (h:mm:ss or m:ss): 20:04.70
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3007120
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 554233
Voluntary context switches: 182617
Involuntary context switches: 5775
Swaps: 0
File system inputs: 2435536
File system outputs: 859352
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
File sizes for Norway, although I appreciate that it is really the highest node ID value that influences the maximum memory requirement.
filtered.o5m.pbf
filtered_names.o5m.pbf
So the total time is about 3500 seconds, which is a lot better than the 7700 seconds (albeit on a different machine, though the CPUs are quite similar).
I am also doing testing on Germany and France, which are both broken up into fewer than 100 tiles. But there are still memory issues there. I will come back with some results, just taking 50 tiles, which I think will work for both the standard osmium-tool and the CRoaring variant. I also have a more extreme case, Canada, with over 4500 tiles, that I will look into.
I then tried with another cloud machine, with 176 GB RAM. It ran out of memory just after starting phase 2 for Norway.
For Germany, I could complete one test with standard osmium and one with the CRoaring implementation. Standard osmium:
[ 0:00] [58] Output: /tmp/alf/germany/_tiles/138/84/split.osm.pbf
[ 0:00] Format: PBF
[ 0:00] Description:
[ 0:00] Envelope: (14.0625,51.6180165,15.46875,52.4827802)
[ 0:00] Type: bbox
[ 0:00] Geometry: BOX(14.0625 51.6180165,15.46875 52.4827802)
[ 0:00] [59] Output: /tmp/alf/germany/_tiles/138/85/split.osm.pbf
[ 0:00] Format: PBF
[ 0:00] Description:
[ 0:00] Envelope: (14.0625,50.7364551,15.46875,51.6180165)
[ 0:00] Type: bbox
[ 0:00] Geometry: BOX(14.0625 50.7364551,15.46875 51.6180165)
[ 0:00]
[ 0:00] Additional strategy options:
[ 0:00] - [types] relation types: multipolygon
[ 0:00] - [complete-partial-relations] do not complete partial relations
[ 0:00] - [tags] no tags defined
[ 0:00]
[ 0:01] Running 'smart' strategy in three passes...
[ 0:01] First pass (of three)...
[ 2:27] Second pass (of three)...
[ 3:04] Third pass (of three)...
[======================================================================] 100%
[ 4:16] Peak memory used: 169865 MBytes
[ 4:16] Done.
Command being timed: "/home/alf/devel/github/osmium-tool/build/osmium extract -v -c /home/alf/wahoo/google_test/germany.json /opt/data/alf/wahooMapsDisk/_tiles/germany/filtered.o5m.pbf -s smart --overwrite"
User time (seconds): 385.59
System time (seconds): 53.86
Percent of CPU this job got: 169%
Elapsed (wall clock) time (h:mm:ss or m:ss): 4:18.61
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 168580104
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 21447621
Voluntary context switches: 189366
Involuntary context switches: 2665
Swaps: 0
File system inputs: 9972072
File system outputs: 3298808
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Osmium modified to use CRoaring
[ 0:00] Additional strategy options:
[ 0:00] - [types] relation types: multipolygon
[ 0:00] - [complete-partial-relations] do not complete partial relations
[ 0:00] - [tags] no tags defined
[ 0:00]
[ 0:00] Running 'smart' strategy in three passes...
[ 0:00] First pass (of three)...
[12:38] Second pass (of three)...
[13:57] Third pass (of three)...
[======================================================================] 100%
[23:19] Peak memory used: 10100 MBytes
[23:19] Done.
Command being timed: "/home/alf/devel/github/osmium-tool-alf/build/osmium extract -v -c /home/alf/wahoo/google_test/germany.json /opt/data/alf/wahooMapsDisk/_tiles/germany/filtered.o5m.pbf -s smart --overwrite"
User time (seconds): 1575.75
System time (seconds): 6.33
Percent of CPU this job got: 112%
Elapsed (wall clock) time (h:mm:ss or m:ss): 23:20.08
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3234472
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 29
Minor (reclaiming a frame) page faults: 722442
Voluntary context switches: 298579
Involuntary context switches: 7287
Swaps: 0
File system inputs: 9977016
File system outputs: 3304032
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
So for this input file, the CRoaring implementation was actually 6 times slower, but used 95% less memory.
For the Germany filtered_names.o5m.pbf input, I saw this using standard osmium:
[ 0:00] Additional strategy options:
[ 0:00] - [types] relation types: multipolygon
[ 0:00] - [complete-partial-relations] do not complete partial relations
[ 0:00] - [tags] no tags defined
[ 0:00]
[ 0:00] Running 'smart' strategy in three passes...
[ 0:00] First pass (of three)...
[ 0:49] Second pass (of three)...
[ 1:15] Third pass (of three)...
[======================================================================] 100%
[ 1:34] Peak memory used: 170016 MBytes
[ 1:34] Done.
Command being timed: "/home/alf/devel/github/osmium-tool/build/osmium extract -v -c /home/alf/wahoo/google_test/germany_split.json /opt/data/alf/wahooMapsDisk/_tiles/germany/filtered_names.o5m.pbf -s smart --overwrite"
User time (seconds): 81.47
System time (seconds): 50.15
Percent of CPU this job got: 135%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:37.19
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 168493376
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 21182108
Voluntary context switches: 50747
Involuntary context switches: 1024
Swaps: 0
File system inputs: 1726912
File system outputs: 722312
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
And using CRoaring:
[ 0:00] Running 'smart' strategy in three passes...
[ 0:00] First pass (of three)...
[ 1:25] Second pass (of three)...
[ 1:34] Third pass (of three)...
[======================================================================] 100%
[ 3:28] Peak memory used: 9538 MBytes
[ 3:28] Done.
Command being timed: "/home/alf/devel/github/osmium-tool-alf/build/osmium extract -v -c /home/alf/wahoo/google_test/germany_split.json /opt/data/alf/wahooMapsDisk/_tiles/germany/filtered_names.o5m.pbf -s smart --overwrite"
User time (seconds): 241.17
System time (seconds): 1.98
Percent of CPU this job got: 116%
Elapsed (wall clock) time (h:mm:ss or m:ss): 3:28.71
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1974180
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 397267
Voluntary context switches: 61810
Involuntary context switches: 1301
Swaps: 0
File system inputs: 1726824
File system outputs: 723240
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
So this was a lot quicker than when using the filtered.o5m.pbf input file. For this case, CRoaring was actually faster in phase 2; it was really phase 3 that was a lot slower for CRoaring.
File size for Germany
1701943059 Oct 5 16:00 filtered.o5m.pbf
294752699 Oct 5 16:01 filtered_names.o5m.pbf
Version info
/home/alf/devel/github/osmium-tool/build/osmium --version
osmium version 1.16.0 (v1.16.0)
libosmium version 2.19.0
Supported PBF compression types: none zlib lz4
Overall, I am surprised that the CPU usage is not higher; it seems like only 1-2 CPU cores are being used.
Could you provide more details there, @missinglink ?
GitHub seems to be hiding that info in this comment under an arrow https://github.com/osmcode/osmium-tool/issues/258#issuecomment-1356450059
@missinglink I saw the performance numbers in your comment, but I wonder where the source code for IdSetHybrid is; I would like to test it on my test cases.
It's linked in that comment
I tried today with the IdSetHybrid, taken from https://github.com/mmd-osm/Overpass-API/blob/test7591/src/overpass_api/data/abstract_processing.h#L38-L199. I actually used up to line 226.
But for the relations, the IdSetHybrid seems to be missing begin and end methods, since I get compilation errors:
/home/alf/devel/github/osmium-tool-hybrid/src/extract/strategy_smart.cpp: In member function ‘virtual void strategy_smart::Strategy::run(osmium::util::VerboseOutput&, bool, const osmium::io::File&)’:
/home/alf/devel/github/osmium-tool-hybrid/src/extract/strategy_smart.cpp:310:63: error: ‘begin’ was not declared in this scope
310 | for (const osmium::unsigned_object_id_type id : e.relation_ids) {
and
/home/alf/devel/github/osmium-tool-hybrid/src/extract/strategy_smart.cpp:310:63: error: ‘end’ was not declared in this scope
310 | for (const osmium::unsigned_object_id_type id : e.relation_ids) {
I then tried using the IdSetHybrid for everything except the relations, where I used the BufferedBitmask. The result was actually even less memory used than when only using the BufferedBitmask, but the performance was slightly worse.
I see that CRoaring has the iterator methods that I assume are needed:
inline RoaringSetBitForwardIterator Roaring::begin() const {
inline RoaringSetBitForwardIterator &Roaring::end() const {
inline Roaring64MapSetBitForwardIterator Roaring64Map::begin() const {
inline Roaring64MapSetBitForwardIterator Roaring64Map::end() const {
but the IdSetHybrid seems to be missing this.
@missinglink Did you have any issues with this? Did you test using the "smart" strategy?
Sorry, it was quite a while ago; I checked my old laptop but I don't have the code any longer.
I don't recall having any issues getting the IdSetHybrid code to compile, and sorry, I don't recall whether I tested the smart strategy.
but the IdSetHybrid seems to be missing this.
Yes, that's intentional; I didn't have any use for it, which is why I didn't implement the iterator in IdSetHybrid. You could use the one from IdSetDense as a starting point. It is likely a non-trivial effort to get this to work.
Thanks for the feedback. I am looking at the CRoaring approach now and not at the IdSetHybrid anymore, since CRoaring seems to have good potential; I seem to be able to speed up phase 3 by using some CRoaring functionality. I will also look into a couple of other ideas I have to speed up the performance.
Also, since the extract command does not seem to be very multithreaded, I see that I can run many extract commands in parallel, so it seems like the Canada case could be done in a couple of hours (with 160GB RAM and 20 vCPUs) instead of a couple of days, if I use about 800 extracts per config file instead of invoking the extract command for each of the 4500 tiles.
For me, CRoaring would have introduced too much of a dependency. All I needed was a mix of IdSetSmall and IdSetDense, which switches from an array to a bitmap container when hitting a break-even point. CRoaring does similar things (see https://arxiv.org/pdf/1603.06549.pdf, page 6), yet adds further performance improvements like run-length encoding and processor-specific optimizations. Also, I only needed to add values to IdSetHybrid and check for their existence later on.
What version of osmium-tool are you using?
What operating system version are you using?
Tell us something about your system
64GB RAM
What did you do exactly?
Cutting many extracts uses a large amount of memory (as documented).
What did you expect to happen?
I expected the internal memory requirement to be lower, assuming that a bitmap structure was being used to track dependent OSM elements.
What did happen instead?
It seems that element IDs are being stored as integers?
What did you do to try analyzing the problem?
I would just like to enquire if something like https://roaringbitmap.org/ could be used to keep track of which elements are required in each extract, in order to greatly reduce the memory requirements?