Closed by ricadete 7 years ago
java -jar gazetteer-1.4.jar
runs with the default JVM settings. The default memory limit depends on the JDK version, but it is roughly one or two gigabytes of RAM.
So the first step is to specify the amount of memory:
java -Xmx4g -jar gazetteer-1.4.jar
Next step: how many execution threads do you have? Each one takes about 0.5-1 GB of RAM. (That is an estimated average; some of them could take more.)
So if you have 8 or 16 threads, restrict join to a fixed number of threads:
java -Xmx4g -jar gazetteer-1.4.jar --threads 2 join --handlers out-gazetteer $outFile
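The rule of thumb above can be turned into a small helper. This is only a sketch: the function name and the 2 GB reserve are my assumptions (the reserve is chosen so that the sketch reproduces the `-Xmx4g --threads 2` example above), and the 1 GB per thread is the upper end of the estimate.

```shell
# Estimate a thread count for a given heap size (whole gigabytes).
# per_thread_gb=1 is the upper end of the 0.5-1 GB estimate above.
# reserve_gb=2 is an assumption, not a documented Gazetteer value.
threads_for_heap() (
    heap_gb=$1
    per_thread_gb=${2:-1}
    reserve_gb=${3:-2}
    threads=$(( (heap_gb - reserve_gb) / per_thread_gb ))
    # Never suggest fewer than one thread.
    if [ "$threads" -lt 1 ]; then threads=1; fi
    echo "$threads"
)

# Possible use:
#   java -Xmx4g -jar gazetteer-1.4.jar --threads "$(threads_for_heap 4)" \
#       join --handlers out-gazetteer "$outFile"
```

Very dense extracts may need far more memory per thread than this estimate, so treat the result as a starting point.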
Good morning,
We have a VM with 15 GB of RAM and we ran the join as: java -Xmx10g -jar gazetteer-1.4.jar --threads 1 join --handlers out-gazetteer $outFile
It runs for hours and eventually gets stuck; it does not throw any exception, it just stops. We also tracked the memory, and it rises up to 11 GB. Eventually I had to stop the process. Do you have any idea what is happening? Maybe you can also check with this file: http://download.geofabrik.de/europe/netherlands-latest.osm.bz2
Best regards,
Ok, I'll check it. Not enough minerals.
Hi again!
So it finally did it: we had to increase the memory to 15 GB, use a single thread, and wait around 14 hours. It seems that we may have a memory leak somewhere: memory use keeps increasing even though your code works split by split, which is suspicious. To help you, I have attached my logs. Let me know if you find them useful.
FYI: The generated netherlands_2015-11-24-22:38:45.json.bz is 186 GB.
These were the commands: bzcat /opt/data/regions/netherlands-latest.osm.bz2 | java -jar gazetteer-1.4.jar --log-level DEBUG --log-file netherlands-split-2015-11-24-22:38:45.log --data-dir netherlands split - none
java -jar gazetteer-1.4.jar --log-level DEBUG --log-file netherlands-slice-2015-11-24-22:50:06.log --data-dir netherlands slice --x10
java -Xmx15g -jar gazetteer-1.4.jar --log-level DEBUG --log-file netherlands-join-2015-11-24-23:12:40.log --data-dir netherlands --threads 1 join --handlers out-gazetteer netherlands_2015-11-24-22:38:45.json.bz
netherlands-join-2015-11-24-23:12:40.txt netherlands-slice-2015-11-24-22:50:06.txt netherlands-split-2015-11-24-22:38:45.txt
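For repeat runs, the three stages above could be generated by a small helper and reviewed before execution. This is a sketch only: the function name, its arguments, and the `JAR` variable are hypothetical, and it uses only flags that already appear in the commands above (the logging flags are left out for brevity).

```shell
# Print the three Gazetteer stages as shell commands, one per line.
# Paths are not quoted, so this sketch assumes names without spaces.
gazetteer_cmds() (
    input=$1      # e.g. netherlands-latest.osm.bz2
    data_dir=$2   # working directory shared by all stages
    output=$3     # file written by the out-gazetteer handler
    heap=${4:-15g}
    jar=${JAR:-gazetteer-1.4.jar}
    echo "bzcat $input | java -jar $jar --data-dir $data_dir split - none"
    echo "java -jar $jar --data-dir $data_dir slice --x10"
    echo "java -Xmx$heap -jar $jar --data-dir $data_dir --threads 1 join --handlers out-gazetteer $output"
)
```

One way to use it is to write the output to a file, inspect it, and then run it with `sh -e` so that a failing stage stops the run.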
split consumes memory from start to end and frees it at the end.
join should work with small pieces of data, so could you give me the output of
ls -lh | grep stripe
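If the listing is long, a one-line summary of the same stripe files may be easier to share. A minimal sketch, assuming the stripe files sit directly in the data directory and are named `stripe*` (as in `data/stripe18544.gjson.gz`); the function name is hypothetical.

```shell
# Summarise stripe files in a directory: count, total size, largest file.
stripe_summary() (
    dir=${1:-.}
    count=0; total=0; largest=0
    for f in "$dir"/stripe*; do
        [ -f "$f" ] || continue              # skip the unmatched glob itself
        size=$(( $(wc -c < "$f") ))          # arithmetic strips padding from wc
        count=$((count + 1))
        total=$((total + size))
        if [ "$size" -gt "$largest" ]; then largest=$size; fi
    done
    echo "files=$count total_bytes=$total largest_bytes=$largest"
)

# Possible use: stripe_summary netherlands
```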
The generated netherlands_2015-11-24-22:38:45.json.bz is 186 GB.
It's actually a design decision: every line contains all the data for an address, including the data for all address parts, so it can be processed line by line without fetching related objects. All related parts of an address are imprinted into the main feature. It takes a huge amount of space, but it was done on purpose.
You could override the out-gazetteer handler with a Groovy script to produce less verbose output. Or use the out-csv handler, which produces much less verbose output. If that's the case, I could write an example of such a handler.
The join inputs are small, between a few KB and a few MB: stripe.txt
I think you have done a great job so far :) let me know if I can help you somehow.
Thank you, but there are still a lot of things to be done.
So as I understand, most of the time was taken by join?
Yes, the join is really the bottleneck; if you check the logs, the last steps take a really long time. Nothing was really happening in the foreground; the only thing I saw was the PID still running.
2015-11-25 06.59.42.555 [main] INFO JoinExecutor - Join stripes done in 7:47:01.702
2015-11-25 06.59.42.562 [main] INFO JoinBoundariesExecutor - Run join boundaries, with filter []
2015-11-25 06.59.48.480 [main] INFO JoinBoundariesExecutor - 2999 boundaries was sorted
2015-11-25 06.59.48.482 [main] INFO JoinBoundariesExecutor - Admin levels: [2, 3, 4, 6, 7, 8, 9, 10]
2015-11-25 07.00.05.797 [main] INFO JoinBoundariesExecutor - 0 boundaries skiped
2015-11-25 07.00.05.859 [main] INFO JoinBoundariesExecutor - Join boundaries done in 0:00:23.297
2015-11-25 07.00.05.859 [main] INFO JoinExecutor - Join boundaries done in 0:00:23.300
2015-11-25 12.30.30.31 [main] INFO GazetteerOutWriter - Wrote poi points: 277689
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote address points: 8701051
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote highway segments: 1139017
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote highway networks: 370605
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote place boundaries: 0
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote place points: 6502
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote admin boundaries: 2999
2015-11-25 12.30.30.55 [main] INFO JoinExecutor - All handlers done in 5:30:24.194
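To see at a glance which step dominates, the "done in" lines can be pulled out of the log and sorted by duration. A minimal sketch, assuming the log layout shown in the excerpt above (date, time, thread, level, class, " - ", message); the function name is hypothetical.

```shell
# List each "... done in H:MM:SS.mmm" log line as "<seconds> <step> (<class>)",
# slowest first. Reads the named log files, or stdin when none are given.
stage_durations() {
    awk '/ done in [0-9]+:[0-9][0-9]:[0-9][0-9]/ {
        split($NF, t, ":")                  # last field is H:MM:SS.mmm
        secs = t[1] * 3600 + t[2] * 60 + int(t[3])
        step = $0
        sub(/^.* - /, "", step)             # drop timestamp, thread, level, class
        sub(/ done in .*$/, "", step)       # keep only the step name
        printf "%d %s (%s)\n", secs, step, $5
    }' "$@" | sort -rn
}

# Possible use: stage_durations netherlands-join-2015-11-24-23:12:40.log
```

For the excerpt above this lists `28021 Join stripes (JoinExecutor)` first and `19824 All handlers (JoinExecutor)` second, so both the stripe join and the output handlers take hours.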
It's good news actually, kind of good news :)
So 5 hours 30 minutes were taken by sorting the results. Two things actually happen there:
I've added some options to skip this part in the last commit. I'll test it and let you know.
Please try 1.5: https://github.com/kiselev-dv/gazetteer/releases/tag/Gazetteer-1.5
If you didn't delete the --data-dir netherlands folder, just run it again with
java -Xmx10g -jar gazetteer-1.5.jar --log-file netherlands-join.log --data-dir netherlands --threads 1 join --handlers out-gazetteer out=netherlands.json.gz sort=NONE
Successfully converted the Netherlands in 4 hours, with 6 GB of RAM and two threads.
Good morning master,
I need your help once more. It seems that we need some tricks to process one of the most detailed countries in OSM: the Netherlands. So I ran the application as you suggested:
1st step bzcat $inputFile | java -jar gazetteer-1.4.jar split - none
2nd step java -jar gazetteer-1.4.jar slice --x10
3rd step java -jar gazetteer-1.4.jar join --handlers out-gazetteer $outFile
2015-11-20 10.01.17.187 [join-stripe18544.gjson.gz] ERROR JoinSliceRunable - Join failed. File: data/stripe18544.gjson.gz. java.lang.OutOfMemoryError: GC overhead limit exceeded at java.util.Arrays.copyOf(Arrays.java:2367) ...
and there are also more stripes failing after this one.
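To collect every stripe that failed, the log can be scanned for the "Join failed. File:" message. A minimal sketch, assuming each failure is logged on a single line in the form shown above; the function name is hypothetical.

```shell
# Print the unique stripe files named in "Join failed. File: <path>." lines.
# Reads the named log files, or stdin when none are given.
failed_stripes() {
    awk '/Join failed\. File:/ {
        for (i = 1; i < NF; i++) if ($i == "File:") {
            path = $(i + 1)
            sub(/\.$/, "", path)            # drop the sentence-ending dot
            print path
        }
    }' "$@" | sort -u
}

# Possible use: failed_stripes join.log | wc -l
```

The resulting list shows whether the failures cluster on a few large stripes or are spread across the whole run.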
Source of the file: http://download.geofabrik.de/europe/netherlands-latest.osm.bz2
What can be done here? Thank you in advance.