kiselev-dv / gazetteer

OSM ElasticSearch geocoder and addresses exporter
http://osm.me

Netherlands - Out of memory #39

Closed ricadete closed 7 years ago

ricadete commented 8 years ago

Good morning master,

I need your help once more. It seems that we need some tricks to handle one of the most detailed countries on OSM: the Netherlands. I ran the application as you suggested:

1st step bzcat $inputFile | java -jar gazetteer-1.4.jar split - none

2nd step java -jar gazetteer-1.4.jar slice --x10

3rd step java -jar gazetteer-1.4.jar join --handlers out-gazetteer $outFile

2015-11-20 10.01.17.187 [join-stripe18544.gjson.gz] ERROR JoinSliceRunable - Join failed. File: data/stripe18544.gjson.gz. java.lang.OutOfMemoryError: GC overhead limit exceeded at java.util.Arrays.copyOf(Arrays.java:2367) ...

and there are also more stripes failing after this one.

Source of the file: http://download.geofabrik.de/europe/netherlands-latest.osm.bz2

What can be done here? Thank you in advance.

kiselev-dv commented 8 years ago

java -jar gazetteer-1.4.jar will run with default settings. The default heap size depends on the JDK version, but it's something like one or two gigabytes of RAM.
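
To see what the default actually is on a given machine, HotSpot JVMs can print their effective flags with -XX:+PrintFlagsFinal. The sketch below is not part of gazetteer; max_heap_mib is a hypothetical helper name, and it assumes the usual "type name = value" layout of the MaxHeapSize line.

```shell
# Hypothetical helper: read the output of `java -XX:+PrintFlagsFinal -version`
# on stdin and print the default maximum heap size in MiB.
# Assumes the flag line looks like: "uintx MaxHeapSize := 2147483648 {product}"
max_heap_mib() {
  awk '$2 == "MaxHeapSize" { print int($4 / 1048576) }'
}

# Usage (requires a HotSpot JVM on PATH):
#   java -XX:+PrintFlagsFinal -version 2>/dev/null | max_heap_mib
```

If the printed value is well below what the join needs, that would be consistent with the OutOfMemoryError above.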

So the first step is to specify the amount of memory:

java -Xmx4g -jar gazetteer-1.4.jar 

Next step: how many execution threads do you have? Each one will take about 0.5-1g of RAM. (That's an estimated average; some of them could take more.)

So if you have 8 or 16 threads, restrict the join to a smaller number of threads:

 java -Xmx4g -jar gazetteer-1.4.jar --threads 2 join --handlers out-gazetteer $outFile
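
As a rough sketch of that estimate: take the heap size given to -Xmx, assume the worst-case figure of about 1 GB per thread, and keep some headroom because some threads can take more. The helper name suggest_threads and the choice to keep half of the heap as headroom are assumptions made for illustration, not something gazetteer provides.

```shell
# Hypothetical helper: suggest a --threads value for a given heap budget.
#   $1 = heap size in GB (the value passed to -Xmx)
#   $2 = estimated GB per thread (defaults to 1, the upper end of 0.5-1g)
# Half of the heap is kept as headroom; that factor is an assumption.
suggest_threads() {
  heap_gb=$1
  per_thread_gb=${2:-1}
  threads=$(( heap_gb / (per_thread_gb * 2) ))
  if [ "$threads" -lt 1 ]; then
    threads=1
  fi
  echo "$threads"
}

# Usage:
#   java -Xmx4g -jar gazetteer-1.4.jar --threads "$(suggest_threads 4)" join --handlers out-gazetteer $outFile
```

With a 4 GB heap this suggests 2 threads, which matches the command above.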

ricadete commented 8 years ago

Good morning,

so we have a vm with 15GB of ram and we ran the join as: java -Xmx10g -jar gazetteer-1.4.jar --threads 1 join --handlers out-gazetteer $outFile

It runs for hours and eventually gets stuck. It does not throw any exception, it just stops. We also tracked the memory: it rises up to 11GB. Eventually I had to stop the process. Do you have any idea what is happening? Maybe you can also check with this file: http://download.geofabrik.de/europe/netherlands-latest.osm.bz2

Best regards,

kiselev-dv commented 8 years ago

Ok, I'll check it. Not enough minerals.

ricadete commented 8 years ago

Hi again!

So it finally did it. We had to increase the memory to 15GB, set a single thread and wait around 14h. It seems that we may have a memory leak somewhere: the memory seems to be always increasing even though your code goes split by split, which is suspicious. To help you, I have attached my logs. Let me know if you find them useful.

FYI: The generated netherlands_2015-11-24-22:38:45.json.bz is 186GB.

These were the cmds: bzcat /opt/data/regions/netherlands-latest.osm.bz2 | java -jar gazetteer-1.4.jar --log-level DEBUG --log-file netherlands-split-2015-11-24-22:38:45.log --data-dir netherlands split - none

java -jar gazetteer-1.4.jar --log-level DEBUG --log-file netherlands-slice-2015-11-24-22:50:06.log --data-dir netherlands slice --x10

java -Xmx15g -jar gazetteer-1.4.jar --log-level DEBUG --log-file netherlands-join-2015-11-24-23:12:40.log --data-dir netherlands --threads 1 join --handlers out-gazetteer netherlands_2015-11-24-22:38:45.json.bz

netherlands-join-2015-11-24-23:12:40.txt netherlands-slice-2015-11-24-22:50:06.txt netherlands-split-2015-11-24-22:38:45.txt

kiselev-dv commented 8 years ago

split consumes memory from start to end and frees it at the end.

join should work with small pieces of data, so could you give me the output of

ls -lh | grep stripe
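
If the listing is long, a small helper can show only the biggest stripe files. This is a sketch, not part of gazetteer: largest_stripes is a hypothetical name, and it assumes the stripe files sit directly in the data directory and start with "stripe", as in the data/stripe18544.gjson.gz path from the error above.

```shell
# Hypothetical helper: list the largest stripe files in a data directory.
#   $1 = data directory (defaults to "data")
#   $2 = how many files to show (defaults to 10)
largest_stripes() {
  dir=${1:-data}
  count=${2:-10}
  # ls -S sorts by file size, largest first
  ls -S "$dir"/stripe* 2>/dev/null | head -n "$count"
}

# Usage:
#   largest_stripes netherlands 5
```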

The generated netherlands_2015-11-24-22:38:45.json.bz is 186GB.

It's actually a design issue: every line contains all the data for an address, including the data for all address parts, so it can be processed line by line without fetching related objects. All related parts of the address are imprinted into the main feature. It takes a huge amount of space, but it was done on purpose.

You could override the out-gazetteer handler with a Groovy script to produce less verbose output, or use the out-csv handler, which produces much less verbose output. If that's what you need, I could write an example of such a handler.

ricadete commented 8 years ago

The stripes are small, between a few KB and a few MB:

stripe.txt

I think you have done a great job so far :) let me know if I can help you somehow.

kiselev-dv commented 8 years ago

Thank you, but there are still a lot of things to be done.

So as I understand it, most of the time was taken by the join?

ricadete commented 8 years ago

Yes, the join is really the bottleneck. If you check the logs, the last steps take a really long time. Nothing was visibly happening in the foreground; the only thing I saw was that the pid was still running.

2015-11-25 06.59.42.555 [main] INFO JoinExecutor - Join stripes done in 7:47:01.702
2015-11-25 06.59.42.562 [main] INFO JoinBoundariesExecutor - Run join boundaries, with filter []
2015-11-25 06.59.48.480 [main] INFO JoinBoundariesExecutor - 2999 boundaries was sorted
2015-11-25 06.59.48.482 [main] INFO JoinBoundariesExecutor - Admin levels: [2, 3, 4, 6, 7, 8, 9, 10]
2015-11-25 07.00.05.797 [main] INFO JoinBoundariesExecutor - 0 boundaries skiped
2015-11-25 07.00.05.859 [main] INFO JoinBoundariesExecutor - Join boundaries done in 0:00:23.297
2015-11-25 07.00.05.859 [main] INFO JoinExecutor - Join boundaries done in 0:00:23.300
2015-11-25 12.30.30.31 [main] INFO GazetteerOutWriter - Wrote poi points: 277689
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote address points: 8701051
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote highway segments: 1139017
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote highway networks: 370605
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote place boundaries: 0
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote place points: 6502
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote admin boundaries: 2999
2015-11-25 12.30.30.55 [main] INFO JoinExecutor - All handlers done in 5:30:24.194

kiselev-dv commented 8 years ago

It's good news actually, a kind of good news :)

https://github.com/kiselev-dv/gazetteer/blob/develop/Gazetteer/src/main/java/me/osm/gazetter/join/out_handlers/GazetteerOutWriter.java#L969

So 5 hours 30 minutes were taken by sorting the results. Two things actually happen there:

  1. sort features by hierarchy (referenced features before the features that depend on them)
  2. merge highways into networks (to get one highway instead of tons of small segments)
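
The ordering in step 1 is a dependency (topological) order. The sketch below only illustrates that idea with the standard tsort utility; it is not how GazetteerOutWriter implements it, and the feature names and the order_by_dependency helper are made up for the example.

```shell
# Print features so that every referenced feature comes before the
# features that depend on it.
# Input: one "referenced dependent" pair per line.
order_by_dependency() {
  tsort
}

# Made-up feature names: an address references its street, the street
# references its city, and the city references its country.
printf '%s\n' 'country city' 'city street' 'street address' | order_by_dependency
```

Ordering all features this way means looking at every reference between them, which is one plausible reason the step gets slow on a dataset with millions of address points.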

I've added some options to skip this part in the last commit. I'll test it out and let you know.

kiselev-dv commented 8 years ago

Please try 1.5: https://github.com/kiselev-dv/gazetteer/releases/tag/Gazetteer-1.5

If you didn't delete the --data-dir netherlands folder, just run it again with

java -Xmx10g -jar gazetteer-1.5.jar --log-file netherlands-join.log --data-dir netherlands --threads 1 join --handlers out-gazetteer out=netherlands.json.gz sort=NONE

I successfully converted the Netherlands within 4 hours, using 6g of RAM and two threads.