noi-techpark / odh-mentor-otp


Resources needed from OTP #122

Closed RudiThoeni closed 2 years ago

RudiThoeni commented 2 years ago

Hi,

I wanted to ask: what is your RAM recommendation for the OTP application?

I am asking because I have had to increase the resources repeatedly, and now it seems I have to do it again. Currently the test and production servers have 16 GB of RAM (on the test server about 80% of the RAM is free for OTP).

I set the heap size for Java with the JAVA_MX environment variable. At the beginning it was set to 2 GB ;) and by now I have increased it to 16 GB.

As far as I understood, the RAM is needed when the backend application starts and reads the graph.
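(For context, a minimal sketch of how JAVA_MX relates to the JVM heap, based on the entrypoint command quoted later in this thread: the whole deserialized graph plus working memory must fit under this limit.)

# JAVA_MX is handed to the JVM as -Xmx, capping the heap that
# OTP can use while it deserializes Graph.obj at startup.
JAVA_MX=16G
exec java -Xmx"$JAVA_MX" -jar /usr/local/share/java/otp.jar "$@"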

Experimenting on the test server I noticed the following: with JAVA_MX set to 12 GB, the CPU goes to 99% and after some minutes the server begins to swap, then:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
otp_1         |         at com.esotericsoftware.kryo.io.Input.readDoubles(Input.java:885)

With JAVA_MX set to 16 GB, today I got:

otp_1         | 08:28:12.010 ERROR (InputStreamGraphSource.java:181) Exception while loading graph 'openmove'.
otp_1         | com.esotericsoftware.kryo.KryoException: Buffer underflow.

So what is your recommendation here: do I have to scale up with more RAM, or can something else be done? As far as I understood, the high RAM/CPU usage only occurs when the backend application starts and reads the graph?

I am not a Java expert; maybe you have some hints or optimizations in mind? Thanks.

RudiThoeni commented 2 years ago

More info:

otp_1         | 09:54:27.007 INFO (Graph.java:823) This graph was built with the currently running version and commit of OTP.
otp_1         | 09:58:06.171 ERROR (InputStreamGraphSource.java:181) Exception while loading graph 'openmove'.
otp_1         | com.esotericsoftware.kryo.KryoException: Buffer underflow.
otp_1         | Serialization trace:
otp_1         | edges (org.opentripplanner.routing.edgetype.SimpleTransfer)
otp_1         |         at com.esotericsoftware.kryo.io.Input.require(Input.java:199) ~[otp-unofficial.jar:1.1]
otp_1         |         at com.esotericsoftware.kryo.io.Input.readVarInt(Input.java:373) ~[otp-unofficial.jar:1.1]
otp_1         |         at com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:145) ~[otp-unofficial.jar:1.1]
otp_1         |         at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:133) ~[otp-unofficial.jar:1.1]
otp_1         |         at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:693) ~[otp-unofficial.jar:1.1]
otp_1         |         at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:804) ~[otp-unofficial.jar:1.1]
otp_1         |         at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:134) ~[otp-unofficial.jar:1.1]
otp_1         |         at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:40) ~[otp-unofficial.jar:1.1]
otp_1         |         at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731) ~[otp-unofficial.jar:1.1]
otp_1         |         at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125) ~[otp-unofficial.jar:1.1]
otp_1         |         at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543) ~[otp-unofficial.jar:1.1]
otp_1         |         at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813) ~[otp-unofficial.jar:1.1]
otp_1         |         at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:134) ~[otp-unofficial.jar:1.1]
otp_1         |         at com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:40) ~[otp-unofficial.jar:1.1]
otp_1         |         at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813) ~[otp-unofficial.jar:1.1]
otp_1         |         at org.opentripplanner.routing.graph.Graph.load(Graph.java:775) ~[otp-unofficial.jar:1.1]
otp_1         |         at org.opentripplanner.routing.impl.InputStreamGraphSource.loadGraph(InputStreamGraphSource.java:179) [otp-unofficial.jar:1.1]
otp_1         |         at org.opentripplanner.routing.impl.InputStreamGraphSource.reload(InputStreamGraphSource.java:103) [otp-unofficial.jar:1.1]
otp_1         |         at org.opentripplanner.routing.services.GraphService.registerGraph(GraphService.java:183) [otp-unofficial.jar:1.1]
otp_1         |         at org.opentripplanner.routing.impl.GraphScanner.startup(GraphScanner.java:69) [otp-unofficial.jar:1.1]
otp_1         |         at org.opentripplanner.standalone.OTPMain.run(OTPMain.java:131) [otp-unofficial.jar:1.1]
otp_1         |         at org.opentripplanner.standalone.OTPMain.main(OTPMain.java:74) [otp-unofficial.jar:1.1]
otp_1         | 09:58:06.173 WARN (InputStreamGraphSource.java:114) Unable to load data for router 'openmove'.
zabuTNT commented 2 years ago

Hi Rudi, I'll look into this. Such high memory usage seems strange: it is expected while the graph is being built, not when it is loaded. Are you sure you are not running multiple instances? Are you sure the memory is not occupied by other processes?

How big is your Graph.obj? It should be around 200-300 MB.

To save some memory we can cache the elevation data computed during the build (so subsequent builds skip that computation). But this helps at build time, not at run time.
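(A hedged sketch of what enabling that cache could look like in build-config.json, assuming this OTP fork exposes mainline OTP's readCachedElevations/writeCachedElevations build parameters, which were introduced in OTP 1.4; verify against the otp-unofficial.jar actually deployed here before relying on it.)

{
  "writeCachedElevations": true,
  "readCachedElevations": true
}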

To give you an idea: in production we run an OTP instance with two graphs, the Trentino area with the Trentino Trasporti GTFS and the Veneto region with Trenitalia + ACTV (Venezia) + ATV (Verona), and JAVA_MX is 3 GB at runtime.

RudiThoeni commented 2 years ago

Hi @zabuTNT, thanks for your answer. My impression was always that with every deployment the memory usage increased.

There are no multiple instances running, and when not deploying the server does not use many resources (30% memory, 5-10% CPU).

You pointed at the right place: Graph.obj. I think the problem is there ;) Our Graph.obj is 8 GB; I didn't know it should be around 200-300 MB.

So I think there is something wrong with our calculation pipeline: with every graph recalculation it gets bigger.

RudiThoeni commented 2 years ago

I think the problem is in the graph calculation: https://github.com/noi-techpark/odh-mentor-otp/blob/master/infrastructure/docker/otp/docker-entrypoint.sh#L100

In the graph calculation pipeline this calls otp.sh --build /data, which runs exec java -Xmx"$JAVA_MX" -jar /usr/local/share/java/otp.jar "$@"

It seems that with a Graph.obj of this size the application goes out of memory when loading it.
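(Not from the repo; a hypothetical pre-start sanity check, with names and paths assumed for illustration, that captures the relationship: the heap must comfortably exceed Graph.obj, since OTP deserializes the whole file into memory at startup.)

# Hypothetical guard: refuse to start the router if the configured
# heap cannot even hold the serialized graph.
GRAPH=/data/openmove/Graph.obj
GRAPH_MB=$(( $(stat -c%s "$GRAPH") / 1024 / 1024 ))
HEAP_MB=16384   # whatever JAVA_MX resolves to, in MB
if [ "$GRAPH_MB" -ge "$HEAP_MB" ]; then
    echo "Graph.obj (${GRAPH_MB} MB) cannot fit in a ${HEAP_MB} MB heap" >&2
    exit 1
fi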

Maybe you have a hint where I can look? (Since I am new to this project it's not so easy for me to understand everything that's going on here ;) but this size of the .obj file seems too big to me.

Many thanks

RudiThoeni commented 2 years ago

I deleted the Graph.obj file and restarted the graph calculation, but it again reached a file size of 7.7 GB:

/data/openmove # ls -lha
total 8G
drwxr-xr-x    2 root     root        6.0K Dec  9 15:40 .
d-wxrw--wt    3 root     root        6.0K Dec  9 15:40 ..
-rw-r--r--    1 root     root        7.7G Dec  9 15:38 Graph.obj

I attach the graph calculation log; there too an out-of-memory error was thrown at the end: logs.txt

zabuTNT commented 2 years ago

OK, 7 GB is really strange. The size of a graph depends on the static data that OTP finds in the folder when it builds the graph. So please check how many maps (.osm.pbf), GTFS feeds (zips) and elevation data (.tif) you have in the /data folder.
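(For example, something like this gives the per-input sizes at a glance; the file patterns are assumptions based on a typical OTP data folder.)

# Quick overview of what feeds the build (patterns assumed):
du -sh /data/*.osm* /data/gtfs_*.zip /data/*.tif 2>/dev/null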

Most of the time a big graph is caused by many maps, or by a single really big one (Europe, maybe).

For example: Trentino-Südtirol is around 200 MB, Austria 600 MB, Italy around 2.5 GB.

RudiThoeni commented 2 years ago

OK, that explains more ;)

Currently the data folder contains the following. @rcavaliere how can we tell if there is unneeded or duplicate data inside? If we need all this data and the graph file has to be this size, there is no other way than to scale up.

/data # ls -lha
total 4G
-rw-------    1 root     root        3.0G Oct 28 15:14 all.osm
-rw-r--r--    1 root     root         124 Dec  9 14:45 build-config.json
-rw-r--r--    1 root     root       84.5M Nov 10 02:00 gtfs_1eAOMCiOJlgAQmLztZLYKeFQi3n169i7.zip
-rw-r--r--    1 root     root       84.5M Nov  4 02:00 gtfs_3hjEyr63xqElLlYpIs4FuHPzabAPeYqZ.zip
-rw-r--r--    1 root     root       84.5M Nov  5 02:00 gtfs_Cx9eWsuf8M2venSy1uxZWU73Jvvd63FO.zip
-rw-r--r--    1 root     root       84.5M Nov 12 02:00 gtfs_NyfyQ8mGZEhJ9wu2PDb6rTMJm4RLUurC.zip
-rw-r--r--    1 root     root       84.5M Dec  4 02:00 gtfs_ZHB3JRr4tgkTX9J1Ky8JWElSlrQNJ1LC.zip
-rw-r--r--    1 root     root       84.5M Oct 28 15:14 gtfs_aJ7IxZFFb6MHIUWHXvQy1R2K1gJLa8Il.zip
-rw-r--r--    1 root     root       84.5M Nov 11 02:00 gtfs_bqhcS6TjLJZSGjeuis6Bx4kC9JI4tCEB.zip
-rw-r--r--    1 root     root       84.5M Nov 20 02:00 gtfs_dj7JIjVA6j0Vq4Q5l5pATJFclxcCHRcb.zip
-rw-r--r--    1 root     root       29.0K Dec  9 02:00 gtfs_download.log
-rw-r--r--    1 root     root       84.5M Nov 24 02:00 gtfs_e6AKpGW8KsmNSEhIxuwh19R5TWuNlP7H.zip
-rw-r--r--    1 root     root       84.5M Oct 28 15:14 gtfs_ffWJ41WwZHFqr4Vb3VPHCOa5XEFkTFbE.zip
-rw-r--r--    1 root     root       84.5M Nov 18 02:00 gtfs_hr5lKvMNlrFy1klPWIvgRNfGUnCORqDk.zip
-rw-r--r--    1 root     root       84.5M Nov 19 02:00 gtfs_xmgBYYM0ItAj0q4Fdxz4q3GJZqoABSLj.zip
-rw-r--r--    1 root     root       52.0M Oct 28 15:14 latestGTFS.zip
drwxr-xr-x    2 root     root        6.0K Dec  9 15:40 openmove
-rw-r--r--    1 root     root        2.1K Oct 28 15:15 osm.url
-rw-r--r--    1 root     root        1.1K Dec  9 14:45 router-config.json
-rw-r--r--    1 root     root       68.8M Oct 28 15:15 srtm_39_03.tif
-rw-r--r--    1 root     root       35.9M Oct 28 15:15 srtm_39_03.zip

On production we have this data:

/data # ls -lha
total 4G
-rw-------    1 root     root        3.0G Oct 28 13:43 all.osm
-rw-r--r--    1 root     root         124 Dec  1 13:09 build-config.json
-rw-r--r--    1 root     root       84.5M Nov  5 04:00 gtfs_7Nqcnlf1ypsogLG1Lix05WOHigggI4qD.zip
-rw-r--r--    1 root     root       84.5M Oct 29 04:00 gtfs_KIxgBMrgJUAZ4S1uwgrv5nePpyww2oZ1.zip
-rw-r--r--    1 root     root       84.5M Oct 30 04:00 gtfs_LzK3ug7pKW14jgRlopSmD8q5b3LIbEpX.zip
-rw-r--r--    1 root     root       84.5M Nov 20 04:00 gtfs_MvtCoL238r5WatbEkylDQR4AT1sfBNNn.zip
-rw-r--r--    1 root     root       84.5M Nov 11 04:00 gtfs_XR2BSraoMVxE6FTBI7AtTlTMn6PUjnE1.zip
-rw-r--r--    1 root     root       84.5M Oct 28 13:43 gtfs_aJ7IxZFFb6MHIUWHXvQy1R2K1gJLa8Il.zip
-rw-r--r--    1 root     root       84.5M Nov 24 04:00 gtfs_cZBfr8tGfWWLqBQ7clQlEReHVy5fNXPI.zip
-rw-r--r--    1 root     root       84.5M Dec  4 04:00 gtfs_cyybp8lOEeFlSI87qJyXs7vPM4oaXwkr.zip
-rw-r--r--    1 root     root       39.8K Dec  9 04:00 gtfs_download.log
-rw-r--r--    1 root     root       84.5M Oct 28 13:43 gtfs_ffWJ41WwZHFqr4Vb3VPHCOa5XEFkTFbE.zip
-rw-r--r--    1 root     root       84.5M Nov 12 04:00 gtfs_sxZPnoEwnVlNtaQ86Ct80cmuHfSB1y9l.zip
-rw-r--r--    1 root     root       52.0M Oct 28 13:43 latestGTFS.zip
drwxr-xr-x    2 root     root        6.0K Dec  1 13:26 openmove
-rw-r--r--    1 root     root        2.1K Oct 28 13:43 osm.url
-rw-r--r--    1 root     root        1.1K Dec  1 13:09 router-config.json
-rw-r--r--    1 root     root       68.8M Oct 28 13:43 srtm_39_03.tif
-rw-r--r--    1 root     root       35.9M Oct 28 13:43 srtm_39_03.zip

The graph there is about 6 GB ;)

RudiThoeni commented 2 years ago

@rcavaliere could it be that there are too many GTFS files?

Examining gtfs_download.log, this is the output. Maybe we forgot to delete the old GTFS file here? That would explain why the graph gets heavier with each passing day ;)

[11/30/21_02:00:00] Download new gtfs and checksum...
-rw-r--r--    1 root     root      88634121 Nov 30 02:00 /tmp/gtfs_wrnWQWFHHn3U4ABK0UC9fQUuu9PEK3Cl.zip
[11/30/21_02:00:00] new checksum e7d4d89e1dc6342c8518fc4434efd9645c1fb191fca422796cab1f974426f068
[11/30/21_02:00:00] gtfs not changed!

[12/01/21_02:00:00] Download new gtfs and checksum...
-rw-r--r--    1 root     root      88634121 Dec  1 02:00 /tmp/gtfs_4tnthTx3gxD3zNkGVcnh9ylZfnS33xHp.zip
[12/01/21_02:00:00] new checksum e7d4d89e1dc6342c8518fc4434efd9645c1fb191fca422796cab1f974426f068
[12/01/21_02:00:00] gtfs not changed!

[12/02/21_02:00:00] Download new gtfs and checksum...
-rw-r--r--    1 root     root      88634121 Dec  2 02:00 /tmp/gtfs_vEcv5AhhjX3bcRVytGgzw2kC7SpdY7db.zip
[12/02/21_02:00:00] new checksum e7d4d89e1dc6342c8518fc4434efd9645c1fb191fca422796cab1f974426f068
[12/02/21_02:00:00] gtfs not changed!

[12/03/21_02:00:00] Download new gtfs and checksum...
-rw-r--r--    1 root     root      88634121 Dec  3 02:00 /tmp/gtfs_QWY7i7nUrMz3lbbSekGivjD4kWaoPeMo.zip
[12/03/21_02:00:00] new checksum e7d4d89e1dc6342c8518fc4434efd9645c1fb191fca422796cab1f974426f068
[12/03/21_02:00:00] gtfs not changed!

[12/04/21_02:00:00] Download new gtfs and checksum...
-rw-r--r--    1 root     root      88634121 Dec  4 02:00 /tmp/gtfs_ZHB3JRr4tgkTX9J1Ky8JWElSlrQNJ1LC.zip
[12/04/21_02:00:00] new checksum e7d4d89e1dc6342c8518fc4434efd9645c1fb191fca422796cab1f974426f068
[12/04/21_02:00:00] run rebuild hook...
hook http response: 201

[12/05/21_02:00:00] Download new gtfs and checksum...
-rw-r--r--    1 root     root      88634121 Dec  5 02:00 /tmp/gtfs_pltfIlmfDw3c5l3d5TP3Qp3BJ1NTERRi.zip
[12/05/21_02:00:00] new checksum e7d4d89e1dc6342c8518fc4434efd9645c1fb191fca422796cab1f974426f068
[12/05/21_02:00:00] gtfs not changed!

[12/06/21_02:00:00] Download new gtfs and checksum...
-rw-r--r--    1 root     root      88634121 Dec  6 02:00 /tmp/gtfs_1bueFzfBJRRaiiB27VI1ECmjOsJFzNuY.zip
[12/06/21_02:00:00] new checksum e7d4d89e1dc6342c8518fc4434efd9645c1fb191fca422796cab1f974426f068
[12/06/21_02:00:00] gtfs not changed!

[12/07/21_02:00:00] Download new gtfs and checksum...
-rw-r--r--    1 root     root      88634121 Dec  7 02:00 /tmp/gtfs_CLG8JUeyyZddxoA3e4yqKHpwb3k7rygi.zip
[12/07/21_02:00:00] new checksum e7d4d89e1dc6342c8518fc4434efd9645c1fb191fca422796cab1f974426f068
[12/07/21_02:00:00] gtfs not changed!

[12/08/21_02:00:00] Download new gtfs and checksum...
-rw-r--r--    1 root     root      88634121 Dec  8 02:00 /tmp/gtfs_KEE3s9r33kf05HgyBAvU0taHs7zJ20ai.zip
[12/08/21_02:00:00] new checksum e7d4d89e1dc6342c8518fc4434efd9645c1fb191fca422796cab1f974426f068
[12/08/21_02:00:00] gtfs not changed!

[12/09/21_02:00:00] Download new gtfs and checksum...
-rw-r--r--    1 root     root      88634121 Dec  9 02:00 /tmp/gtfs_AwOZnNAPnr1PcNY1vKeWohienfDtJlXG.zip
[12/09/21_02:00:00] new checksum e7d4d89e1dc6342c8518fc4434efd9645c1fb191fca422796cab1f974426f068
[12/09/21_02:00:00] gtfs not changed!
zabuTNT commented 2 years ago

Yes, I think the issue here is that the old GTFS files are not deleted. This way, with every new GTFS published by STA and every build, you get more and more repeated data inside the graph.

The map file is 3 GB, but that could be OK because it is not compressed or filtered. A storage optimization would be to filter the map to remove unused objects (e.g. buildings), but this is not related to the size of the final graph.
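(One possible way to do that filtering, assuming osmium-tool is available in the build environment; this strips objects tagged as buildings, which the router does not need.)

# Hedged sketch: -i/--invert-match drops everything matching the
# expression, so objects carrying a building tag are removed.
osmium tags-filter -i all.osm building -o all-filtered.osm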

RudiThoeni commented 2 years ago

Thanks, always happy when there are logical explanations ;) I will fix it!

rcavaliere commented 2 years ago

@RudiThoeni I confirm that we have too many GTFS files in this folder... just keep the latest version!

zabuTNT commented 2 years ago

One more thing: the file latestGTFS.zip must also be deleted, not only the files named gtfs_XXX, because it is probably the cause of the missing clustering on the frontend.

STA added parent stations to the GTFS after that date.

So if a GTFS without parent stations exists, we will see on the map both clusters (from the new GTFS) and single stops (from the old one), because OTP treats them as different elements, since they come from two different GTFS feeds.

OTP allows multiple GTFS feeds even if they have the same agency, for example because the calendars could differ (future services), or because there is no rule for agency IDs, so two agencies could end up with the same ID (OTP creates a unique one in that case).

RudiThoeni commented 2 years ago

Thanks @zabuTNT
On the test server I have now deleted all the old gtfs_xx files and added logic to the script that deletes all old GTFS files when a new one is downloaded; latestGTFS.zip was deleted as well. A graph calculation is now done in < 6 minutes, and the graph size is about 450 MB ;)
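(A hedged sketch of that cleanup step; the real download script in the repo may look different, and NEW_ZIP/NEW_SUM/OLD_SUM are assumed to be set by the download and checksum steps shown in gtfs_download.log above.)

# Only when the checksum changed: drop every stale feed, then keep
# just the freshly downloaded one.
if [ "$NEW_SUM" != "$OLD_SUM" ]; then
    rm -f /data/gtfs_*.zip /data/latestGTFS.zip
    mv "$NEW_ZIP" /data/
fi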

I deployed everything on the TEST server. @rcavaliere please have a look whether all stations are there; then I will apply every change to production. I attach the log of the OTP application just to be sure everything is right: otp_start_log.txt

Now when I compare prod with test I see a lot more points on prod; as you explained, the single stops are no longer shown on test (for example at the Merano station).

(prod) [screenshot]

(test) [screenshot]

rcavaliere commented 2 years ago

@RudiThoeni @zabuTNT thanks to both; on testing it looks much better and cleaner. The complex stations also have all their detailed stop points, so I think we can put this on production.

RudiThoeni commented 2 years ago

OK, fixed, and everything is on production now.