opentripplanner / OpenTripPlanner

An open source multi-modal trip planner
http://www.opentripplanner.org
Other
2.13k stars 1.01k forks source link

Replace Java serialization #2643

Closed abyrd closed 3 years ago

abyrd commented 5 years ago

Java serialization is known to be very slow and space inefficient. We've long depended on FST serialization library for the R5 project, and are now migrating R5 to the widely-used Kryo library to allow saving larger networks.

Now that I've figured out how to configure and use it, we should also migrate OTP to Kryo. There is a potential for very significant speed and size improvements, and much faster OTP startup.

abyrd commented 5 years ago

I am able to serialize an entire Graph to disk with Kryo. I had to provide a custom serializer for unmodifiable collections, which are used by OBA GTFS library (though those classes won't appear in our serialized graphs once we merge #2495). It appears that some Guava classes were serialized as generic Maps and cannot be deserialized. OBA also contains some maps with unmodifiable lists as keys that are confusing the custom unmodifiable collection deserializer. This needs some work, which might be easier if we do it after we have control over the internal OTP data model.

Here is the histogram of serialized classes and the serializers Kryo is applying to them:

Count Class Serializer
4515168 StreetEdge FieldSerializer
2354037 OsmVertex FieldSerializer
1627550 NonLocalizedString FieldSerializer
1099025 Integer IntSerializer
706010 TransitStop FieldSerializer
494838 PatternArriveVertex FieldSerializer
494763 PatternDepartVertex FieldSerializer
494056 TransitBoardAlight FieldSerializer
448371 AgencyAndId FieldSerializer
433132 PatternHop FieldSerializer
390085 LocalizedString FieldSerializer
385892 StreetTransitLink FieldSerializer
260479 GeometryFactory FieldSerializer
260478 Double FieldSerializer
260477 LineString FieldSerializer
242456 PatternDwell FieldSerializer
216052 Trip FieldSerializer
187521 TripPattern FieldSerializer
175037 ArrayList CollectionSerializer
165474 SimpleTransfer FieldSerializer
133961 TransitStopArrive FieldSerializer
133886 TransitStopDepart FieldSerializer
123180 SplitterVertex FieldSerializer
117750 String StringSerializer
102360 BitSet JavaSerializer
99414 TripTimes FieldSerializer
24952 AreaEdge FieldSerializer
24154 TraverseModeSet FieldSerializer
16094 IntArray FieldSerializer
13744 PreBoardEdge FieldSerializer
13744 PreAlightEdge FieldSerializer
11076 AreaEdgeList FieldSerializer
8083 BarrierVertex FieldSerializer
3907 P2 FieldSerializer
3904 LinkedList CollectionSerializer
3903 SpecificTransfer FieldSerializer
3903 StopTransfer FieldSerializer
3786 TransitBoardAlight[] ObjectArraySerializer
3786 PatternHop[] ObjectArraySerializer
3482 TurnRestriction FieldSerializer
2024 StringArray FieldSerializer
1972 Stop FieldSerializer
1893 PatternArriveVertex[] ObjectArraySerializer
1893 PatternDepartVertex[] ObjectArraySerializer
1893 StopPattern FieldSerializer
1893 TransitStop[] ObjectArraySerializer
1893 Timetable FieldSerializer
1893 PatternDwell[] ObjectArraySerializer
1864 PatternInterlineDwell FieldSerializer
1864 HashBiMap MapSerializer
1683 ExitVertex FieldSerializer
1421 TranslatedString FieldSerializer
1276 FreeEdge FieldSerializer
640 ElevatorHopEdge FieldSerializer
638 ElevatorBoardEdge FieldSerializer
638 ElevatorAlightEdge FieldSerializer
632 ServiceDate FieldSerializer
560 TurnRestrictionBad FieldSerializer
483 ParkAndRideVertex FieldSerializer
407 GraphConnectivity FieldSerializer
345 HashSet CollectionSerializer
211 ElevatorOnboardVertex FieldSerializer
207 ElevatorOffboardVertex FieldSerializer
204 ParkAndRideLinkEdge FieldSerializer
94 Envelope FieldSerializer
71 TransitStation FieldSerializer
44 HashMap MapSerializer
34 StopNotLinkedForTransfers FieldSerializer
31 TransitStopStreetVertex FieldSerializer
30 ParkAndRideEdge FieldSerializer
29 UnmodifiableRandomAccessList UnmodifiableCollectionsSerializer
22 StopLinkedTooFar FieldSerializer
15 TurnRestrictionException FieldSerializer
12 HopSpeedSlow FieldSerializer
11 int[] IntArraySerializer
6 FareRuleSet FieldSerializer
5 TraverseMode EnumSerializer
4 Type FieldSerializer
4 Serializable2DPackedCoordinateSequenceFactory FieldSerializer
4 Agency FieldSerializer
4 StreetTraversalPermission EnumSerializer
4 PrecisionModel FieldSerializer
3 String[] StringArraySerializer
3 Class ClassSerializer
3 ZoneInfo TimeZoneSerializer
2 StaticStreetNotesSource FieldSerializer
2 Route FieldSerializer
2 CalendarServiceData FieldSerializer
2 EmptyMap CollectionsEmptyMapSerializer
1 WorldEnvelope FieldSerializer
1 FeedInfo FieldSerializer
1 LinearRing[] ObjectArraySerializer
1 OnBoardDepartServiceImpl FieldSerializer
1 ParkAndRideUnlinked FieldSerializer
1 MortonVertexComparatorFactory FieldSerializer
1 Graph FieldSerializer
1 int IntSerializer
1 void VoidSerializer
1 DateTime FieldSerializer
1 TurnRestrictionType EnumSerializer
1 RealTimeState EnumSerializer
1 FareAttribute FieldSerializer
1 Deduplicator FieldSerializer
1 FareType EnumSerializer
1 long LongSerializer
1 DefaultFareServiceImpl FieldSerializer
1 byte ByteSerializer
1 byte[] ByteArraySerializer
1 TransferTable FieldSerializer
1 MavenVersion FieldSerializer
1 short ShortSerializer
1 boolean BooleanSerializer
1 Note FieldSerializer
1 LinearRing FieldSerializer
1 double[] DoubleArraySerializer
1 OnBoardDepartService FieldSerializer
1 Coordinate FieldSerializer
1 Double DoubleSerializer
1 StreetNotesService FieldSerializer
1 StopClusterMode EnumSerializer
1 float FloatSerializer
1 Date DateSerializer
1 Stop[] ObjectArraySerializer
1 double DoubleSerializer
1 GraphBundle FieldSerializer
1 char CharSerializer
1 Polygon FieldSerializer
1 Graphwide FieldSerializer
1 FareService FieldSerializer
1 HashMultimap FieldSerializer
abyrd commented 5 years ago

@t2gran this work on serialization might interest you. I have pushed initial work to the branch kryo-serialization.

t2gran commented 5 years ago

Yes, indeed it it very interesting, we did test kryo a while back, but due to other priorities we did not finish the job. I probably introduced the unmodifiable collections to enforce encapsulation - but it can be loosened up to make the serialization work , and we can add it back in later.

We have also tested protostuff. @gmellemstrand can you provide more details. It was easier to set up and the deserialization is slightly faster:

https://github.com/eishay/jvm-serializers/wiki

I have not used these two libraries, so I cannot say witch is best. But I my guess is that they are similar, also on long term maintenance - witch I think is important for which one to choose.

I will ask @gmellemstrand to make the code available for the protostuff and a few words about the result, so can we compare the two libraries.

csolem commented 5 years ago

The code for experimenting with protostuff is in this branch: protostuff_poc And the test file: https://github.com/entur/OpenTripPlanner/blob/protostuff_poc/src/test/java/org/opentripplanner/routing/core/ProtoStuffTest.java @gmellemstrand will push his recent changes to this branch.

How we tested:

We started by testing with a minimal dataset and we had one issue, the edge arrays in Vertex (incoming and outgoing) had to be instantiated after deserialization. Except for that, we were able to run a routing test after deserialization.

The second attempt was to use real data to build the graph. In our case norway-latest.osm and rb_norway-aggregated-gtfs.zip. Then, we had an issue during deserilization:

java.lang.RuntimeException: Reading from a byte array threw an IOException (should never happen).
    at io.protostuff.GraphIOUtil.mergeFrom(GraphIOUtil.java:63)

@gmellemstrand found out that it might had to do something with head signs. He temporary removed head signs and it worked. Maybe it could be solved more permanently with delegates?

The reason why we tested protostuff was because it worked well with automatic schema changes for other large models we use: SIRI and NeTEx. NeTEx has over 2800 class files generated from XSD so Protostuffs automatic schema generation is useful over Protobuf. Proof of concepts for these models, with marshalled xml for comparing the result, can be found here for SIRI and here for NeTEx

It would be interesting to see how Protostuff compares with Kryo on performance, size and maintainability in this usecase.

gmellemstrand commented 5 years ago

@abyrd

We have managed to serialize and deserialize a graph now and the resulting graph is mostly identical. This is tested in the class ProtoStuffTest. Some graph fields are made public for the purpose of this test.

https://github.com/entur/OpenTripPlanner/tree/protostuff_poc

abyrd commented 5 years ago

Thanks for all the additional information @csolem and @gmellemstrand. It sounds like your original research into fast/compact serialization was because you were looking for ways to serialize SIRI/Netex data model classes. What was the use case for that?

I would of course like to make a fair comparison between Protostuff and Kryo. My (as yet not fully researched) initial impression is that Protostuff might be intended for slightly different use cases - if there is a parallel with Protocol Buffers, the idea of "forward-backward compatibility" might be intended to allow loading newer messages into older systems and vice versa, which we don't necessarily want in OTP - it just seems safer to break all the saved graphs when the schema changes.

The fact that Protostuff seems to require an explicit schema-generating step seems different than Kryo but may be an advantage - as it is, we just trust the serialization system to faithfully replicate the in-memory structure without examining its conclusions about what it should save and how.

I hope to adapt your tests to also verify that Kryo is performing a lossless round-trip of the Graph.

csolem commented 5 years ago

The reson for testing protostuff with SIRI/Netex data was because of slow XML marshalling/unmarshalling, high memory footprint when handling large amounts of concurrent data and the ability to separate XML marshalling from other components.

abyrd commented 5 years ago

@csolem what I meant is: why did you need to serialize and deserialize SIRI/Netex data model classes? If it's not in XML, I'm guessing it's internal to your own trip planning data pipeline.

I'm not necessarily looking for a detailed description, just trying to understand how your system works. Most of my work is within trip planning components, and perhaps my understanding is biased by that.

I see a comment in org.opentripplanner.routing.core.ProtoStuffTest#testProtoStuffWithEdge where it says:

// Seems like I have to use GraphIOUtil instead of ProtostuffIOUtil to avoid stack overflow exception with SIRI

But I don't understand why OTP would be serializing SIRI objects or Netex objects in its graph.

abyrd commented 5 years ago

@gmellemstrand / @csolem the branch protostuff_poc imports org.apache.commons.lang.builder.EqualsBuilder in GenericObjectDiffer but does not declare this dependency in the POM, I had to add a dependency on Apache commons lang.

abyrd commented 5 years ago

I have done a few round trips on Portland data to test out and observe Protostuff. My initial observations are things that will become relevant with large graphs.

ProtoStuffTest writes the graph out to a byte buffer, then copies that byte buffer to disk. Likewise, when loading, the graph is decoded from that same byte buffer (without re-loading it from disk). For large graphs we will instead want to stream the serialized representation to disk, otherwise we could easily double peak memory requirements when saving or reloading the graph. I first checked whether Protostuff supports streaming serialization and deserialization. If we use GraphIOUtil.writeTo(new FileOutputStream("protostuff.file"), edgeInfo, schema, buffer) this in turn creates a new ProtostuffOutput(buffer, out) which uses a WriteSink.STREAMED. Pausing execution of io.protostuff.GraphIOUtil#writeTo at its final line LinkedBuffer.writeTo(out, buffer) we see that this final line is just flushing the buffer, which has not grown beyond its original size or added new linked buffers. So the serialization appears to be fully streamed, without accumulating anything in memory. So far so good.

However, when the same approach is applied to read the file using GraphIOUtil.mergeFrom(new FileInputStream("protostuff.file"), edgeInfoFromProtostuff, schema), we get the error ProtobufException: Protocol message was too large. May be malicious. Use CodedInput.setSizeLimit() to increase the size limit. We see that CodedInput.setSizeLimit() takes an int parameter, and stores the limit in an int field. Its Javadoc says it is intended to prevent int overflows, and that "size limits only apply when reading from an InputStream, not when constructed around a raw byte array." However Java arrays use 32-bit signed integers indexes.

I would conclude that Protostuff is not intended for messages >2GB in size, and not suited for writing and reading such messages from streams. There has been an open issue about this since 2012, and an issue comment from last year says that the default limit could be increased to 2GB but does not mention going beyond that (https://github.com/protostuff/protostuff/issues/155).

My general sense is that Protostuff is mostly intended for sending small messages across the wire and cacheing moderate-sized objects, not really for persisting huge object graphs. Kryo does seem to be intentionally designed to support our use case. Kryo's Input/Output#total was changed to a long in November of 2013. But of course it should still be tested with such large graphs to verify that it works as expected.

Finally, also note that the default buffer size in Protostuff is very small at 512 bytes - in my tests I increased it to 8k which is generally an efficient size.

abyrd commented 5 years ago

On the other hand, using our current configurations Protostuff seems like a clear leader on speed and size. On my test data set, it writes 179MB in 3 seconds as opposed to Kryo's 366MB in 8 seconds. Probably Kryo could be improved by registering a bunch of custom serializers, but I'd rather avoid that if possible. Protostuff seems to do well without so much customization. If only it didn't have size limits...

csolem commented 5 years ago

@csolem what I meant is: why did you need to serialize and deserialize SIRI/Netex data model classes? If it's not in XML, I'm guessing it's internal to your own trip planning data pipeline.

You are right, it is internal to our own trip planning data pipeline, when it comes to receiving and delivering SIRI/NeTEx data. Sorry for the confusion. It was only ment to explain why we started to test protostuff, and why we also wanted to test it with OTP Graphs.

But I don't understand why OTP would be serializing SIRI objects or Netex objects in its graph.

Comment comes from the SIRI proof of concept in another component. Please ignore it. Sorry for the confusion.

@gmellemstrand / @csolem the branch protostuff_poc imports org.apache.commons.lang.builder.EqualsBuilder in GenericObjectDiffer but does not declare this dependency in the POM, I had to add a dependency on Apache commons lang.

Ok. It was probably used to compare graph objects temporarily in the test. Will be removed.

ProtoStuffTest writes the graph out to a byte buffer, then copies that byte buffer to disk. Likewise, when loading, the graph is decoded from that same byte buffer (without re-loading it from disk). For large graphs we will instead want to stream the serialized representation to disk, otherwise we could easily double peak memory requirements when saving or reloading the graph.

I agree. The byte array was only for the sake of testing. Will be changed to write to file of course. I will clean up and provide a more complete example.

However, when the same approach is applied to read the file using GraphIOUtil.mergeFrom(new FileInputStream("protostuff.file"), edgeInfoFromProtostuff, schema), we get the error ProtobufException: Protocol message was too large. May be malicious. Use CodedInput.setSizeLimit() to increase the size limit.

It is possible to get around that limit. But it could of course be fixed in protostuff.

Generally, I think that graph serialization should be separated from the Graph class. Then, it would be easier to switch between implementations (kryo, protostuff, java). I can provide a suggested solution for the deserialization part. I will also clean up the test with file writing and remove confusing comments.

csolem commented 5 years ago

I have created a new branch for cleaning up and an attempted solution (WIP) for separating graph serialization from the Graph class: protostuff_poc_separation_deserialization

Getting around the protostuff size limit mentioned above can be seen here: https://github.com/entur/OpenTripPlanner/blob/protostuff_poc_separation_deserialization/src/main/java/org/opentripplanner/serializer/ProtostuffGraphSerializer.java#L39

In addition to the fix for size limit, ProtostuffTest now writes to file, no byte[] array. I have tested this by successfully serializing and deserializing a complete Norwegian graph from gtfs and osm data: Main graph size: |V|=3643111 |E|=7908678. A Norwegian graph file size is normally around 3 GB. With protostuff, the file seems to be around 1.9 GB.

With portland gtfs and Oslo minimal osm data (which is configured in ProtoStuffTest), typical log output with time diffs is:

11:44:33.219 INFO (GraphSerializerService.java:79) Serializing graph. Main graph size: |V|=130356 |E|=272649
11:44:34.147 INFO (GraphSerializerService.java:83) Graph serialized in 927 ms
11:44:34.147 INFO (GraphSerializerService.java:40) Reading graph from file: /home/cristoffer/rutebanken/opentripplanner/graph.protostuff ...
11:44:34.643 INFO (GraphSerializerService.java:52) Deserialized graph using: ProtostuffGraphSerializer in 495 ms

I have extracted load/save methods from Graph to a class called GraphSerializerService. This service calls the desired implementation for serialization. The intention is to be able to switch between implementations. This could come handy for testing and comparing different approaches. It might be helpful for migration purposes as well. Features like load levels and debug info are not supported yet. Some fields in Graph are transient and serialized separately. To separate this for Protostuff, the GraphWrapper class is created. This is work in progress. If we think it is a good idea to separate and compare, we can extract these changes to a clean PR.

abyrd commented 5 years ago

Hi @csolem, thanks for your responses and modifications. I think it's a good idea to separate out the serialization from the Graph class, to allow us to try different implementations, thanks for taking that step.

EqualsBuilder in GenericObjectDiffer was indeed used to compare the graph in a test, but that test was still present in the branch. I'm not saying the dependency and test should be removed - we should indeed have a round-trip serialization test that actually does a deep compare of the objects.

About the size limit, in the ProtostuffGraphSerializer.java you linked to, you are using the CodedInput.setSizeLimit() I mentioned in a previous comment - what I was pointing out is that it takes an int parameter, and stores the limit in an int field, and the javadoc says it prevents int overflows in the serialization process.

You have just set the limit to 2 million, which works because it happens to be less than 2^31, but this doesn't solve the limit problem. It looks like it will be impossible to serialize graphs over 2GB, and that Protostuff is not designed to do so.

So it's certainly worthwhile to made some measurements on Protostuff since you've already done the work to set it up, and it does seem to be faster than Kryo with a default configuration, but should we really consider depending on a library that imposes a hard size limit?

csolem commented 5 years ago

So it's certainly worthwhile to made some measurements on Protostuff since you've already done the work to set it up, and it does seem to be faster than Kryo with a default configuration, but should we really consider depending on a library that imposes a hard size limit?

I agree. And it does not seem to be an easy fix to increase the maximum size in Protostuff (protostuff/protostuff#252). Because changing fields from int to long forces many changes in other parts of CodedInput. But at least with the separation of implementations it is easier to switch and choose.

abyrd commented 5 years ago

Features like load levels and debug info are not supported yet.

@csolem I actually think we should remove the load level functionality and possibly the additional debug info, I don't think I have ever used it since it was added and it makes this code more complex.

csolem commented 5 years ago

Ok, @abyrd. I will prepare a clean PR for the separation of serialization. It will only contain the separation part to make it easier to review. Then, the next step could be to add other serializers.

abyrd commented 5 years ago

Hi @csolem, I'm back working on this after a discussion with @t2gran and @gmellemstrand. My OTP branch kryo-serialization now appears to properly serialize and deserialize the OTP graph using Kryo, at least as reflected by manual testing via a web trip planning client. However I'm running into difficulties testing this rigorously. I first attempted using DeepEquals from https://github.com/jdereg/java-util and isEqualToComparingFieldByFieldRecursively() from http://joel-costigliola.github.io/assertj/ but both encountered significant differences between my pre- and post-serialization graphs. This very likely has something to do with harmless differences such as the exact ordering of elements in collections.

Therefore I copied in the org.opentripplanner.common.diff package from your protostuff_poc branch. A comparison using the method outlined in your test GenericObjectDifferTest shows that the OTP Graphs pre- and post-kryo-serialization are identical except for expected differences (loss of vertices that don't have any edges).

I see now though that the complete diffing logic is implemented in this package org.opentripplanner.common.diff, except for the Apache Commons methods for generating equals() methods. Questions for you @csolem:

  1. Is this code copied in from somewhere or did you write it?
  2. Does this code have any tests other than the basic GenericObjectDifferTest that tests an empty graph?
  3. Why did you create this differ class, did you also have trouble with existing library methods for deep object comparisons?

I started adding some tests to verify how this differ works, but was able to create cases where objects test equal when they are not actually equal.

We do need something to test whether our serialization process recreates the same graph. I'm hesitant to perform such tests with our own diffing tools though unless those tools themselves have extensive tests. The ideal would be to use an existing tool if we can find a suitable one.

Any input you can give on the reasons for creating a custom diff tool would be very helpful - I assume there were just features you couldn't find elsewhere.

abyrd commented 5 years ago

So, according to my tests using your custom diff process, round trip serialization through Kryo leaves the graph unchanged. The observed times and sizes for Portland, Oregon data are:

Java Serialization of Portland

save 5.77 sec load 50 sec size 218059204B

Kryo Serialization of Portland

save 4.3 sec (25% reduction) load 3.5 sec (93% reduction) size 169384587B (22% reduction)

If this is correct, it should allow for much faster horizontal scaling since graphs can be loaded 93% faster. This can probably be improved further with more custom serializers, but the improvement is already good and we may want to limit increases in code complexity - it can be challenging to keep custom serializers in sync with the classes they handle (like custom hash code or equals methods).

csolem commented 5 years ago

Questions for you @csolem:

  1. Is this code copied in from somewhere or did you write it?

I wrote it, and copied it from another repository I have been working on. The reason why I copied it, was to try to compare graphs before and after serialization with protostuff. (The original code is located here: https://github.com/entur/tiamat/blob/master/src/main/java/org/rutebanken/tiamat/diff/generic/GenericObjectDiffer.java)

  1. Does this code have any tests other than the basic GenericObjectDifferTest that tests an empty graph?

Yes. https://github.com/entur/tiamat/blob/master/src/test/java/org/rutebanken/tiamat/diff/generic/GenericObjectDifferTest.java But some tests are using internal model classes relevant for that project.

  1. Why did you create this differ class, did you also have trouble with existing library methods for deep object comparisons?

I started adding some tests to verify how this differ works, but was able to create cases where objects test equal when they are not actually equal.

We do need something to test whether our serialization process recreates the same graph. I'm hesitant to perform such tests with our own diffing tools though unless those tools themselves have extensive tests. The ideal would be to use an existing tool if we can find a suitable one.

I agree. It is probably better to find an existing tool to do the diffing. The GenericObjectDiffer tool was added in the poc branch to help us with kryo and protostuff testing, but merging it to OTP master was never the intention. (But if not such tool can be found, this diff tool could be extracted to a separate git repository and improved.)

csolem commented 5 years ago

What about having a closer look at these tools?

abyrd commented 5 years ago

Thanks for the thorough explanation of the GenericObjectDiffer @csolem. If I can't find a suitable replacement, then perhaps it is worthwhile to pull your GenericObjectDiffer out into its own Maven project. But it would need to be released to Maven Central to be very useful in external projects, which might not be worth the extra effort to you.

About the two other libraries you identified:

It says "you can easily write your own introspectors and just plug them in via configuration API" but then I expect we'd just be writing the bulk of the comparison library ourselves.

Your GenericObjectDiffer seems pretty good and meets all the requirements you listed, maybe I can just identify the source of the test failure I produced and fix it.

abyrd commented 5 years ago

Examining the GenericObjectDiffer code, I believe I have found the reason my tests fail. The overall approach is invoke a method that tests the equality of two objects, which then recursively tests for equality of fields or equality of elements in collections.

The top-level entry point is GenericObjectDiffer#compareObjects() which performs a field-by-field comparison of the two Objects it's given. Each field is then compared using identity equality, then as a Collection, or as a Map, using equals() or a generated equals() method, then finally falling back on recursive field-by-field comparison.

The problem is that the top-level object is always compared field by field. If it is a Collection, Map, or object that should be compared with equals() this is not taken into account. Likewise in GenericObjectDiffer#compareCollectionItems() the items in a collection are assumed to be of types that should be compared field by field.

My proposed solution is to separate out the field-by-field comparison strategy from the method that chooses the comparison strategy, essentially inverting these two elements of compareObjects(). All object comparisons should begin by detecting which strategy for comparison will be applied, then applying the strategy, of which field-by-field is one.

abyrd commented 3 years ago

Kryo serialization has completely replaced Java, but we're now running into overflow problems with large graphs (see #3128). My sense is that this is less of a problem in OTP2 which has less vertices and edges (since transit data is tabular), but we may want to look at whether this is resolved in newer versions of Kryo.