sageserpent-open / plutonium

CQRS providing bitemporal object modelling for Java POJOs and Scala too.
MIT License

Persistence - Performance Improvement #60

Closed sageserpent-open closed 5 years ago

sageserpent-open commented 5 years ago

Improve performance of the persistent implementation introduced in #16.

sageserpent-open commented 5 years ago

Under investigation: tuples are being proxied at the moment, and there are a lot of delayed loads made through the proxy to access the '_1' component. Why is this the case, and what would be the impact of not proxying it?
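
For context, here is a minimal, self-contained sketch of the kind of delayed load being described; it is not the project's actual proxy machinery (the real proxy classes are generated at runtime), just a stand-in 'proxy' for a stored Tuple2 that only materialises its underlying value when a component such as '_1' is first touched.

```scala
// Hypothetical stand-in for a tranche-backed proxy; names are invented.
final class ProxiedTuple2[A, B](load: () => (A, B)) {
  private lazy val underlying: (A, B) = load() // deferred until first use
  def _1: A = underlying._1
  def _2: B = underlying._2
}

object DelayedLoadDemo extends App {
  var loads = 0
  val proxy = new ProxiedTuple2(() => { loads += 1; (42, "payload") })
  println(s"loads before touching _1: $loads") // 0 - nothing loaded yet
  println(s"_1 = ${proxy._1}")                 // touching _1 forces the load
  println(s"loads after touching _1: $loads")  // 1
}
```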

sageserpent-open commented 5 years ago

Another thought - as expected, there is a substantial penalty for using synchronous calls to H2. Overall, there is quadratic scaling of the runtime for the benchmark in its form from commit 18a3cd8ed75b0e68d9aa5308c86771f1883b0d91.

Is this connected with Doobie / H2 / poor synchronous use, or is this higher up the stack in 'ImmutableObjectStorage' / Kryo etc?

I should experiment with the benchmark running with a hacked world implementation that uses 'FakeTranches'.

sageserpent-open commented 5 years ago

Hmm - with 'Tuple2' proxying disabled, it takes the world spec ~~5 minutes 30 seconds~~ 6 minutes 47 seconds to complete.

sageserpent-open commented 5 years ago

OK, committed 35c249fee0ae099d7065c9d4ff04c9621c988244 with proxying disabled on 'Tuple2' to see the effects on the Travis build.

Running the benchmark still shows the quadratic scaling; the coefficients are only slightly smaller than at commit 18a3cd8ed75b0e68d9aa5308c86771f1883b0d91.

The next step is to go back to using 'FakeTranches' and to get to the bottom of that quadratic scaling.

sageserpent-open commented 5 years ago

It takes the world spec a mere 4 minutes and 11 seconds when 'H2Tranches' is swapped out for a copy of 'FakeTranches'; this is from commit c0661929235e7d56c24c6e7a6b325d650338828b.

Irrespective of the quadratic scaling, the interaction with H2 needs finessing, badly. This is made all the more obvious by some simple profiling, which shows that H2 is definitely not the bottleneck - it is consuming very little CPU time.

sageserpent-open commented 5 years ago

Benchmarking at commit c0661929235e7d56c24c6e7a6b325d650338828b, there is still a quadratic scaling even with 'FakeTranches'. This doesn't come as a surprise, and points at 'ImmutableObjectStorage' as being a possible suspect.

sageserpent-open commented 5 years ago

Facepalm moment: realised that the optimization to the old 'QuestionableTranches' implementation was never copied over to 'ImmutableObjectStorage.FakeTranches', which has in turn been copied again (!) into 'WorldH2StorageImplementation.FakeTranches' as a temporary measure while investigating. Commit 732829539849549d397e77c871422973bb1c7245 reapplies the optimization across both copies, bringing the time for the world spec down to a mere 2 minutes and 36 seconds.

sageserpent-open commented 5 years ago

Benchmarking the code at commit 732829539849549d397e77c871422973bb1c7245 reveals a definite decrease in the quadratic scaling. Let's see where the residual scaling lies...

sageserpent-open commented 5 years ago

Aha! Looking through a debug session reveals a lot of forced loading of underlying objects by their proxies due to mundane method calls such as '.hashCode' or 'itemClazz' or 'timeOrdering'. Could these be cached when the underlying object is stored?
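
As a rough sketch of that caching idea (names invented, not the real proxy support code): values captured when the underlying object is stored can be answered by the proxy without forcing a load, while anything not cached still falls back to loading.

```scala
// Hypothetical proxy that carries a hash code captured at storage time, so
// that mundane calls such as hashCode no longer force a tranche load.
final class CachingProxy[T <: AnyRef](cachedHashCode: Int, load: () => T) {
  private lazy val underlying: T = load()
  override def hashCode: Int = cachedHashCode // answered without loading
  def resolveUnderlying: T = underlying       // everything else still loads
}

object CachingProxyDemo extends App {
  var loads = 0
  val stored = "some item state"
  val proxy  = new CachingProxy(stored.hashCode, () => { loads += 1; stored })
  println(proxy.hashCode == stored.hashCode) // true...
  println(loads)                             // ...and still no load has happened
}
```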

sageserpent-open commented 5 years ago

Looking at method counts reveals a quadratic scaling via hash trie map use - specifically via the immutable hash map implementation that is used by both 'AllEventsImplementation' and 'BlobStorageInMemory'. At first glance this would appear to be unrelated to 'ImmutableObjectStorage'; does this quadratic scaling exist in 'WorldEfficientInMemoryImplementation' too?

sageserpent-open commented 5 years ago

Thinking about the performance work in #18, omitting the queries proved to be a useful gambit in locating the final source of quadratic scaling.

sageserpent-open commented 5 years ago

More musing ... is it worth proxying any case class? May as well just pull them in wholesale - this is prompted especially by 'UniqueItemSpecification', which has instances all over the place.

sageserpent-open commented 5 years ago

Another musing - when making a query, could the structure of 'BlobStorageInMemory' cause its internal proxies to load across the board for all ids / lifecycles?

sageserpent-open commented 5 years ago

OK, so running the method counts benchmark against the older 'WorldEfficientInMemoryImplementation' reveals linear scaling and the complete absence of calls to hash trie maps in the top ten highest counts.

So the good news is that the common code shared between the two implementations isn't the source of the quadratic behavior - at least not in its own right; perhaps it interacts poorly with 'ImmutableObjectStorage'. Let's revisit those hash trie map calls...

sageserpent-open commented 5 years ago

So, 'WorldEfficientInMemoryImplementation' also makes lots of calls to hash trie maps via 'AllEventsImplementation' and 'BlobStorageInMemory', just not in the vast numbers that 'WorldH2StorageImplementation' does. Perhaps this was a red herring - is there another source of calls peculiar to 'WorldH2StorageImplementation'.... ?

sageserpent-open commented 5 years ago

A cursory look via debugging indicates not - perhaps it is (de)serialization hitting the values of the hash trie map that is causing the churn. This would indeed fit what is seen in the method counts, where scala.collection.immutable.HashMap$HashTrieMap.bitmap() and scala.collection.immutable.HashMap$HashTrieMap.elems() are being hammered.

sageserpent-open commented 5 years ago

Running the method counts benchmark on 'WorldH2StorageImplementation' with querying disabled still reveals quadratic scaling via scala.collection.immutable.HashMap$HashTrieMap.bitmap() and scala.collection.immutable.HashMap$HashTrieMap.elems().

This leads me to suspect that any loading of a tranche for a timeline is forcing wholesale loading of other tranches - this may be a case of the structure sharing between tranches acting pathologically, where a simple reference to something in another tranche leads to everything in the other tranche being loaded, which somehow cascades to the next tranche and so on...

sageserpent-open commented 5 years ago

One source of the quadratic scaling is a 'tranche avalanche', whereby the Chill collection serializer, on deserialization, starts inserting elements one by one into a collection builder. For a hash map, this requires that each newly added element be queried for its hash code, thus forcing its proxy to load its underlying object. Not only can this recurse, it also causes incidental loading of collections stored elsewhere in the tranche that are needed to resolve the underlying object, which causes still more recursive loading to take place.
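
Here is a self-contained illustration of that mechanism - it is not the actual Chill / Kryo code, just stand-in proxies whose hashCode forces a (counted) load, plus the element-by-element rebuild of an immutable hash map that a collection serializer performs on deserialization.

```scala
import scala.collection.immutable.HashMap

object AvalancheDemo extends App {
  var forcedLoads = 0

  // Stand-in for a proxied element: hashing or comparing it forces a load.
  final class ProxiedKey(load: () => String) {
    private lazy val underlying: String = { forcedLoads += 1; load() }
    override def hashCode: Int = underlying.hashCode
    override def equals(other: Any): Boolean = other match {
      case that: ProxiedKey => underlying == that.underlying
      case _                => false
    }
  }

  val entries =
    (1 to 10).map(index => new ProxiedKey(() => s"item-$index") -> index)

  // Rebuilding the map element by element hashes each key as it is inserted...
  val builder = HashMap.newBuilder[ProxiedKey, Int]
  entries.foreach(builder += _)
  builder.result()

  // ...so every underlying object gets loaded, even though nothing was queried.
  println(s"forced loads: $forcedLoads") // 10
}
```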

As of commit ebfe12ad5af587be27a9d80cd3d8ed2bef63136e, the Chill serializers have been omitted and only the default Kryo ones are used to circumvent this. This brings the world spec time down to 2 minutes, 22 seconds, which is definitely progress...

sageserpent-open commented 5 years ago

It looks like the tranche avalanche has been circumvented - debugging and logging shows an absence of any proxies being loaded in the context of another proxy load from a referring tranche.

There are a lot of requests to load hash maps which seem to be unavoidable - these stem from the code in 'TimelineImplementation' that has to load these maps to calculate update timelines, just as it does for 'WorldEfficientInMemoryImplementation', which has linear scaling.

Currently there is still some quadratic dependency via underlying object loading. One possibility is that this is caused by the constructor precondition checking for 'AllEventsImplementation', which scans through a bounded sample of lifecycles and thus forces loading. The bound is 100 lifecycles; this is of the same order of magnitude as the number of items dealt with by the benchmark for sizes up to 500 steps, so perhaps that quadratic term is due to the number of lifecycles being scanned 'filling up' to 100 as the number of steps goes up.

If so, then reducing the sample size bound to say, 5, should reveal linear scaling.
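
To make the suspicion concrete, here is a hedged sketch of the general shape of such a precondition; 'Lifecycle', 'isConsistent' and the class layout are invented and will differ from the real 'AllEventsImplementation'.

```scala
object BoundedPreconditionSketch {
  // Touching a lifecycle here stands in for forcing its proxy to load.
  trait Lifecycle { def isConsistent: Boolean }

  class AllEvents(lifecycles: Seq[Lifecycle], lifecycleSampleBound: Int = 100) {
    // Only a bounded sample is scanned, but every lifecycle inspected must be
    // loaded; while the total number of lifecycles stays below the bound, the
    // sample grows with the benchmark size and contributes a quadratic term.
    require(
      lifecycles.view.take(lifecycleSampleBound).forall(_.isConsistent),
      "inconsistent lifecycle detected")
  }
}
```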

sageserpent-open commented 5 years ago

Note to self: consider reverting commit a9a164578ed1ea9007b515d79bbfafabacddfea3 if the cutover to using tree maps to underpin Quiver doesn't yield significant performance benefits?

sageserpent-open commented 5 years ago

Tried reducing the sample size in commit d6ec90aef4d7465f5009cf7d17e978623437a76c.

This did knock down the quadratic dependency a little; perhaps the constructor preconditions in 'AllEventsImplementation.LifecycleImplementation' are also causing trouble?

sageserpent-open commented 5 years ago

More musing: the benchmark shotguns item ids across a whole bunch of events, where the number of item ids is proportional to the benchmark size. This means that the same item id can appear in multiple revisions, and thus be involved in multiple tranches. Perhaps this is causing loading of multiple tranches when the item state update dag is used to perform recalculation?

sageserpent-open commented 5 years ago

Disabling the contracts altogether in 'AllEventsImplementation' and 'AllEventsImplementation.LifecycleImplementation' only knocked a little bit of the quadratic scaling off, so I'm backing out of that line of attack now.

sageserpent-open commented 5 years ago

Tried an experiment in which tranches are preserved wholesale between distinct sessions by parking them in the tranches instance passed to a session. This immediately gives linear scaling, even though new tranches are being stored.

Clearly we can't just store all the tranches in memory (not least because it breaks some of the 'ImmutableObjectStorageSpec' tests concerning tranche corruption), but it does point out that the remaining scaling issue is down to a tranche avalanche.

Is there perhaps some obvious thing that links the tranches in long chains, or should the code be cutover to maintaining a bounded cache of tranches?

sageserpent-open commented 5 years ago

Two ideas:

  1. Cache loaded tranches in a cache of bounded size with an LRU policy (see the sketch after this list).
  2. Introduce the notion of banning inter-tranche references for certain classes, but continue to use local reference counting.
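
A minimal sketch of what idea 1 could look like, built on java.util.LinkedHashMap running in access order; 'TrancheId' and 'Tranche' are placeholders for whatever the real 'Tranches' implementation uses, so this is an illustration of the policy rather than the project's code.

```scala
import java.util.{LinkedHashMap => JavaLinkedHashMap}

// Bounded cache of loaded tranches with least-recently-used eviction.
final class LruTrancheCache[TrancheId, Tranche](maximumSize: Int) {
  private val cache =
    new JavaLinkedHashMap[TrancheId, Tranche](16, 0.75f, /* accessOrder = */ true) {
      override def removeEldestEntry(
          eldest: java.util.Map.Entry[TrancheId, Tranche]): Boolean =
        size() > maximumSize
    }

  // Either hand back a cached tranche or load it and remember it.
  def getOrElseLoad(trancheId: TrancheId)(load: => Tranche): Tranche =
    Option(cache.get(trancheId)).getOrElse {
      val tranche = load
      cache.put(trancheId, tranche)
      tranche
    }
}
```
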
sageserpent-open commented 5 years ago

Interesting - running the benchmark and looking at the number of tranches loaded in each session, it appears that this scales linearly with the number of revisions when making revisions, and also linearly, albeit with a massive scatter, when making queries.

Is this really down to the nature of the benchmark, or is something still forcing inter-tranche references to be loaded?

sageserpent-open commented 5 years ago

OK, done a lot of work on this and found option 2 to be a bad idea.

Option 1 morphed into caching of objects between sessions by both reference id and by tranche id, and that has yielded fruit...

sageserpent-open commented 5 years ago

So, here is a graph of benchmark time versus the number of revision / query steps in the benchmark for commit eacb316c497bd7f592e0fcdee5044c4c55ee6c64:

[Graph: TimingsFor_eacb316c497bd7f592e0fcdee5044c4c55ee6c64]

This is using 'FakeTranches' as the backend.

sageserpent-open commented 5 years ago

So a revision followed by a single query takes 17 milliseconds per step.

sageserpent-open commented 5 years ago

TODO: a lot of bugs in 'ImmutableObjectStorage' were again only noticed when running the world tests. Need to do another round of test catch-up in 'ImmutableObjectStorageSpec'.

One obvious thing to do is to make sure that objects hang around between sessions with a mixture of references to proxies whose underlying objects have not yet been loaded and references to non-proxied objects or proxies with loaded underlying objects.

I'll check through some of the other test failures seen on the Travis build to see whether 'ImmutableObjectStorageSpec' picked up on them too.

sageserpent-open commented 5 years ago

Got some cleanup of the code to do, plus switching back to H2...

sageserpent-open commented 5 years ago

Oh ... and there are all of the ignored tests to sort out. Fortunately they are broken for a good reason, namely that the caching is working around the tranche retrieval failures provoked by the tests. The solution is to make the caching configurable so that these tests can disable it, or at least thwart its ability to mask the specific corruptions the tests generate.

sageserpent-open commented 5 years ago

Unfortunately, switching back over to H2 for the tranches implementation gives us a quadratic dependency:-

[Graph: TimingsFor_9c075ced4323b92cf73bd8e714cf403b5ca01af4]

sageserpent-open commented 5 years ago

This is despite having defined an index on the object reference table in H2 to optimize the rather obvious bottleneck of querying for the maximum object reference id.
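
For reference, a hedged sketch of the kind of index and query being talked about, run against an in-memory H2 database via plain JDBC; the table and column names are invented here and won't match the real schema in 'H2Tranches'.

```scala
import java.sql.DriverManager

object MaxReferenceIdSketch extends App {
  val connection =
    DriverManager.getConnection("jdbc:h2:mem:tranches;DB_CLOSE_DELAY=-1", "sa", "")
  val statement = connection.createStatement()

  statement.execute(
    "CREATE TABLE ObjectReference (referenceId BIGINT NOT NULL, trancheId BIGINT NOT NULL)")

  // The index intended to make the max-reference-id lookup cheap.
  statement.execute("CREATE INDEX ReferenceIdIndex ON ObjectReference (referenceId)")

  // The 'rather obvious bottleneck': finding the highest reference id so far.
  val resultSet =
    statement.executeQuery("SELECT MAX(referenceId) FROM ObjectReference")
  resultSet.next()
  println(s"maximum object reference id so far: ${resultSet.getLong(1)}")

  connection.close()
}
```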

sageserpent-open commented 5 years ago

Musing again, it is clear that each tranche can contain a lot of binary data, as it is taken from an entire timeline object, which includes blob storage and all events. Writing and storing this binary data is probably a bottleneck when interacting with H2: whereas the data is simply held in memory and referenced when using the fake tranches implementation, even the in-memory variant of H2 has to copy the data into its own storage model.

Currently benchmarking method counts again to check this.

Two things spring to mind to make the tranches more lithe if this is the case:

  1. Lift the calculation of the timeline for a new revision into the session monad throughout the existing code, so that there are more, finer-grained tranches.

  2. Use a parallel storage mechanism for blob storage, rather than sticking with an in-memory implementation that is stored by 'ImmutableObjectStorage'.

sageserpent-open commented 5 years ago

Odd - as of commit 86f0624cb68f66abe635c5eaab285efa949c46e6, it seems that all of the code from Plutonium, from Doobie, from Scalacache, from Kryo, from H2 and from JDBC either scales linearly in method call counts, or makes a trivial contribution.

However, something is responsible for a large quadratic scaling of the number of constructions of instances derived from scala.Product.

These could be proxies, which perhaps aren't being intercepted by the method counting as their classes are generated at runtime.

Maybe 'Tuple2' and company are connected with this?

If so, I would expect to see a similar problem when running against the fake tranches implementation - currently investigating this.

sageserpent-open commented 5 years ago

Interesting - using fake tranches, the method counts for construction of scala.Product drop out of the highest 40 counts after 1250 steps, in a benchmark run going up to 1800 steps. All of the highest 40 method counts scale linearly from 700 to 1800 steps.

Given that it is a Scala product instance that is the quadratic offender, it is not going to be H2 or JDBC that is the culprit (as they are implemented in Java), especially given that these do not call back into Scala code.

This also seems to imply that proxying of tuples isn't the problem.

sageserpent-open commented 5 years ago

Given that the quadratic behavior is not exhibited when using fake tranches, but is when using Doobie / H2 tranches, and given the earlier speculation about the overhead of copying binary data even into an in-memory H2 implementation, perhaps it is worth implementing a Redis tranches implementation, or using ScalikeJDBC or whatever to talk to H2, as a rather drastic way of dividing up the search space for this quadratic scaling...

sageserpent-open commented 5 years ago

The latest insight is that the constructor calls on scala.Product look like they are dominated by contributions made by constructing various subclasses of cats.effect.IO.

Now, cats.effect.IO is used for resource acquisition in Plutonium, but given that this isn't a problem when using fake tranches (but keeping the resource acquisition enabled), it points to Doobie being the source, as it uses cats.effect.IO heavily in its API.

This could of course be down to the rather naive way that H2Tranches drives Doobie, making synchronous calls that harvest the results of updates / queries out of the IO monad. Doobie expects client code to stick in the IO monad as long as possible, so there is an obvious antipattern in H2Tranches.

The question is: would it be better to somehow coerce H2Tranches and friends into IO, or to use another synchronous library to interact with H2? Perhaps ScalikeJDBC or even, shudder, straight JDBC?
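
To spell out the contrast, here is a hedged sketch assuming the cats-effect 2 era Doobie API that fits the timeframe of this thread, with an invented 'tranche' table; the transactor is taken as given rather than constructed, and none of this is the actual H2Tranches code.

```scala
import cats.effect.IO
import doobie._
import doobie.implicits._

object DoobieUsageSketch {
  // The antipattern in question: every call escapes the IO monad straight
  // away, paying a synchronous round trip and its own transaction per statement.
  def trancheCountSynchronously(transactor: Transactor[IO]): Long =
    sql"select count(*) from tranche"
      .query[Long]
      .unique
      .transact(transactor)
      .unsafeRunSync()

  // The intended style: compose the statements as one ConnectionIO program and
  // only transact (and eventually run) the whole thing once.
  def storeAndCount(payload: Array[Byte],
                    transactor: Transactor[IO]): IO[Long] = {
    val program: ConnectionIO[Long] = for {
      _     <- sql"insert into tranche (payload) values ($payload)".update.run
      count <- sql"select count(*) from tranche".query[Long].unique
    } yield count

    program.transact(transactor)
  }
}
```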

sageserpent-open commented 5 years ago

Reviewing 'ImmutableObjectStorage', it looks feasible to cutover 'Tranches' to work with 'IO', which would make 'H2Tranches' play far better with Doobie.

What I'm hoping is that I can use an execution context / IO shifting to move in and out of non-monadic code in the reference resolver implementation.

In the meantime, a side-jaunt to investigate scaling against a non-Doobie implementation of H2Tranches seems in order to see whether the promised linear scaling is possible at the end of the anticipated big cutover...

sageserpent-open commented 5 years ago

Interesting ... changed the number of steps in the method call count benchmark to range from 1750 to 2500 steps, and what is revealed is linear behaviour. It looks like the quadratic behavior seen before settles down at higher scales, at least for method counts.

Again, the leader is the constructor for scala.Product, followed by that of cats.effect.IO, which both stand out from a volley of things like the apply and constructor for cats.effect.IO.Pure and various other Cats bits and pieces.

H2 comes in after these and at about the same level as usage of scala.util.Either etc.

Results from commit 1ec00e5b8fe98ce887dfed7e4ae8a73a3f0209d1:

[Graph: mostlyCatsMethodCallCounts]

sageserpent-open commented 5 years ago

Good news - as of commit 4e916111dd53b4cac8b37ace1c08cf9a7a51c1cd, whereby ScalikeJdbc is used in lieu of Doobie, the method call counts have lost the huge number of calls to scala.Product via Cats.

What we have in the lead is scala.Some.value() neck and neck with scala.Some.get() - presumably pattern matching calls, followed by kryo.io.Input.require() and kryo.io.Input.readVarInt().

All are nicely linear:

[Graph: scalikeJdbcMethodCallCounts]

There is still the possibility of refactoring to drive Doobie in the way that it's supposed to be used, but for now I'm going to take the easy road and stick with ScalikeJDBC - there are other things to do that are more urgent, running the time benchmark being one of them...
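
For the record, here is a hedged sketch of the kind of plain, synchronous access ScalikeJDBC gives over an in-memory H2 database; the 'tranche' table and column names are invented and won't match the real 'H2Tranches' schema.

```scala
import scalikejdbc._

object ScalikeJdbcSketch extends App {
  // Register the H2 driver and point the connection pool at an in-memory database.
  Class.forName("org.h2.Driver")
  ConnectionPool.singleton("jdbc:h2:mem:tranches;DB_CLOSE_DELAY=-1", "sa", "")

  DB autoCommit { implicit session =>
    sql"create table tranche (id identity primary key, payload blob not null)"
      .execute.apply()
  }

  val payload: Array[Byte] = "serialized tranche data".getBytes("UTF-8")

  // Store a tranche and get back its generated id - a plain, synchronous call.
  val trancheId: Long = DB localTx { implicit session =>
    sql"insert into tranche (payload) values ($payload)"
      .updateAndReturnGeneratedKey.apply()
  }

  // Retrieve it again, equally synchronously.
  val retrieved: Option[Array[Byte]] = DB readOnly { implicit session =>
    sql"select payload from tranche where id = $trancheId"
      .map(_.bytes("payload")).single.apply()
  }

  println(retrieved.map(_.length))
}
```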

sageserpent-open commented 5 years ago

First look at timing results from commit de91ef885bec1477218f121aa0903c04313592a2:

[Graph: TimingsFor_de91ef885bec1477218f121aa0903c04313592a2]

Linear - 34 milliseconds per step.

sageserpent-open commented 5 years ago

Not so fast: extending the number of steps in the benchmark has revealed that performance drops severely once a certain number of steps is reached, about 5000 for commit de91ef885bec1477218f121aa0903c04313592a2.

As an example, here are timings for commit abf85a70bce72fac8749a5c9e1180f0b4e50bcfd:

[Graph: HockeyStick_abf85a70bce72fac8749a5c9e1180f0b4e50bcfd]

sageserpent-open commented 5 years ago

A couple of things to bear in mind with the above graph:-

  1. The caches for top-level objects and for objects by reference id were jemmied so as not to flush at all. If they are set to flush, performance instead grinds to a complete halt once the caches start to evict entries.

  2. There was still headroom for the garbage collector, but at the point where the curve kinks upwards sharply, the garbage collector's CPU usage started to remain consistently high as opposed to spiking periodically, as it did through the first part of the curve.

sageserpent-open commented 5 years ago

The latest idea is to cache completed operation data by tranche id in 'Tranches', thus keeping hold of the top level object and the reference resolver for each cached entry, rather than using a global cache of objects.

This has simplified the implementation of 'ImmutableObjectStorage', and if queries are disabled, yields excellent performance:

[Graph: 5fca0fb8b779f3f26acb50b6785aba9ae0f1aa48 (no querying)]
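
A bare-bones sketch of the caching shape described in this comment; 'TrancheId', the resolver type and all of the names here are invented, the point being that each cache entry retains the completed operation's top-level object together with its reference resolver, rather than there being one global cache of individual objects.

```scala
import scala.collection.mutable

final class CompletedOperationCache[TrancheId, ReferenceResolver] {
  final case class CompletedOperation(topLevelObject: AnyRef,
                                      referenceResolver: ReferenceResolver)

  private val completedOperations =
    mutable.Map.empty[TrancheId, CompletedOperation]

  // Record the outcome of loading (or storing) a tranche.
  def noteCompletedOperation(trancheId: TrancheId,
                             topLevelObject: AnyRef,
                             referenceResolver: ReferenceResolver): Unit =
    completedOperations.update(
      trancheId, CompletedOperation(topLevelObject, referenceResolver))

  // Later sessions can pick up both pieces by tranche id.
  def topLevelObjectFor(trancheId: TrancheId): Option[AnyRef] =
    completedOperations.get(trancheId).map(_.topLevelObject)

  def referenceResolverFor(trancheId: TrancheId): Option[ReferenceResolver] =
    completedOperations.get(trancheId).map(_.referenceResolver)
}
```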

sageserpent-open commented 5 years ago

In contrast, if queries are enabled, the performance drops radically once the cache needs to refill:

[Graph: QueriesEnabled]

The slope of the latter part of the curve is ~700 milliseconds per revision; contrast that with the 8 milliseconds per revision of the graph above.

sageserpent-open commented 5 years ago

Given that the queries are all with respect to the latest revision, this raises the question of how many tranches are being loaded, and why. Perhaps this is simply because the proxied items returned to the client code as a result of queries do eager loading from the blob storage.

sageserpent-open commented 5 years ago

As of commit c7836ff82ef5b6fcfc8bb9d1628af79f6c6a94e1, we have an overall graph of:

[Graph: c7836ff82ef5b6fcfc8bb9d1628af79f6c6a94e1-Overview]

The latter part of the graph is:

[Graph: c7836ff82ef5b6fcfc8bb9d1628af79f6c6a94e1-Endgame]

That gives a final scaling of 264 milliseconds per revision. Not yet time to break open the champagne, but a step in the right direction...

sageserpent-open commented 5 years ago

Actually, there is a slight quadratic dependency there:

[Graph: c7836ff82ef5b6fcfc8bb9d1628af79f6c6a94e1-Endgame]