Commit 7246d57a09426723978a760579df0ad3b17c289f. This builds on the one just before by using Redis to provide a tranches implementation, but still using 'BlobStorageOnH2'.
Commit 2b7c84996ba9f4e2545d6af33c0864826a1147b0. This uses FakeTranches (in-memory), but still using 'BlobStorageOnH2'.
Note that by 250 000 revisions the process had reached a limit of 10G of RAM usage, so it had to be terminated.
The latest conclusions are that there is some good in splitting out an extra database table (and this has also yielded noticeably quicker Travis builds to boot). Shadowing the blob storage implementation didn't yield any benefits. Using Redis for the tranches implementation didn't yield any real benefit, but did consume a colossal and ever-increasing amount of memory as the benchmark force-fed it tranches without purging keys. Using the fastest tranches implementation - FakeTranches - didn't really help, and made it apparent just how much memory would be eaten up by tranches: they have to be stored out of memory, and Redis won't cut it.
It may still be worth using Redis to explore a non-SQL blob storage mechanism, just to see the effects on scaling. The work done so far can also be used for testing other non-SQL solutions, such as HBase.
One thing - adding in logging has shown that the benchmark causes tranche loading to settle down to a very healthy trend of only loading the 'BlobStorageOnH2' objects for queries, which in turn have a constant overhead of 163 bytes for the codebase as of commit 2b7c84996ba9f4e2545d6af33c0864826a1147b0. This is punctuated by the odd load of one of the bits and pieces of a timeline, but these become more and more infrequent as the benchmark progresses, so tranche loading is off the hook regarding quadratic scaling.
Another - more logging shows that the size of the tranches stored in each step approaches a maximum for each of the three classes of tranche stored by a timeline - the graph, all events and blob storage on H2. By around step 11500, the number of tranches in each class that bests the previous maximum has essentially dropped to zero. So the tranche size on storage is also not a contributor to the quadratic scaling.
On that basis, it now looks to be down to:
Finessing the blob storage implementation - could use Redis for this as a proof of concept, and move over to HBase.
Checking for a quadratic dependency in 'ImmutableObjectStorage' or in Kryo.
'Timer' was introduced in commit 49e566c0215495992eb364e364cacdfc4b973e93 to allow individual blocks of code to be timed cumulatively with occasional sampling. The cumulative timings are partitioned by a stack-based set of categories set by areas of code further up the call stack leading to the timed block, so one can distinguish for instance between retrieving a tranche in the context of revising versus annulling, or fetching a snapshot in querying versus revising.
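To make the idea concrete, here is a minimal sketch (the names are illustrative, not the actual 'Timer' API) of cumulative timings keyed by a stack of categories pushed by enclosing code:

```scala
import scala.collection.mutable
import scala.util.DynamicVariable

// Illustrative sketch only - not the actual 'Timer' from commit 49e566c.
object TimerSketch {
  // Categories pushed by code further up the call stack, e.g. "revising" or "annulling".
  private val categoryStack = new DynamicVariable[List[String]](Nil)

  // Cumulative elapsed nanoseconds, partitioned by the set of active categories.
  private val cumulativeTimings =
    mutable.Map.empty[Set[String], Long].withDefaultValue(0L)

  def withCategory[A](category: String)(block: => A): A =
    categoryStack.withValue(category :: categoryStack.value)(block)

  def timed[A](block: => A): A = {
    val start = System.nanoTime()
    try block
    finally cumulativeTimings(categoryStack.value.toSet) += System.nanoTime() - start
  }

  // Occasional sampling of the accumulated timings.
  def sample(): Map[Set[String], Long] = cumulativeTimings.toMap
}
```

So a tranche retrieval timed via 'TimerSketch.timed' inside 'TimerSketch.withCategory("revising") { ... }' accumulates separately from the same retrieval timed under "annulling".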
Looking at this from that commit onwards, it has become clear that the benchmark itself was not entirely fair with the SUT - the benchmark was:
The result of these issues was to potentially cause progressively longer recalculations as the benchmark progressed through its steps (due to both issues in combination, although with interrelationships between items it is possible to get poor scaling due to recalculation having to go through chains of related items).
There is another problem in that there is code in 'AllEventsImplementation.revise' that scales poorly as the average size of an item lifecycle increases, which is triggered by the second issue.
This wasn't noticed when working with the 'WorldEfficientInMemory' implementation, presumably because in all previous benchmark runs the number of steps was kept below 10 000, either to allow Scalameter to complete in less than geological time or, more recently, to avoid running out of memory.
Here are some graphs of cumulative time in milliseconds sampled every 50 steps, prior to changing the benchmark...
WorldH2StorageImplementation, revision time:
WorldEfficientInMemoryImplementation, revision time:
WorldEfficientInMemoryImplementation, broken down to AllEventsImplementation.revise:
Oh dear.
WorldEfficientInMemoryImplementation, broken down to performing incremental recalculation:
The main culprit is actually 'AllEventsImplementation.revise', whose average time per revision scaled linearly with the number of steps the benchmark performed.
Note that the issue of recalculation didn't seem to be a problem here, so perhaps the old versions of the benchmark were making enough corrections to a given event id to gradually move the event forward in time, thus avoiding the issue?
With the benchmark changed as of commit 5b8dfb1427cca93b9442ef6d600df4bc841e4f4b, it now uses event ids that are randomly selected from a sliding window, as are the item ids referred to in the event. This bounds the number of corrections made to a given event, and also bounds the number of events that can refer to a given item for a given world history revision.
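A minimal sketch of the sliding-window selection (the shape is an assumption, not the benchmark's actual code):

```scala
import scala.util.Random

// Pick an id from the most recent 'windowSize' ids seen so far, so that corrections to any
// given event id (and references to any given item id) stay bounded as the steps accumulate.
def idFromSlidingWindow(step: Int, windowSize: Int, random: Random): Int = {
  val windowStart = math.max(0, step - windowSize)
  windowStart + random.nextInt(math.min(windowSize, step + 1))
}
```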
This leads to...
5b8dfb1427cca93b9442ef6d600df4bc841e4f4b Overall revision time in 'ImmutableObjectStorage.unsafeRun':
5b8dfb1427cca93b9442ef6d600df4bc841e4f4b Timeline.revise:
5b8dfb1427cca93b9442ef6d600df4bc841e4f4b AllEventsImplementation.revise:
As an experiment, substituting in BlobStorageInMemory in lieu of BlobStorageOnH2 in commit 5b8dfb1 yielded this for the overall revision time in 'ImmutableObjectStorage.unsafeRun':
Looks like BlobStorageOnH2 is the way to go.
After some longer benchmark runs and experimenting with reintroducing the MVStore backend for H2, it is now quite clear that using the MVStore causes dramatic livelocks starting at around 300 000 revisions. These are due to the on-the-fly compaction thread locking the database for extended periods; we are talking over a minute's duration here. Searching the Internet reveals that this is a known issue, see here: https://github.com/h2database/h2database/issues/1570.
To round off the graphs, here is the timing benchmark for commit 2fc246cf47c52dd812c42846c3478f235afa4ca2; this one uses the PageStore as the backend for H2, in light of the observation above:
Looking at the cumulative number of fetches of snapshots from blob storage, the cumulative number of incremental recalculations and the transitive closure size of the thing queried for at each step of the benchmark revealed interesting insights; these are based on commit b930759e0676304a57f2bae961a50e78f9b67bed ...
The number of recalculations is pleasingly linear, whereas there is a slight quadratic dependency in fetching snapshot blobs. Investigation of the latter showed that as the benchmark progressed, it created chains of inter-thing dependencies that got longer and longer, so reconstituting a thing from blob storage tended to pull in increasingly larger numbers of other things in the chain, as items are eagerly reconstituted.
This is seen in the last of the above three graphs, where there is an upwards trend in the transitive closure size of queried things as the benchmark progresses.
So what we have here is an increasing input cost that adds a quadratic dependency to an otherwise linear recalculation algorithm.
Commit eef3bb5a20b3f262a3956c93e226e362ac71925c changes the benchmark so as to work in a window of thing ids that hops periodically from one range to another contiguous, non-overlapping range. So as long as the ids are chosen from within the same window, chains of inter-thing dependencies can get longer, but as the window hops, those chains are left at their maximum size, and new chains are started.
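For illustration, a sketch of the hopping-window selection (parameter names are assumptions, not the benchmark's actual code):

```scala
import scala.util.Random

// The window of thing ids stays put for 'stepsPerHop' steps, then hops to the next
// contiguous, non-overlapping range; dependency chains can only grow within one window.
def idFromHoppingWindow(step: Int, windowSize: Int, stepsPerHop: Int, random: Random): Int = {
  val windowStart = (step / stepsPerHop) * windowSize
  windowStart + random.nextInt(windowSize)
}
```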
This results in the following behavior...
For blob storage snapshot fetching, the quadratic dependency has a coefficient that is smaller by two orders of magnitude and the linear dependency coefficient has gone from ~50 to ~4. Much better.
The transitive closure size's linear increase has also dropped by two orders of magnitude, and the distribution of sizes is much tighter, dropping rapidly in frequency once the chain length gets larger than 3. This is reasonable, as we don't expect realistic client code to relate an item to an ever increasing number of older items, at least not for an MVP. It is conceivable that this caveat could be relaxed in future by lazily loading referenced items from blob storage, but for now this is good enough.
Commit eef3bb5a20b3f262a3956c93e226e362ac71925c was cherry-picked on to the main DAG of development commits as commit a8c9e39bc7dce554529f1e2cb06bd86063959f09.
Armed with the revamped benchmark, what do we get?
Commit 5825b04c1ec336dcb5f36e8d5f809a971fd8e725 ...
Commit 21f349bd198c4aded716ee496b597cff7854c7c5 ...
Commit 1dffcb8cca285a668f56b69bb94d315e983a9cee ...
Based on inspection of the memory profile via VisualVM as the benchmark is running, and also by using the Windows Resource Monitor to get a feel for the rate of disk reads and writes, it is apparent that from zero steps up to around 500 000 steps, the benchmark gradually claims more and more of the memory allocated to the JVM; garbage collection is taking place, but the maximum memory claim goes up in spikes. While this is happening, there is a high rate of disk writes, whereas the rate of disk reads, while non-zero, is for all intents and purposes so low that it registers as zero in Resource Monitor.
Once the benchmark gets past 500 000 steps, the memory usage tops out, with garbage collection periodically (and quite regularly) reclaiming very large amounts of the memory used by the JVM. No OOM errors are observed; all subsequent data structure changes fit nicely into the available memory. The disk read rate increases and the disk write rate drops; these settle down to more or less constant values. The rate of writing is significantly higher than the rate of reading.
This is almost certainly due to the cache for proxies managed by 'ImmutableObjectStorage' filling up - it is a weak cache, so as the JVM approaches the memory limit, the cache starts to purge entries. Once enough proxies are purged, subsequent attempts to resurrect them will cause tranche loading, thus explaining the disk read pattern.
At this point, we have the following:
While the linear coefficient of scaling is a whopping 80-90 milliseconds per step, this can be addressed separately as an optimization task; for one thing, the manner in which I/O is carried out could be finessed, more caching could be employed, blob storage implementations could be ganged together, items could be loaded lazily, Kryo usage could be optimized, and so on. The point here is to achieve a reasonable trend. While there is still a quadratic coefficient, it could be argued that this is perhaps an N * log(N) dependency being modelled as a quadratic, or that there is still some characteristic of the benchmark that is causing it. There is also the possibility that 'AllEventsImplementation' is the cause, as its internal hash map gets larger to accommodate more and more lifecycles - so the tranches get larger, and once the benchmark settles down into its fully occupied memory behaviour, we would expect to see an increasing hit on tranche loads, even though the hash map is spread across several tranches.
Given that the hash map implementation uses trees that have a very high branching factor (with a branching factor of 32, for instance, a map of a million entries is only about four levels deep), I would imagine that this may not be a significant problem once enough revisions have been processed; in other words, the quadratic behaviour should disappear once a sufficient number of revisions are made.
So rather than spend yet more time chasing this, the decision is to declare this issue complete, warts and all, and move on to more pressing work. Any further optimization / investigation work can go into its own issue, motivated by a more realistic use-case.
Ah - one thing, what happens nowadays if queries are omitted? There are results above for up to 1 000 000 steps, but this was before the benchmark was revamped. I should do that again...
Reopening to add discussion of subsequent work...
Removing queries revealed some residual non-linear timing dependency.
This was initially investigated by looking at the number of cached associations from objects to their reference ids, from reference ids to object proxies and from tranche ids to completed operations.
As of commit 1e1faddbfd02e9a1daa00c1c4a8effb16c1784f5, this revealed a steady increase in the number of object to reference id associations, as well as the number of reference id to proxy associations; in addition the number of completed operations (and objects and proxies) all exploded at around 250 000 revisions.
Number of cached object to reference id associations:
Number of reference id to proxy associations:
Number of tranche id to cached completed operations:
As the Caffeine cache implementation clears the cache via a background thread, it was hypothesised that it might not be catching up with the workload, possibly due to its default thread pool being starved by other libraries.
Note that in the previous graphs for commit 1e1faddbfd02e9a1daa00c1c4a8effb16c1784f5, the caches were being periodically cleared in an attempt to force cleanup; this clearly wasn't working.
Commit af6f546a4a90f7111599b1a4992efb0d112de26e swapped out Caffeine for the Guava cache implementation, and also dropped the periodic forced clearance.
Number of cached object to reference id associations:
Number of reference id to proxy associations:
Number of tranche id to cached completed operations:
This has clearly fixed the explosion issue, but there is still the linear trend in cached objects and proxies, which is unacceptable if the memory usage is to be kept bounded.
The linear trend turned out to be caused by the caching of completed operation objects on storage by 'ImmutableObjectStorage'. Doing this holds on to 'fresh' objects that were created in a session rather than loaded from an existing tranche; even though successive stored tranches will correctly partition up the graph of live in-memory objects, those in-memory objects will never have been restored from a tranche, so no proxies are generated, and thus, as more revisions are made, the in-memory object graph gets bigger and bigger. Of course, reloading tranches due to previous cache eviction mitigates this somewhat, but even then, a request for an inter-tranche reference can be resolved via a completed operation that contains fresh objects rather than reloaded ones.
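As a toy illustration of the leak pattern (nothing to do with the actual 'ImmutableObjectStorage' code - the names are made up), caching per-revision results that strongly reference fresh objects pins the whole chain of prior revisions in memory:

```scala
import scala.collection.mutable

// Each revision's fresh node references the previous revision's node directly, so caching
// the per-revision result keeps every earlier node reachable - no proxies ever cut the graph.
final case class FreshNode(payload: Array[Byte], predecessor: Option[FreshNode])

val completedOperationCache = mutable.Map.empty[Int, FreshNode]

var latest: Option[FreshNode] = None
for (revision <- 1 to 1000) {
  val node = FreshNode(Array.ofDim[Byte](1024), latest)
  completedOperationCache(revision) = node
  latest = Some(node)
}
// The cache now pins around a megabyte of payload, growing linearly with the revision count.
```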
Commit efc78bb73fb9716272401b1dcffbf156db1da3ef fixed this problem.
Number of cached object to reference id associations:
Number of reference id to proxy associations:
Number of tranche id to cached completed operations:
Given that the linear dependency coefficients are an order of magnitude smaller for the object and proxy associations, and are negative for the object associations and the completed operation associations, the working hypothesis is that these residual trends are due to the caching 'settling in'.
It has been confirmed by inspection using Visual VM that the memory footprint remains bounded, alternating between a roughly constant maximum and a roughly constant minimum due to garbage collection, looking at the results for commits from efc78bb73fb9716272401b1dcffbf156db1da3ef onwards.
However, the timings are still non-linear, even with queries omitted:
The focus moved to looking at the number of tranches loaded overall and per 50 revisions, the amount of tranche data loaded overall and the maximum size of tranche loaded to satisfy fetching an underlying object for a proxy.
What was immediately apparent is that the largest tranches loaded to satisfy fetching a proxy's underlying object are always due to the proxy being a hash trie map or more basic hash map, usually a trie.
A brief excursion was made post commit efc78bb73fb9716272401b1dcffbf156db1da3ef to see whether blobs were exhibiting any nonlinear trend; this turned out not to be the case:
Cumulative size of stored blobs:
Given that the size of a stored blob is roughly constant per revision, reading blobs cannot be the cause of the non-linearity.
Commit f476096ff74f9fae29ed7367cae788ca94573f4f yields interesting results.
Cumulative number of tranche loads (both direct and on behalf of proxies):
Cumulative size of tranche loads (both direct and on behalf of proxies):
Maximum size of tranche loaded on behalf of a proxy:
Number of tranches loaded per 50 revisions:
There is a rather pleasing logarithmic scaling to the maximum tranche size loaded on behalf of a proxy, and indeed the number of tranches loaded per revision also scales logarithmically, presumably because there is some balanced tree structure that requires a logarithmic number of nodes to be replaced by new nodes as a new tree is generated.
So the first two curves are not actually quadratic; they are simply best fits to a sum of logarithms plus constant terms, that is: a * (N * ln(N) - N) + b * N (using Stirling's approximation).
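Spelling that out: if the per-revision cost grows like a * ln(k) + b at revision k, then the cumulative cost over N revisions is the sum over k = 1..N of (a * ln(k) + b) = a * ln(N!) + b * N, and Stirling's approximation ln(N!) ≈ N * ln(N) - N gives the fitted form a * (N * ln(N) - N) + b * N.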
So, is this the origin of the non-linear time dependency? Again, this is commit f476096ff74f9fae29ed7367cae788ca94573f4f.
Time taken per 50 revisions, linear and logarithmic fits:
Apart from the slight change in the goodness of fit from 0.0941 to 0.0943 in favour of the linear fit, there is no evidence either way. I'd be wary of choosing one over the other given the order of magnitude of the goodness of fit, ~1e-1, not to mention that the difference is in the third significant figure.
So what can be said with confidence is that:
So, good, but tantalisingly inconclusive regarding the timings...
Screenshot of VisualVM showing memory usage at around 242850 revisions:
Note that the benchmark is heavily I/O bound; CPU usage is much higher for the in-memory world implementation.
It looks like you're doing a really nice job at performance analysis.
> As the Caffeine cache implementation clears the cache via a background thread, it was hypothesised that it might not be catching up with the workload, possibly due to its default thread pool being starved by other libraries.
You can set 'Caffeine.executor(Runnable::run)' to execute work on the calling thread, like Guava does. Since that work is fairly cheap, it should be fine and might help resolve your issues. We try to give good defaults for minimizing response latencies, but also to be configurable, as users know best.
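For concreteness, a minimal sketch in Scala of wiring that in (the key and value types are placeholders, not the actual cache's types):

```scala
import com.github.benmanes.caffeine.cache.{Cache, Caffeine}

// Run Caffeine's maintenance work on the calling thread, as Guava does, instead of on the
// default background executor; values are still reclaimed via weak references by the GC.
val proxyCache: Cache[String, AnyRef] =
  Caffeine
    .newBuilder()
    .executor((task: Runnable) => task.run())
    .weakValues()
    .build[String, AnyRef]()
```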
It sounds like you found and fixed a possible memory leak ('ImmutableObjectStorage'), which could have also been the culprit, such that the default executor would work fine. Since the cache is bounded by weakValues, I would have thought the caching libraries wouldn't have much of an impact, as the heavyweight objects are managed by the garbage collector. Thus, the clean-up work of Caffeine/Guava is only to discard these collected map entries, which has an insignificant memory/time overhead.
If possible, I wouldn't rely too heavily on the GC for bounding a large cache. It is a good strategy for smaller sizes, but it may not be aggressive enough and can cause performance issues due to longer GC cycles when under load with large caches. A maximumSize / maximumWeight offers better predictability, lower GC cost, and often a better hit rate. However, that might be hard since your proxies are loaded lazily, so a weight cannot be calculated upfront when the storage object is retrieved. In that case you might consider having the proxies fetched through their own cache, which can then be sized appropriately, rather than retaining them on the storage object explicitly.
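A sketch of that alternative - a dedicated, count-bounded cache for the proxies (the key type and size limit are illustrative assumptions):

```scala
import java.util.UUID
import com.github.benmanes.caffeine.cache.{Cache, Caffeine}

// Hypothetical key identifying an object within a tranche.
final case class ProxyKey(trancheId: UUID, objectReferenceId: Long)

// Bound the proxy cache by entry count rather than relying on weak values and the GC.
val boundedProxyCache: Cache[ProxyKey, AnyRef] =
  Caffeine
    .newBuilder()
    .maximumSize(100000L)
    .build[ProxyKey, AnyRef]()
```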
Improve performance of the persistent implementation introduced in #16.