Closed keegancsmith closed 9 years ago
I'm pretty noob at this part of the code base, so looking for good critique on architecture/go style/etc for this change. I tried to ensure that this caching doesn't result in any concurrent writes on the indexes, but may have missed something. That is the biggest risk in this change. Otherwise I now have sub 1 second ListExamples
when used on srcgraph.
This makes it so that after an index is consulted, the entire index is cached in memory (in a global var) for the duration of the Go process—right?
This works, but I think there is a simpler solution that would also reduce resource consumption overall (and make it simpler to reuse connections used by the store). In the Sourcegraph code, in graphstoreutil, we instantiate a new FSMultiRepoStore (or similar) for each incoming request. We do reuse the HTTP client, IIRC, but otherwise the whole object hierarchy is recreated. If instead we were able to just reuse the same FSMultiRepoStore for all incoming requests, the indexes would remain cached in memory by virtue of the fact that once they have been read in and are Ready
they are not purged.
This should take a lot less code and might not even require any changes to srclib itself, only to the Sourcegraph graphstoreutil package.
@sqs Their are actually only two places right now in the Sourcegraph codebase were we do write operations on the graphstore we get from the context. That is easy to fix. However, on every query we always go through openTreeStores
in tree_stores.go
. That doesn't store the tree stores anywhere in memory, but is rather just returned. So we do require some sort of change to srclib to enable the performance improvements.
I'm not the happiest person about this change as it stands. I'd like your input on what you think is a better solution. This PR as it stands, or I can imagine adding storing of treeStores
instead. In that case if an instance of FSMultiRepoStore
(or similar) did a query which opened a lot of commits, or it did many different queries over time we would potentially balloon in memory. In Sourcegraph we can control that, but what about users of srclib outside of Sourcegraph? If so then maybe a cache would make more sense at the TreeStore level? I also their is an experimental library for weakrefs which may be of use.
We can assume that users outside of Sourcegraph would either use only ephemeral stores (that'd be GC'd) or they would be savvy users (and would know how to clear the cache or whatever). Does that help?
So we have 3 use cases:
src store
CLI commands, other one-off tasks)Since the latter 2 are mediated through the Sourcegraph codebase, we have a lot of control over caching since it's our code.
Does that help give context? Your call as to what's best.
IndexedTreeStore contains a few indexes used by most queries. In some repos these can be large and the time to read, gunzip and unmarshal can be costly. This commit adds a simple in-memory cache around the indexes. The places where we
get
andput
from the cache are only done where the index is read only, so we do not have to concern ourselves with making the Indexes handle concurrent writes.In a synthetic repo on my MBA some queries took 6s, most of that time was dominated by loading indexes in
IndexedTreeStore
. In particular I generated a repo withthen did the
ListExamples
operation in sourcegraph.