sourcegraph / srclib

srclib is a polyglot code analysis library, built for hackability. It consists of language analysis toolchains (currently for Go and Java, with Python, JavaScript, and Ruby in beta) with a common output format, and a CLI tool for running the analysis.
https://srclib.org
Other
942 stars 62 forks source link

store: In memory cache for IndexedTreeStore #195

Closed keegancsmith closed 9 years ago

keegancsmith commented 9 years ago

IndexedTreeStore contains a few indexes used by most queries. In some repos these can be large and the time to read, gunzip and unmarshal can be costly. This commit adds a simple in-memory cache around the indexes. The places where we get and put from the cache are only done where the index is read only, so we do not have to concern ourselves with making the Indexes handle concurrent writes.

In a synthetic repo on my MBA some queries took 6s, most of that time was dominated by loading indexes in IndexedTreeStore. In particular I generated a repo with

$ srclib gen-data urefs -v -u 500 -f 10 -r 'tmp/genrepo11' --ndefs 10 --gen-source

then did the ListExamples operation in sourcegraph.

keegancsmith commented 9 years ago

I'm pretty noob at this part of the code base, so looking for good critique on architecture/go style/etc for this change. I tried to ensure that this caching doesn't result in any concurrent writes on the indexes, but may have missed something. That is the biggest risk in this change. Otherwise I now have sub 1 second ListExamples when used on srcgraph.

sqs commented 9 years ago

This makes it so that after an index is consulted, the entire index is cached in memory (in a global var) for the duration of the Go process—right?

This works, but I think there is a simpler solution that would also reduce resource consumption overall (and make it simpler to reuse connections used by the store). In the Sourcegraph code, in graphstoreutil, we instantiate a new FSMultiRepoStore (or similar) for each incoming request. We do reuse the HTTP client, IIRC, but otherwise the whole object hierarchy is recreated. If instead we were able to just reuse the same FSMultiRepoStore for all incoming requests, the indexes would remain cached in memory by virtue of the fact that once they have been read in and are Ready they are not purged.

This should take a lot less code and might not even require any changes to srclib itself, only to the Sourcegraph graphstoreutil package.

keegancsmith commented 9 years ago

@sqs Their are actually only two places right now in the Sourcegraph codebase were we do write operations on the graphstore we get from the context. That is easy to fix. However, on every query we always go through openTreeStores in tree_stores.go. That doesn't store the tree stores anywhere in memory, but is rather just returned. So we do require some sort of change to srclib to enable the performance improvements.

I'm not the happiest person about this change as it stands. I'd like your input on what you think is a better solution. This PR as it stands, or I can imagine adding storing of treeStores instead. In that case if an instance of FSMultiRepoStore (or similar) did a query which opened a lot of commits, or it did many different queries over time we would potentially balloon in memory. In Sourcegraph we can control that, but what about users of srclib outside of Sourcegraph? If so then maybe a cache would make more sense at the TreeStore level? I also their is an experimental library for weakrefs which may be of use.

sqs commented 9 years ago

We can assume that users outside of Sourcegraph would either use only ephemeral stores (that'd be GC'd) or they would be savvy users (and would know how to clear the cache or whatever). Does that help?

So we have 3 use cases:

Since the latter 2 are mediated through the Sourcegraph codebase, we have a lot of control over caching since it's our code.

Does that help give context? Your call as to what's best.