sourcegraph / srclib

srclib is a polyglot code analysis library, built for hackability. It consists of language analysis toolchains (currently for Go and Java, with Python, JavaScript, and Ruby in beta) with a common output format, and a CLI tool for running the analysis.
https://srclib.org

Support caching build/dep directories #93

Open sqs opened 9 years ago

sqs commented 9 years ago

Dependency resolution and some other build tasks take a long time but essentially do the same thing every time. If toolchains could write to a directory tree that would be present at future invocations, they could be made to run a lot faster.

Examples:

Prior art:

xizhao commented 9 years ago

Related: #65

If each repo had its own "src unit" or "src module" cache, individual dependencies could be identified within it by their unique URLs.
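A minimal sketch of what keying a per-repo cache by dependency URL might look like. The function name `depCacheDir`, the `deps` subdirectory, the cache root path, and the use of a truncated SHA-256 hash are all assumptions made for illustration; none of them come from srclib itself.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// depCacheDir maps a dependency's unique URL to a directory under a
// repo-level cache root. Hashing the URL sidesteps characters that are
// awkward in paths. The layout and hash choice are assumptions, not
// srclib behavior.
func depCacheDir(cacheRoot, depURL string) string {
	sum := sha256.Sum256([]byte(depURL))
	// 8 bytes (16 hex characters) keeps directory names short.
	return filepath.Join(cacheRoot, "deps", hex.EncodeToString(sum[:8]))
}

func main() {
	// Hypothetical cache root for one repository.
	root := filepath.Join("cache", "example-repo")
	for _, u := range []string{
		"https://github.com/gorilla/mux",
		"https://github.com/lib/pq",
	} {
		fmt.Println(u, "->", depCacheDir(root, u))
	}
}
```

The same URL always maps to the same directory, so a later invocation could check whether that directory exists before resolving the dependency again.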

The challenge I see here is that each toolchain (tc) often relies on its language's default package manager (PM) for dependency resolution. That means (1) each PM has to support caching itself, and (2) the cache directory structure has to be compatible with the PM, which is sometimes non-configurable. In this sense things are very tc-dependent, and maybe the srclib level is the wrong place to start solving this.

Maybe what srclib could do is leverage this when toolchains are run in Docker containers. When a tc announces its depresolve step, srclib could commit the container state, name the image something like src-depresolve-<REPOID>-<COMMITID> (where REPOID is a hash of, say, the URL + name of the repo), and then pick up that image and retag it on the next build. Old build data would be cached for each respective tc, yet each tc would still be sandboxed from the others, since only the build data relevant to that tc is stored in its fork of the Docker image. Reviving an independent depresolve state would then mean picking up an old image and running the depresolve step again, assuming the tc leverages stateful caching.
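The naming scheme could be sketched roughly like this. The helper names `repoID` and `depresolveImageName`, the choice of SHA-256, the 12-character truncation, and the sample commit ID are assumptions for illustration only; the comment above only says REPOID is some hash of the URL and repo name.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// repoID derives a short, stable identifier from a repository's URL and
// name. SHA-256 and the 12-character truncation are assumptions.
func repoID(repoURL, repoName string) string {
	// A NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
	sum := sha256.Sum256([]byte(repoURL + "\x00" + repoName))
	return hex.EncodeToString(sum[:])[:12]
}

// depresolveImageName builds a name of the form
// src-depresolve-<REPOID>-<COMMITID>.
func depresolveImageName(repoURL, repoName, commitID string) string {
	return fmt.Sprintf("src-depresolve-%s-%s", repoID(repoURL, repoName), commitID)
}

func main() {
	// "0123abc" is a made-up commit ID used only as a placeholder.
	name := depresolveImageName("https://github.com/sourcegraph/srclib", "srclib", "0123abc")
	fmt.Println(name)
}
```

A name built this way is all lowercase hex plus hyphens, so it should be usable as an image name with `docker commit` and `docker tag`; how srclib would actually invoke those is left open here.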

My first guess:

srclib-javascript would work fine, as node_modules would be cached.

srclib-python wouldn't leverage this, as it tells pip to download to tmp folders that are cleaned up after each execution.