jfmengels / elm-review

Analyzes Elm projects, to help find mistakes before your users find them.
https://package.elm-lang.org/packages/jfmengels/elm-review/latest/

Improve the performance of cache checks #154

Closed jfmengels closed 1 year ago

jfmengels commented 1 year ago

tl;dr

This vastly improves the performance of elm-review given a full cache.

On my work codebase, this PR brings the analysis part down from 15-18 seconds to below 2 seconds :tada:

Some context

I was disappointed by how the cache performed in the latest releases of elm-review (v2.11.0 of the package and v2.9.0 of the CLI), and I've now made it much, much faster!

On my work project, with an empty cache, the analysis part of the process (not the file loading and all that) took somewhere around 35-45 seconds (I'd have to check, but that's the ballpark). With a full cache and no changes, the analysis part took 15-18 seconds, which is pretty good but not amazing, and the gains were offset a bit by the loading (~3-6 seconds, off the top of my head) and writing of the cache (~2-4 seconds). So all in all: worth having, but not as great as expected.

This PR now brings this down to below 2 seconds under the same conditions.

Explanation

The way we check whether the cache can be reused for a module (or for the final evaluation and data extract) is by comparing the module's input project context with the one we have in the cache. If they're the same, we can reuse the cached results and skip the analysis of the file.
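As a rough sketch of that check in Elm, with hypothetical names (the real types live in elm-review's internals; `ContextHash` stands in for whatever token we derive from a context):

```elm
type ContextHash
    = ContextHash String


-- Hypothetical cache entry: the hash of the context a module was analyzed
-- with, plus the analysis result we'd like to reuse.
type alias CacheEntry projectContext =
    { inputContextHash : ContextHash
    , outputContext : projectContext
    }


reuseOrAnalyze : ContextHash -> Maybe (CacheEntry projectContext) -> (() -> projectContext) -> projectContext
reuseOrAnalyze currentInputHash maybeEntry analyze =
    case maybeEntry of
        Just entry ->
            if entry.inputContextHash == currentInputHash then
                -- Same input context as last time: reuse the cached result.
                entry.outputContext

            else
                analyze ()

        Nothing ->
            analyze ()
```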

The thing is, because the project contexts are very large pieces of data, we don't want to store them as-is in the file-system cache, as that would blow up the file's size (my MVP did exactly that, and some cache files were 400MB :sweat_smile:). We do store the "output" project contexts (that is, the results of the analysis) as-is though, as we need them to initialize the project context of any module that needs to be analyzed. But the "input" contexts are only needed in the cache to check whether we can reuse the cached results.

So to minimize the space used for the input project contexts, we hash them into plain numbers/strings (through some JS hackery: `hash(JSON.stringify(context))`). And this stringification + hashing is actually very costly because of the size of the data.

We construct a module's input context by folding together the output project contexts of all of its imported files. That means that every time we want to check whether a file can be skipped, we fold those contexts together and hash the result, which we then compare with the cache key (see the sketch below). Having to run this expensive computation on every check was the reason the analysis stayed slow, even with a full cache.
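To make that concrete, here is a minimal Elm sketch of the old key computation. The names (`oldCacheKey`, `hashContext`, `initialContext`) are hypothetical; the fold function mirrors the `foldProjectContexts` that rule authors provide, and the hashing actually happens on the JS side:

```elm
oldCacheKey :
    (projectContext -> projectContext -> projectContext)
    -> (projectContext -> ContextHash)
    -> projectContext
    -> List projectContext
    -> ContextHash
oldCacheKey foldProjectContexts hashContext initialContext importedOutputContexts =
    importedOutputContexts
        -- Fold every imported module's output context together...
        |> List.foldl foldProjectContexts initialContext
        -- ...then hash the (potentially huge) combined value. Doing this
        -- on every cache check is what kept the analysis slow.
        |> hashContext
```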

So what I ended up doing is changing the cache key: instead of being the hash of the folded contexts of all imported modules, the cache key is now the list of hashes of each imported module's output project context. The comparison itself is slightly slower (we now compare two lists instead of two strings/numbers), but that is nothing compared to the cost of hashing the combined context.
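The new shape of the key and check, under the same assumptions (`ContextHash` is the placeholder from above), might look like this:

```elm
-- The key is now simply the list of per-module output-context hashes.
type alias CacheKey =
    List ContextHash


cacheEntryIsValid : CacheKey -> CacheKey -> Bool
cacheEntryIsValid cachedKey currentKey =
    -- Comparing two lists of small tokens is cheap next to
    -- stringifying and hashing a large combined context.
    cachedKey == currentKey
```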

So instead of combining the project contexts and hashing the result on every check, we only hash a module's output project context once, right after we've analyzed it.

This will potentially make the cache slightly more brittle:

1. If a module's output context changes after a re-analysis, but in a way that has no impact on the folded project context, we will now re-evaluate the downstream modules anyway.
2. If the import graph changes slightly, the order of modules might differ from one run to the next, making the list of hashes different even when its contents are the same.

To remedy point 2, I'm currently sorting the list of hashes. We can probably find a better approach (for instance a `Dict ModuleName (ContextHash projectContext)`, sketched below), but this will do for now.
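A sketch of both the current remedy and the suggested alternative, still using the `ContextHash` placeholder; `ModuleName` follows elm-syntax's `List String` definition, which is comparable and therefore usable as a `Dict` key:

```elm
import Dict exposing (Dict)


type alias ModuleName =
    List String


-- What the PR does today: sort the hashes so that a reshuffled import
-- graph doesn't produce a different key for the same set of contexts.
sortedCacheKey : List ContextHash -> List ContextHash
sortedCacheKey hashes =
    List.sortBy (\(ContextHash raw) -> raw) hashes


-- Possible alternative: index each hash by its module's name, making the
-- key order-independent by construction.
cacheKeyByModule : List ( ModuleName, ContextHash ) -> Dict ModuleName ContextHash
cacheKeyByModule =
    Dict.fromList
```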

Facepalm

The sad thing is that this was the way the cache worked before (when it was only an in-memory check and there was no need to hash anything). And at some point during the process I thought "well, these should be equivalent. Let's switch it so that we only have a single comparison to do" (which would also have helped with point 1 above). Big mistake :sweat_smile: