parcel-bundler / parcel


Rust - LMDB Cache back-end #9827

Open yamadapc opened 2 days ago

yamadapc commented 2 days ago

devongovett commented 1 day ago

I don't think we need a database for this. We don't need any of the features a database offers: durability, fault tolerance, transactions, indexes, and so on. We don't care about data loss; a cache can simply be rebuilt from scratch if something goes wrong. All of those features add significant overhead and complexity.

The history here is that we started with a file-system-based cache in Parcel. We used this cache not only as an actual cache, but also as temporary storage within a build to pass data from one phase to another (e.g. transformation to packaging), which may run on different threads. In the JS implementation of Parcel, this was necessary due to the lack of shared memory between threads, as well as the V8 heap size limits we ran into while building large apps. We eventually moved to LMDB because its batching made it faster than the FS cache, but avoiding disk IO during builds entirely would be much faster still.

In Rust, neither of these is a problem anymore. We can keep as much as we want in memory (even more than the total physical memory if necessary, since virtual memory is already paged by the OS), and we don't need to write to disk in order to share data between threads. Avoiding writing to disk until Parcel shuts down (or perhaps during idle time) will be much faster; I think your benchmarks show the cost well. This is worth doing because the common case is a dev server that performs many builds over its lifetime, none of which need to be written to disk, so we should optimize that dev loop. Not only is this faster, it's also simpler to implement, and it avoids the inconsistent state that could arise when data structures are sharded.
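
To make the shape of this concrete, here's a minimal sketch (not Parcel's actual API) of the idea: an in-memory map shared across worker threads via `Arc`, with disk writes deferred to a single flush at shutdown. The `MemoryCache` type, key scheme, and `.parcel-cache` directory are all illustrative assumptions.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::sync::{Arc, RwLock};

/// Hypothetical in-memory cache shared across worker threads.
/// Entries live entirely on the heap; nothing touches disk during a build.
#[derive(Default)]
struct MemoryCache {
    entries: RwLock<HashMap<String, Vec<u8>>>,
}

impl MemoryCache {
    fn set(&self, key: String, value: Vec<u8>) {
        self.entries.write().unwrap().insert(key, value);
    }

    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.entries.read().unwrap().get(key).cloned()
    }

    /// Persist everything in one pass, e.g. at shutdown or during idle time.
    fn flush_to_disk(&self, dir: &Path) -> io::Result<()> {
        fs::create_dir_all(dir)?;
        for (key, value) in self.entries.read().unwrap().iter() {
            fs::write(dir.join(key), value)?;
        }
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let cache = Arc::new(MemoryCache::default());

    // Worker threads share the cache with no disk IO at all.
    let handles: Vec<_> = (0..4)
        .map(|i| {
            let cache = Arc::clone(&cache);
            std::thread::spawn(move || {
                cache.set(format!("asset-{i}"), vec![i as u8; 1024]);
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }

    // Only now, at "shutdown", does anything get written to disk.
    cache.flush_to_disk(&PathBuf::from(".parcel-cache"))
}
```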

yamadapc commented 1 day ago

We might care about transactions / consistency for some types of entries, but generally I agree that most of the time we just want to write the transformation outputs (or other intermediary assets) into files. In any case, that's what supporting remote caching will require.

To do that, we'd need individual cache entries for each of these intermediary assets.
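
As a rough sketch of what per-entry storage could look like, here each transform output becomes its own content-addressed file, which is the unit a remote cache could upload, download, and invalidate independently. `EntryCache`, the hash scheme, and the directory layout are all assumptions, not Parcel's design.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::io;
use std::path::PathBuf;

/// Hypothetical per-entry cache: one file per intermediary asset,
/// addressed by a hash of its content.
struct EntryCache {
    dir: PathBuf,
}

impl EntryCache {
    fn new(dir: impl Into<PathBuf>) -> io::Result<Self> {
        let dir = dir.into();
        fs::create_dir_all(&dir)?;
        Ok(Self { dir })
    }

    /// Write one transformation output and return its content hash,
    /// which callers (or a remote cache) can use to fetch it later.
    /// Note: std's DefaultHasher is not guaranteed stable across Rust
    /// versions; a real implementation would use a dedicated content hash.
    fn put(&self, output: &[u8]) -> io::Result<String> {
        let mut hasher = DefaultHasher::new();
        output.hash(&mut hasher);
        let key = format!("{:016x}", hasher.finish());
        fs::write(self.dir.join(&key), output)?;
        Ok(key)
    }

    fn get(&self, key: &str) -> io::Result<Vec<u8>> {
        fs::read(self.dir.join(key))
    }
}
```

Because every entry is an independent file keyed by content, a remote cache can sync entries individually rather than shipping one monolithic database around.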