phiresky / sqlite-zstd

Transparent dictionary-based row-level compression for SQLite
GNU Lesser General Public License v3.0
1.38k stars 49 forks source link

Comparison to compression VFS #9

Open phiresky opened 1 year ago

phiresky commented 1 year ago

Something like https://github.com/mlin/sqlite_zstd_vfs. Pretty different in the tradeoffs and design decisions, but still should be compared to.

evbo commented 1 year ago

Any thoughts on SQLite's official CEROD extension? I know it's commercial but the source is public, so as long as you did an academic study I think that'd be legal?

If not, do you know anecdotally why that extension might not be a great choice for comparison? Does it have too much performance penalty? Is read-only requirement too restrictive?

phiresky commented 1 year ago

I didn't know CEROD was source-available. I don't see any obvious documentation about how the compression aspect of CEROD works, but I'll assume it's similar to sqlite_zstd_vfs. First of all, read-only sounds extremely restrictive to me.

The main potential disadvantage of the VFS approach is that all data is compressed in the same way. That means that even the often-accessed parts (e.g. B-Tree inner nodes) are compressed which might be bad for performance. It also means there's no way to compress different data differently. For example, with my extension you can compress one table with high compression level, another with low compression level (resulting in higher write performance), and another one not at all. You can also flexibly choose the method for deciding which data is compressed together or at all with an arbitrary SQL expression, which is advantageous for e.g. a hot-set vs archived-data situation.

Another advantage of sqlite-zstd is that it can be incrementally enabled without rewriting the rest of the database and without affecting the storage layer (how a database is opened / written to) at all.

The main disadvantage is that this extension is more complex than e.g. sqlite_zstd_vfs and not 100% transparent to the application, and that only data can be compressed, not the B-trees / indexes, which can reduce the effectiveness.

I'm pretty sure it's possible to construct both examples where my extension is faster and more storage efficient than VFS-based alternatives, as well as where it is worse. But it's hard to say in general which will be the case for a real-world application.

evbo commented 1 year ago

In my case, it boiled down to VFS applied to a database of mostly ints (no blobs or big varchars).

If I had blobs or big varchars, I would use yours :)

solerman commented 1 year ago

Give your original idea in the blog,there is some overlap with time-series db (or timeseries db extensions, there are a few for postgresql eg https://github.com/timescale/timescaledb)