mozilla / mentat

UNMAINTAINED A persistent, relational store inspired by Datomic and DataScript.
https://mozilla.github.io/mentat/
Apache License 2.0
1.65k stars 115 forks source link

Investigate methods of reducing the on-disk size of Mentat databases #811

Open thomcc opened 6 years ago

thomcc commented 6 years ago

This document highlights many of the issues and concerns https://docs.google.com/document/d/14ywV4PlBAdOsJrxd7QhMcxbo7pw8cdc1kLAb7I4QhFY/edit?ts=5b7b7f87.

Some work to measure the size of the database when storing history is in https://github.com/mozilla/application-services/pull/191. For my places.sqlite (100k places, 150k visits), it gets around 200MB larger.

It's worth noting that 'disk usage' was one of the primary concerns reported by user research for Fenix (although it's not clear if this size increase (relative to places) is the kind of thing that would make a dent relative to stuff like caches and the like -- A very informal poll of some friends of mine found that Fennec typically uses around 500MB of space (app + data), another 200MB isn't a trivial increase, but doesn't substantially change where we are in terms of app size).

Some bugs which may help (suggested by @rnewman):

I think something like sqlite's zipvfs extension would likely help (as the databases compress well), but have not tried it. Implementing it ourselves is likely beyond the scope of this effort (I took a look at the effort required and it wasn't exactly trivial). Additionally, whatever we do would need to somehow integrate with sqlcipher (I also took a look at bolting compression into sqlcipher before the encryption, but the fact that this makes the block output a variable size seemed to make this problematic).

Other notes:

  1. Storing strings as fulltext and using the compress/uncompress options of FTS4 did not help, since the strings in each column are relatively small. Additionally, the performance overhead here was substantial even for a very fast compressor (LZ4).
  2. Most string data seems to be duplicated ~4 times, in datoms, timelined_transactions, and in the indices idx_datoms_eavt, idx_datoms_aevt.
  3. During RustConf, @rnewman suggested that ultimately mentat will likely not want to use sqlite, and instead want to read datoms chunks directly out of something like RKV. These chunks could be compressed more easily. This seems out of scope, as it would be a massive change to mentat, but is worth writing down.

Additional concerns exist around the fact that this problem may be exacerbated by materialized views (perhaps #33 will help or prevent this?)