rails / solid_cache

A database-backed ActiveSupport::Cache::Store
MIT License

Introduce ability to constrain cache storage by size #116

Closed · skatkov closed this 10 months ago

skatkov commented 11 months ago

To better support constrained environments with low database storage (e.g. Heroku), it would be nice to constrain Solid Cache by the size of the cache table, to ensure the cache won't consume all the available space in the database.

DHH mentioned this here: https://github.com/rails/rails/issues/50443#issuecomment-1871279699

djmb commented 11 months ago

We need a database-agnostic way of doing this, and there's no database-agnostic way to get the table size like that. Also, at least for MySQL, the table size won't shrink unless you rewrite the table, so even if you could get it, it's not useful as a measure of cache size.

We could store the total size of the cache values in a separate table, but that's got issues:

  1. We need extra writes to adjust the cache sizes when upserting/deleting
  2. In fact we can't upsert any more, since we'd need to read the previous value to get its size and adjust the total correctly on updates (sketched below)
  3. There would be contention on the row storing the total (could probably shard it across multiple rows to mitigate this though)
  4. There's no error correction. If someone manually deletes a bunch of rows or truncates the entire entries table, then the totals will be wrong and will stay wrong.

So I don't think that's a valid option.
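To illustrate points 1 to 3: every write would have to become a read-then-write inside a transaction rather than a single upsert. This is a rough sketch only, using a hypothetical `CacheTotal` model that is not part of Solid Cache:

```ruby
# Rough sketch: what a write might look like with a separate running total.
# `CacheTotal` is a hypothetical model/table holding the summed value sizes.
SolidCache::Entry.transaction do
  previous = SolidCache::Entry.lock.find_by(key: key)
  delta = new_value.bytesize - (previous ? previous.value.bytesize : 0)

  if previous
    previous.update!(value: new_value)   # no longer a single upsert statement
  else
    SolidCache::Entry.create!(key: key, value: new_value)
  end

  # Single hot row holding the total: the contention mentioned in point 3.
  CacheTotal.update_counters(1, size: delta)
end
```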

A second approach is to store the size as a separate column on each row of the solid_cache_entries table. We could get the total size by summing the size column. However, that query would be much too slow once the cache grows beyond a certain size (e.g. on Basecamp we have about 900 million rows).
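For reference, the exact version of that query is trivial but has to scan every row (assuming a hypothetical byte_size column):

```ruby
# SELECT SUM(byte_size) FROM solid_cache_entries -- exact, but far too slow at scale
SolidCache::Entry.sum(:byte_size)
```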

We could try to sample the size column to generate an estimate. If we assigned each row to a random bucket and created an index on (bucket, size), we could quickly read all the sizes from one or more buckets and generate an estimate from that.

I think that will quite often give a fairly accurate estimate. However, large outliers could be a problem. If we also indexed the size column, we could read the N largest records and use those to adjust our estimate. That will need some thought about exactly how we do it.

This probably needs some testing to see how accurate the estimate is across different sets of sizes.
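Here's a rough sketch of the sampling idea, assuming hypothetical bucket and byte_size columns with indexes on (bucket, byte_size) and byte_size; the outlier handling below is just one possible way to do the adjustment:

```ruby
NUM_BUCKETS   = 1_000 # each row gets a random bucket in 0...NUM_BUCKETS on insert
SAMPLE_COUNT  = 10    # buckets read per estimate
OUTLIER_COUNT = 100   # largest rows accounted for exactly

def estimated_cache_size
  # Read the N largest rows exactly so a few huge values can't skew the sample.
  outliers      = SolidCache::Entry.order(byte_size: :desc).limit(OUTLIER_COUNT).pluck(:id, :byte_size)
  outlier_ids   = outliers.map(&:first)
  outlier_total = outliers.sum(&:last)

  # Sample a few random buckets, excluding the outliers, and scale up to the whole table.
  sampled_buckets = Array.new(SAMPLE_COUNT) { rand(NUM_BUCKETS) }.uniq
  sampled_size    = SolidCache::Entry.where(bucket: sampled_buckets)
                                     .where.not(id: outlier_ids)
                                     .sum(:byte_size)

  outlier_total + sampled_size * (NUM_BUCKETS.to_f / sampled_buckets.size)
end
```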

There are a couple of other things to note here:

  1. The size is not the size on disk but the size of the records. There'll be extra overhead for indexes, empty space, etc., and that will differ based on the database and its settings. We could add a multiplier to try to account for the overhead, or just live with the difference and leave it to the user to account for it. We should account for the expected overhead when setting the default, though.
  2. Reading more values will be slower but give us a better estimate. It will need some experimentation to see whether we can get accurate enough estimates quickly enough. Maybe we let this be configured, but we'd need good defaults.
  3. We can estimate how many records are in each bucket using max(id) - min(id), so we can guess how many buckets to query in one go to get N records back.
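On point 3, a rough sketch of that guess, again assuming the hypothetical bucket column and ids that come from a roughly dense auto-increment sequence:

```ruby
# Guess how many buckets to read to see roughly `target_rows` rows in one query.
def buckets_to_sample(target_rows, num_buckets: NUM_BUCKETS)
  # max(id) - min(id) approximates the row count without an expensive COUNT(*).
  approx_rows     = SolidCache::Entry.maximum(:id).to_i - SolidCache::Entry.minimum(:id).to_i
  rows_per_bucket = [approx_rows.to_f / num_buckets, 1.0].max

  [(target_rows / rows_per_bucket).ceil, num_buckets].min
end
```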
djmb commented 10 months ago

See https://github.com/rails/solid_cache/pull/139