rails / solid_cache

A database-backed ActiveSupport::Cache::Store

Suggest using a hash index instead of a btree index in the migration #94

Closed skatkov closed 10 months ago

skatkov commented 1 year ago

Good day. Thanks for the amazing gem.

I've experimented with storing a cache in a relational database as well. Based on my findings, it seems worth suggesting a hash index as a possible alternative to btree.

A hash index doesn't have to descend through the tree to find the right entry; instead it locates a cache entry in O(1). This small improvement can yield a 40-60% lookup speedup, depending on the size of the cache.

This optimization has only one drawback: the #delete_matched method can't be used, because a hash index supports equality lookups only. So I raise an error if a hash index is detected. But to be frank, not a lot of people use this method anyway, so it seems like a manageable trade-off.
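
For illustration, here's a rough sketch of such a guard (module name, error message, and usage are hypothetical, not the PR's actual code):

    require "active_record"

    # Hypothetical guard, not the PR's exact code: refuse #delete_matched
    # when the key column is backed by a hash index, since hash indexes
    # support equality lookups only, not the pattern scans that matching
    # requires.
    module HashIndexGuard
      def delete_matched(matcher, options = nil)
        hash_indexed = ActiveRecord::Base.connection
                                         .indexes(:solid_cache_entries)
                                         .any? { |i| i.columns == ["key"] && i.using == :hash }
        raise NotImplementedError, "delete_matched requires a btree index on key" if hash_indexed
        super
      end
    end

    # Usage (hypothetical): SolidCache::Store.prepend(HashIndexGuard)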

To add a little more context to this PR, here's an interesting benchmark comparing btree and hash indexes: https://evgeniydemin.medium.com/postgresql-indexes-hash-vs-b-tree-84b4f6aa6d61

skatkov commented 1 year ago

Oops, let me fix this :)

rafaelsales commented 1 year ago

IMO the added complexity in the installation command and keeping two migration files is a bit too much here.

How about making the hash index the default and adding a commented add_index line that uses btree, with advice on when to uncomment it? E.g.:

    create_table :solid_cache_entries do |t|
      t.binary :key, null: false, limit: 1024, index: { unique: true, using: :hash }
      t.binary :value, null: false, limit: 512.megabytes
      t.datetime :created_at, null: false

      # Uncomment the btree index below if you want to use `SolidCache::Entry.delete_matched`
      # t.index :key, unique: true, using: :btree
    end
skatkov commented 1 year ago

Thanks, @rafaelsales, I agree with your suggestion.

Just for visibility's sake, I'll post @djmb's response from Slack:

> It might be that we could just decide not to support delete_matched at all.
>
> - In MySQL with InnoDB, there are no hash indexes - it's allowed in the CREATE INDEX syntax, but it just creates btree indexes. But we could reduce the key column size instead.
> - I'd like to get somewhere where there are no options but it's as fast as it can be - so for Postgres that would probably mean your approach.
>
> I'm away for a week or so, so I'll leave your PR for now, but I've not forgotten it!

I'll wait for a final decision from @djmb and then I'll fix this PR.

But it seems reasonable to completely remove support for #delete_matched and keep a single migration that uses a hash index on Postgres and a btree index on MySQL.
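
A hedged sketch of what that single adapter-aware migration could look like (class name and details are assumptions, not the final code):

    class CreateSolidCacheEntries < ActiveRecord::Migration[7.1]
      def change
        # Postgres gets a hash index; on MySQL/InnoDB "USING HASH" is
        # accepted but silently builds a btree, so request btree directly.
        # Note: Postgres hash indexes cannot be unique, so uniqueness is
        # only enforced on the btree path.
        postgres = connection.adapter_name.match?(/postg/i)

        create_table :solid_cache_entries do |t|
          t.binary :key, null: false, limit: 1024,
                   index: { using: postgres ? :hash : :btree, unique: !postgres }
          t.binary :value, null: false, limit: 512.megabytes
          t.datetime :created_at, null: false
        end
      end
    end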

jmonteiro commented 1 year ago

Small note on this, per Postgres' documentation:

> Currently, only B-tree indexes can be declared unique.

So a hash index cannot enforce uniqueness.
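
For example, attempting it from a migration on Postgres fails outright:

    # Postgres rejects unique hash indexes:
    add_index :solid_cache_entries, :key, unique: true, using: :hash
    # => ActiveRecord::StatementInvalid (PG::FeatureNotSupported):
    #    ERROR:  access method "hash" does not support unique indexes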

skatkov commented 11 months ago

@djmb it would be great to hear which direction you want to take with the hash index for PG.

I'd be happy to adjust this PR accordingly. Not sure if there's a point in fixing the current implementation.

djmb commented 11 months ago

Hi @skatkov!

Sorry for the delay on this!

I'm planning to add a separate key_hash column, which will be a 64-bit integer. This will allow a much smaller lookup index that works on all databases. The query for records would then be `where key_hash = <hash> and key = <key>`.
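
A hedged sketch of that lookup (the hash function and helper names are assumptions, not the final design):

    require "digest"

    # Derive a signed 64-bit integer from the key, e.g. the first 8 bytes
    # of a SHA-256 digest (one possible choice among many).
    def key_hash_for(key)
      Digest::SHA256.digest(key.to_s).unpack1("q>")
    end

    # Equality lookup on the compact key_hash index, then an exact key
    # comparison to rule out 64-bit hash collisions.
    def find_entry(key)
      SolidCache::Entry.where(key_hash: key_hash_for(key), key: key).first
    end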

Maybe we could still make it a hash index on PostgreSQL, but that would depend on whether we need it to also be a unique index.

We could keep the unique index on key, which we can use for delete_matched, or have a unique index on the new key_hash column instead and make the index on key an optional one for delete_matched support.

The performance difference between btree and hash indexes may also not be as significant for a 64-bit integer column as for a 1K blob.

Another consideration is the changes needed to allow us to estimate the cache size. If those work out, then key_hash could potentially be used as the bucket column - you could do something like `select sum(size) from solid_cache_entries where key_hash between x and y`.

This would require an index on key_hash and size to be efficient, so that adds to the mix.
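
A rough sketch of that idea (index shape and bucket bounds are placeholders, not a committed design):

    # Inside a migration: a composite index so the range-sum query can
    # be answered from the index alone.
    add_index :solid_cache_entries, [:key_hash, :size]

    # Sum the sizes in one key_hash bucket (x..y are the placeholder
    # bounds from above), then extrapolate the total from the fraction
    # of the 64-bit hash space that was sampled.
    bucket_bytes = SolidCache::Entry.where(key_hash: x..y).sum(:size)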

I'll be working on this over the next few weeks, but it needs some investigation. Once we have the columns and indexes in place for everything, we can see where a PostgreSQL hash index fits into the mix.

skatkov commented 10 months ago

I'll drop this for now. Happy to take another stab at it once @djmb decides on the approach he wants to take here.