qri-io / qri

you're invited to a data party!
https://qri.io
GNU General Public License v3.0
1.11k stars 66 forks source link

Use Bloom Filter as a fast path to check for public dataset existence #1899

Open b5 opened 3 years ago

b5 commented 3 years ago

I'd like to propose an extension interface aimed at improving the performance of time-to-not-found on a dataset alias that doesn't exist. Right now if I run qri get some_user/random_gibberish_dataset_name, all qri resolvers have to scan their entire list of datasets to confirm such a name doesn't exist. Keeping in mind that qri get is by default hooked up to a network-backed resolver, that means an HTTP call to a database scan per bad name request.

We can speed this up with a bloom filter hydrated with all known public names, or with specific SQL queries on sql-backed name stores. In either case, I'd like to define an interface resolver implementations can use to confirm or deny the existence of a name:

package dsref

// PublicAliasChecker is a fast-check for the existence of a public username / name combination
type PublicAliasChecker interface {
    PublicAliasExists(ctx context.Context, username, name string) (bool, error)
}

Using these techniques we can dramatically drop the time required to rule out a dataset existing.

I think we should define this as an interface in the dsref package to assist in a long term migration toward a single resolver definition backed by different name-store implementations. Currently every resolver needs to re-implement what is effectively the same pattern, just specialized to different stores. But our codebase is trending toward defining uniform store interfaces and business logic that operates on those stores. If we apply that pattern to the ResolveRef interface we should see a Namestore interface emerge at some point that allows us to unify ResolveRef into single function that operates on any given namestore.

We need to define this on public names only, because which unlisted / private datasets a user can see will be dependent on who they are, which will be a different issue. But that does mean a name resolver can only return dsref.ErrNameNotFound when all public names and the user's accessible list of private names are checked.