mbr / simplekv

A simple key-value store for binary data.
http://simplekv.readthedocs.io
MIT License

Add 'delimiter' keyword to 'iter_keys' #83

Closed JoergRittinger closed 5 years ago

JoergRittinger commented 5 years ago

I would like to be able to pass a delimiter keyword to the store.iter_keys method (https://github.com/mbr/simplekv/blob/master/simplekv/__init__.py#L102).

Some implementations support this feature:

Others would allow an implementation for it:

For remaining stores one could implement a slow evaluation which iterates over all existing keys.
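A minimal sketch of such a slow fallback, assuming the store can only hand back a flat iterable of all its keys. The helper name `iter_prefixes_fallback` and the sample keys are hypothetical and not part of simplekv; the function scans every key once and cuts each one at the first delimiter after the prefix.

```python
def iter_prefixes_fallback(keys, delimiter, prefix=""):
    """Yield each distinct entry below `prefix`, cut at the next `delimiter`.

    Hypothetical helper, not part of simplekv. It still touches every key,
    so it is only as fast as a full key listing.
    """
    seen = set()
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        pos = rest.find(delimiter)
        if pos >= 0:
            # Key continues past a delimiter: report only the "folder" part.
            entry = prefix + rest[:pos + len(delimiter)]
        else:
            # No further delimiter: the key itself is the entry.
            entry = key
        if entry not in seen:
            seen.add(entry)
            yield entry


# Made-up keys, for illustration only.
sample_keys = ["a/1", "a/2", "b/x/y", "c"]
top_level = sorted(iter_prefixes_fallback(sample_keys, "/"))
below_b = sorted(iter_prefixes_fallback(sample_keys, "/", prefix="b/"))
```

This gives the same kind of result a delimiter-aware backend would return, but without the speed benefit, since all keys are still enumerated.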

Are there opinions on this? I would welcome a short discussion about this feature. I am also willing to implement parts of it (though I am not sure how well I can implement the feature for the db stores).

fmarczin commented 5 years ago

Since this would change the common API (making it less simple) and requires implementations for all backends, there should be a rationale, or a payoff that makes this worthwhile.

I guess the use case is enumerating prefixes without having to download all the keys, in order to later narrow down the enumeration of actual (full) keys by using the prefix feature. Is that right? Does that capture it, or would you want to add something?

Since this extends the common API, before merging this, we should have an implementation for each backend that supports it, as well as a fallback for those that don't.

JoergRittinger commented 5 years ago

Thx @fmarczin for your comment. Our use case is that we use file-like naming for a data structure:

data-type-1/metadata.json
data-type-1/run_timestamp=1232736/cluster=1/jds9f891h9f1hf0aihsf.parquet
data-type-1/run_timestamp=1192839/cluster=2/adkfp3f2ifh2pdh29ffd.parquet

where we have 10^5 parquet files and a handful of "top level folders". I need quick access to data-type-1/metadata.json. Using iter_keys/keys and filtering afterwards takes several minutes. With Azure, for example, I can use / as a delimiter and get a very fast result.

fmarczin commented 5 years ago

The use case seems very valid to me. Especially for network stores, the penalty for transferring all keys can become prohibitive. My concern is for backward compatibility and API consistency. When the delimiter feature is used, the Azure blob store API, for example, returns special placeholders among the list of keys. This breaks the promise that keys() and iter_keys() return, well, keys. Apart from keeping the API simple overall, the feature should be designed in a way that does not break these very straightforward expectations.

crepererum commented 5 years ago

Working on that. I am trying to keep the current API and introduce a new method.
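One possible shape for that split, sketched on a toy in-memory store. The class `DictStoreSketch` and the method name `iter_prefixes` are assumptions for illustration, not the actual simplekv change. `iter_keys` keeps yielding only real keys, while folder-like placeholders come solely from the separate method.

```python
class DictStoreSketch:
    """Toy in-memory store standing in for a real backend (hypothetical)."""

    def __init__(self):
        self._data = {}

    def put(self, key, data):
        self._data[key] = data
        return key

    def iter_keys(self, prefix=""):
        # Unchanged contract: every yielded value is a real, retrievable key.
        return (k for k in sorted(self._data) if k.startswith(prefix))

    def iter_prefixes(self, delimiter, prefix=""):
        # Hypothetical new method: entries are cut at the next delimiter,
        # so some results are placeholders rather than keys.
        seen = set()
        for key in self.iter_keys(prefix):
            head, sep, _ = key[len(prefix):].partition(delimiter)
            entry = prefix + head + sep  # sep is "" when no delimiter follows
            if entry not in seen:
                seen.add(entry)
                yield entry


store = DictStoreSketch()
for name in (
    "data-type-1/metadata.json",
    "data-type-1/run_timestamp=1232736/cluster=1/jds9f891h9f1hf0aihsf.parquet",
    "data-type-1/run_timestamp=1192839/cluster=2/adkfp3f2ifh2pdh29ffd.parquet",
):
    store.put(name, b"")

top_level = list(store.iter_prefixes("/"))
inside = list(store.iter_prefixes("/", prefix="data-type-1/"))
all_keys = list(store.iter_keys())
```

Here the listing under data-type-1/ surfaces metadata.json directly alongside two run_timestamp placeholders, while iter_keys() is unaffected. A network backend would delegate `iter_prefixes` to its native delimiter listing instead of scanning all keys.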