w3c / webextensions

Charter and administrivia for the WebExtensions Community Group (WECG)
Other
578 stars 50 forks source link

Proposal: Retrieving, deleting storage items using a filter. #505

Closed polywock closed 2 months ago

polywock commented 6 months ago

Why it's Needed (Efficiency)

I regularly need to use storage keys that involve dynamic elements like flag:[tabId]. The only way to get all keys starting with the prefix flag: is using StorageArea.get() to get all the storage items and then to filter.

This isn't very efficient. The storage areas support 10MB of data and even more with the unlimitedStorage permission. That's potentially a huge amount of storage data that needs to be serialized and sent over. There's also a special concern when using the Sync storage area.

To delete all keys with prefix flags: is even worse as it requires a three step process. Get all storage items, filter for keys you want to delete, and use StorageArea.remove to delete them.

Why it's Needed (Organization)

This proposal allows extension developers to group, retrieve, and delete a collection of related keys.

```ts browser.storage.local.get(['content/a', 'content/b', 'content/c', 'content/d', 'content/e', 'content/f', ...]) // becomes browser.storage.local.get(['content/*'], true) ```

Proposal

A way to retrieve and delete storage items using a filter. The filter can be a list of globs or an object. Preferably the same filter should be used for this proposal and #475.

type StorageFilter = Glob[] | FullStorageFilter

interface FullStorageFilter {
    matches: Glob[],
    excludeMatches: Glob[] 
}

Usage if new methods are introduced.

browser.storage.local.getUsingFilter(['flag:*'])

browser.storage.local.removeUsingFilter(['flag:*'])

Usage if existing methods are used.

The second parameter true indicates we're using filters.

browser.storage.local.get(['flag:*'], true)

browser.storage.local.remove(['flag:*'], true)
fregante commented 6 months ago

I'd rather keep the existing API and just pass a regex. Alternatively globs would be flexible and match existing webext matching.


browser.storage.local.get(/^flag:/)
browser.storage.local.get({match: 'flag:*'})
polywock commented 6 months ago
browser.storage.local.get(/^flag:/)

Looks great, but the other one isn't backward compatible. Whichever format it ends up being, it should probably be consistent with proposal #475.

fregante commented 6 months ago

but the other one isn't backward compatible.

Ah you're right, I forgot objects are used to provide defaults. {glob: true} could be the last parameter then, while the first one remains a string/object as usual.

tophf commented 6 months ago

The simplest solution would be to add getAllKeys(lowerBound?:string, upperBound?:string) method.

However, chrome.storage API was never designed to be performant and flexible, so maybe a better solution would be to add chrome.indexedDB which would expose the full power of IndexedDB in all parts of the extension. Firefox already uses IndexedDB internally for chrome.storage anyway, so it's indeed a viable solution. It could be also a subset of IndexedDB named as chrome.db, but I don't see why not expose the full API as the hardest thing to do is probably Promisification.

regex

JS RegExp won't work in Chrome as it only supports RE2 syntax inside the browser processes for the sake of guaranteed performance due to the absence of backtracking. It may be worse in other browsers, e.g. Safari supports a very limited/nonstandard subset of RE for a similar reason in declarativeNetRequest.

match pattern

would work for most cases I guess, but it's too limited, so ideally it should be extended with ranges like [a-f].

polywock commented 6 months ago

Since regex isn't an option. Out of all the suggestions so far, I think a match pattern suggested by @tophf (in #475) and @fregante is the most well rounded. I have updated my proposal to reflect this.

Rob--W commented 6 months ago

There is little appetite for the proposed API, but the browser vendors are willing to consider adding a new method to retrieve all known storage keys (return value: a list of strings). This would enable you to implement the filtering yourself, and then pass the array of keys to get/remove (or a new object with set if the goal is to bulk-replace values).

What are your thoughts on this counter-proposal?

tophf commented 6 months ago

A dumb getAllKeys() is the bare minimum, but it'll still be wasteful when there are many keys, so it's best if we can specify lower/upper bounds for the keys just like IndexedDB's getAllKeys or something similar.

polywock commented 6 months ago

@Rob--W I'm supportive, but ideally both should be implemented. Some advantages for a filter based approach.

  1. It's truly asynchronous, no need to wait for getKeys() + filter, before making your actual request.
  2. The browser can filter more efficiently than runtime javascript.
  3. Less payload that needs to be serialized. Instead of including 100s of key names, you can include 'prefix/*'.
  4. Consistency with #475. If that proposal is implemented, out of the box, it would make sense to also retrieve items using the same filter.
  5. Less payload should also benefit the Sync storage area. I don't know how that works internally, but if the keys actually need to travel to a server, it would be far more efficient to use a filter than a huge list of matching keys.
oliverdunk commented 5 months ago

@polywock, to clarify - if only getKeys() was implemented, would you be a potential consumer for that? Or would that be too limited and you would workaround this in other ways? We discussed some of the the points you made in the meeting but ultimately we preferred the simpler getKeys() API.

fregante commented 5 months ago

I'd definitely use it in my module to only fetch items that belong to it ("starts with cache:")

https://github.com/fregante/webext-storage-cache/blob/d6f25cb45989ac947a9465724dfec1a4d7c550f8/source/legacy.ts#L94

I think it's a fair compromise, if the filter for .get() isn't accepted. However I wonder what browsers gain from this dual API since it feels like it halves the performance (two lookups)


I also have another module that populates/autosaves the options page into the storage, given a schema. It currently uses a single key and the options are stored as a single object.

With this API, I would be able to store one item per field, without losing the core features (like the ability to migrate keys that are no longer in the schema)

tophf commented 5 months ago

what browsers gain

Simplicity. They want to keep this API simple. A getAllKeys method was always a necessity though, so it's a good compromise for them, whereas we'll still get the benefit of not reading huge values.

polywock commented 5 months ago

@oliverdunk It's not my preferred method, but getKeys() will be very useful.

hanguokai commented 5 months ago

Background

Because browser.storage api doesn't support multiple namespaces (databases), as a result, developers need to put all kinds of data for different purposes in one space. A common solution is to set a prefix for keys to distinguish different types of data.

The original intent of #475 was to support multiple space. Considering that browsers may not support this change, a compromise was made. If browsers supported multiple spaces, this proposal #505 would not exist.

So the proposal #505 itself is a product of compromise. Now, getAllKeys() is a further compromise. Developers need to write another wrapper function, e.g. getDataByPrefix(prefix). It helps, but not directly. And this is mainly useful when values are large objects. If the data is small, developers can just read all the data directly.

polywock commented 5 months ago

The original intent of https://github.com/w3c/webextensions/issues/475 was to support multiple space.

Since I have long since edited my 475 proposal. For those who don't know, in 475 I had two proposals. The primary proposal was to support buckets/spaces for the storage API, and allow retrieving, deleting from a specific bucket, or listening to changes on a specific bucket. From the feedback, I switched focus to the secondary proposal (StorageArea.onChanged supporting a filter). I wouldn't say this was a compromise since I included the secondary proposal alongside the main proposal. I prefer the secondary proposal as of today, especially if paired with #505 (or even the counter-proposal, getKeys).

All things considered, I'm still satisfied a counter-proposal was accepted. It won't be as efficient as the original, but it will still allow for getting filtered keys more efficiently. It would be interesting to do benchmarks to figure out at what point it's better to use getKeys() + filter + get(filtered keys) compared to the current get() + filter.

In addition, if using getKeys() to filter becomes common, the browsers might later implement a dedicated function to optimize it.

fregante commented 5 months ago

out at what point it's better to use getKeys() + filter + get(filtered keys) compared to the current get() + filter

My guess is that it has limited use for storage.sync since you're reading 100KB at most from disk. Also a micro-optimization for storage.session since reading from RAM is extremely fast.

That leaves storage.local, particularly with unlimitedStorage.

In any case, the most common usage would probably look something like this:

const cache = await chrome.storage.local.get(await chrome.storage.local.getKeys('cache:*'))
await chrome.storage.local.remove(await chrome.storage.local.getKeys('cache:*'))
hanguokai commented 5 months ago

@fregante getKeys() doesn't have parameters, it returns all key in the storage, then filter key by yourself. Another, storage.session is not as fast as you think in current Chrome implementation. Considering OS disk cache, storage.local is almost as fast as storage.session.

PS: I didn't say getKeys() is not useful. It is useful if you have a lot of data.

oliverdunk commented 5 months ago

I had a chance to speak to the engineering team about this. On the Chrome side, we're supportive of a getKeys() method to get all keys from a given storage area. We would also be supportive of a deleteKeys() that takes multiple keys to delete if there is interest from other browsers.

tophf commented 5 months ago

We would also be supportive of a deleteKeys() that takes multiple keys

This is already implemented: chrome.storage.XXXX.~delete~remove() accepts an array of keys.

oliverdunk commented 5 months ago

This is already implemented: chrome.storage.XXXX.delete() accepts an array of keys.

Do you mean remove()? It looks like you're right though - in which case we definitely shouldn't duplicate that.

polywock commented 2 months ago

Closing in favor of #601.