ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Add collections to field that can be updated atomically #271

Open anjackson opened 2 years ago

anjackson commented 2 years ago

Currently, collections are stored as strings in multivalued fields. This has a couple of problems. Firstly, the string labels should really be resolved in the UI, so the index only needs to store integer IDs for collections.

More importantly, the current model requires full document re-indexes if the Collections are updated. It would be better to store the collection in fields that meet the criteria for atomic, in-place updates (see In-Place Updates). This would allow collection membership to be updated without costly full re-indexing.

The main limitation is that these fields have to be single-valued. If URLs can only belong to one collection, or have a 'primary collection', then this works fine. But in general we want multiple collections, so as a workaround, we can use dynamic fields something like:

collection_1_id_i: 231
collection_2_id_i: 214
...

Then, at query time, we facet on all collection_*_id_i values (and likely have to enumerate and merge these facets client side?).
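As a rough sketch of what that client-side enumerate-and-merge step could look like: the helper names and the cap of 6 slots are assumptions for illustration, not part of webarchive-discovery. It assumes Solr's default flat JSON facet layout, where each field maps to an alternating `[value, count, value, count, ...]` list.

```python
from collections import Counter

MAX_SLOTS = 6  # assumed cap on collections per document


def slot_field(n):
    """Name of the n-th collection slot field, e.g. collection_1_id_i."""
    return f"collection_{n}_id_i"


def facet_params(max_slots=MAX_SLOTS):
    """Build query parameters that facet on every slot field explicitly."""
    params = [("q", "*:*"), ("rows", "0"), ("facet", "true")]
    params += [("facet.field", slot_field(n)) for n in range(1, max_slots + 1)]
    return params


def merge_slot_facets(facet_fields):
    """Sum per-slot facet counts into one count per collection ID.

    facet_fields is the facet_counts.facet_fields part of a Solr JSON
    response, in the flat [value, count, value, count, ...] layout.
    """
    totals = Counter()
    for flat in facet_fields.values():
        for value, count in zip(flat[0::2], flat[1::2]):
            totals[int(value)] += count
    return dict(totals)


# Hand-written sample in the shape Solr returns (not real data).
sample = {
    "collection_1_id_i": ["231", 10, "214", 3],
    "collection_2_id_i": ["214", 4],
}
merged = merge_slot_facets(sample)
```

The summed figures are only true document counts if a given collection ID never occupies more than one slot in the same document, so the update code would need to guarantee that.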

This needs to be tested from the client end to check it's workable. I think we may have to enumerate all the facets separately, so in practice we'll have a limit of e.g. 6 collections an item can belong to?

EDIT: The rights field access_terms should also be an integer rather than a string, so this can be changed too. Same for any subject fields.

tokee commented 2 years ago

Updating is not trivial, as one needs to extract the collections for a document first so that the next free collection field can be determined. But I have no better idea than yours. One alternative: by limiting the number of collections to 64, they could be stored in a single long, but that would require more front-end code to unpack, and the number of unique values when faceting is potentially enormous.
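To make the read-then-update step concrete, here is a minimal sketch of picking the next free slot and building a Solr atomic update document with the `set` modifier. The function name, the `id` unique key and the cap of 6 slots are assumptions; the input is a document dict as already fetched from the index.

```python
import re

MAX_SLOTS = 6  # assumed cap on collections per document
SLOT_PATTERN = re.compile(r"^collection_(\d+)_id_i$")


def plan_collection_add(doc, collection_id, max_slots=MAX_SLOTS):
    """Return an atomic update doc adding collection_id, or None if present.

    doc is the current document as returned by a query, including any
    collection_N_id_i fields. The unique key is assumed to be "id".
    """
    used = {}
    for field, value in doc.items():
        match = SLOT_PATTERN.match(field)
        if match:
            used[int(match.group(1))] = value

    if collection_id in used.values():
        return None  # already a member, nothing to update

    for slot in range(1, max_slots + 1):
        if slot not in used:
            return {
                "id": doc["id"],
                f"collection_{slot}_id_i": {"set": collection_id},
            }
    raise ValueError(f"all {max_slots} collection slots are in use")


current = {"id": "abc", "collection_1_id_i": 231, "collection_2_id_i": 214}
update = plan_collection_add(current, 300)
```

The returned dict would be posted to the update handler inside a JSON list. Whether Solr applies it as a true in-place update depends on how the field is defined in the schema.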

anjackson commented 2 years ago

We can store them in a long, but I couldn't see a way to facet on bits? Maybe I missed something?

tokee commented 2 years ago

You can't facet on bits in longs (well, one could build a special processor for it, but that would be tedious to maintain). But you could post-process the facet result and do the tallying of the individual collections there. But again: I prefer your solution. I'm just thinking out loud here.
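For illustration, the post-processing could look roughly like this, assuming a hypothetical long field whose facet values are the bitmasks and where bit N stands for collection N. The input is the flat `[value, count, ...]` facet list for that one field.

```python
from collections import Counter

UINT64_MASK = (1 << 64) - 1


def tally_bitmask_facets(flat):
    """Turn facet counts over bitmask values into counts per bit position.

    flat is [mask, count, mask, count, ...] for a single long field.
    Solr longs are signed, so a mask with bit 63 set arrives as a
    negative number; masking with UINT64_MASK recovers the raw bits.
    """
    totals = Counter()
    for value, count in zip(flat[0::2], flat[1::2]):
        mask = int(value) & UINT64_MASK
        bit = 0
        while mask:
            if mask & 1:
                totals[bit] += count
            mask >>= 1
            bit += 1
    return dict(totals)


# Hand-written sample: 10 docs with bits 0 and 2 set, 3 docs with bit 2 only.
per_collection = tally_bitmask_facets(["5", 10, "4", 3])
```

This only gives complete totals if every distinct mask is returned by the facet request, which is where the potentially enormous number of unique values becomes a practical problem.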

anjackson commented 2 years ago

Ah gotcha. And you're right, the updating will be tricky.