Add IDs for collections etc

ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.

https://github.com/ukwa/webarchive-discovery/wiki


Add IDs for collections etc #200

Closed anjackson closed 5 years ago

anjackson commented 5 years ago

Relying on textual facets for collections and subjects turned out to be a mistake. If we need to rename a collection (e.g. because it was misspelled), it forces a full re-index. We should add integers/IDs for these things so a future UKWA-UI can switch to those and avoid re-indexing.
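To illustrate why IDs avoid the re-index, here is a minimal sketch, not taken from the webarchive-discovery code. It assumes each indexed document carries collection IDs and that the display labels live in a separate table outside the index. The `collection_ids` field, the `labels` table, the sample documents and the `facet_counts` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical index contents: documents store collection IDs, not names.
docs = [
    {"url": "http://example.org/a", "collection_ids": [12]},
    {"url": "http://example.org/b", "collection_ids": [12, 40]},
    {"url": "http://example.org/c", "collection_ids": [40]},
]

# Hypothetical label table kept outside the index (e.g. in the UI's database).
labels = {12: "Brittish Politics", 40: "Science Blogs"}


def facet_counts(documents, label_table):
    """Count documents per collection ID, then resolve IDs to display labels."""
    counts = Counter(cid for d in documents for cid in d["collection_ids"])
    # Fall back to the raw ID if no label is known for it.
    return {label_table.get(cid, str(cid)): n for cid, n in counts.items()}


before = facet_counts(docs, labels)

# Fixing the misspelling only touches the label table; docs are unchanged.
labels[12] = "British Politics"
after = facet_counts(docs, labels)
```

Under these assumptions the rename is a one-row change in the label table, and the indexed documents never need to be rewritten.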

tokee commented 5 years ago

Fair enough, but what does that have to do with the webarchive-discovery code? If you want to use numbers as collection-identifiers, they are stored just as well as letters with the current code base!?

If you are thinking about adding integer fields, then I would advise against it: They are not integers conceptually (they are just IDs) and it will add clutter to the schema.

anjackson commented 5 years ago

Well, to ensure a smooth transition (rather than having to redo everything all at once), I was planning to index both the integer IDs and the textual ones in separate fields. We already have collection_id and collection, but I'd like the same for collections and possibly wct_subjects.
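A rough sketch of what indexing both forms side by side could look like during the transition. This is an illustration only, not the project's indexer: the `collections_id` and `wct_subjects_id` field names, the `NAME_TO_ID` tables, the ID values and the `with_parallel_ids` helper are all assumptions made for the example.

```python
# Hypothetical name-to-ID tables; real IDs would come from the curation system.
NAME_TO_ID = {
    "collections": {"British Politics": 12, "Science Blogs": 40},
    "wct_subjects": {"Government": 3},
}

# Textual field -> parallel ID field. The *_id names are assumptions.
PARALLEL_FIELDS = {
    "collections": "collections_id",
    "wct_subjects": "wct_subjects_id",
}


def with_parallel_ids(doc):
    """Return a copy of doc with an ID field next to each textual field."""
    out = dict(doc)
    for text_field, id_field in PARALLEL_FIELDS.items():
        table = NAME_TO_ID[text_field]
        # Names without a known ID are skipped rather than guessed.
        ids = [table[name] for name in doc.get(text_field, []) if name in table]
        if ids:
            out[id_field] = ids
    return out


doc = {
    "url": "http://example.org/",
    "collections": ["British Politics", "Science Blogs"],
    "wct_subjects": ["Government", "Unmapped Subject"],
}
indexed = with_parallel_ids(doc)
```

The textual fields are left untouched, so existing queries keep working while a new UI can start reading the ID fields.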

tokee commented 5 years ago

So it's not a question of field type, but a question of semantics for the fields? collection is free-form, while collection_id is controlled? Wouldn't you normally have one or the other and thereby just use a single field, whatever it is named? I guess I don't understand the use case fully.

anjackson commented 5 years ago

Two better ideas appear (in OH-SOS):

Use a dynamic field for this transition field.
Mix both kinds of value in the one field and switch based on the value? Not sure how faceting would work; presumably facet on 'String' OR 'Int'.
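For the second idea, one possible way to "switch based on the value" on the client side is sketched below. It assumes the facet response has already been reduced to a plain mapping of value to count, with every value returned as a string. The `split_facets` helper and the sample counts are hypothetical.

```python
def split_facets(facet_counts):
    """Split mixed facet values into ID-like and name-like groups.

    A value made up only of ASCII digits is treated as an ID; anything
    else is treated as a textual name.
    """
    ids, names = {}, {}
    for value, count in facet_counts.items():
        if value.isascii() and value.isdigit():
            ids[int(value)] = count
        else:
            names[value] = count
    return ids, names


# Hypothetical facet counts from a field holding both IDs and names.
mixed = {"12": 7, "British Politics": 7, "40": 3, "Science Blogs": 3}
ids, names = split_facets(mixed)
```

One caveat with this approach: a collection whose name consists only of digits would be misread as an ID, so the two kinds of value would need to be distinguishable by construction.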