ukwa / webarchive-discovery

WARC and ARC indexing and discovery tools.
https://github.com/ukwa/webarchive-discovery/wiki

Add IDs for collections etc #200

Closed anjackson closed 5 years ago

anjackson commented 5 years ago

Relying on textual facets for collections and subjects turned out to be a mistake. If we need to rename a collection (e.g. because it was misspelled), then it forces a full re-index. We should add integers/IDs for these things so a future UKWA-UI can switch to those and avoid re-indexing.

tokee commented 5 years ago

Fair enough, but what does that have to do with the webarchive-discovery code? If you want to use numbers as collection-identifiers, they are stored just as well as letters with the current code base!?

If you are thinking about adding integer fields, then I would advise against it: They are not integers conceptually (they are just IDs) and it will add clutter to the schema.

anjackson commented 5 years ago

Well, to ensure a smooth transition (rather than having to redo everything all at once), I was planning to index both the integer IDs and the textual ones in separate fields. We already have collection_id and collection, but I'd like the same for collections and possibly wct_subjects.
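
As a rough illustration of the dual-field idea, here is a minimal sketch, not taken from the webarchive-discovery code. The field name `collections_id`, the lookup table `COLLECTION_NAMES`, the helper `build_doc`, and the example IDs and names are all hypothetical. It only shows why a rename would not require re-indexing if the UI resolves labels from IDs.

```python
# Hypothetical lookup table kept outside the index: stable ID -> display name.
COLLECTION_NAMES = {101: "Brexit", 102: "Olympics 2012"}


def build_doc(url, collection_ids):
    """Return a Solr-style document carrying both IDs and textual labels."""
    return {
        "url": url,
        # Hypothetical ID field, indexed alongside the existing text field.
        "collections_id": list(collection_ids),
        "collections": [COLLECTION_NAMES[i] for i in collection_ids],
    }


doc = build_doc("http://example.org/", [101, 102])

# Renaming a collection only touches the lookup table. The indexed IDs
# stay valid, so a UI that facets on IDs needs no re-index.
COLLECTION_NAMES[102] = "London 2012 Olympics"
labels = [COLLECTION_NAMES[i] for i in doc["collections_id"]]
```

During a transition, older clients could keep faceting on the textual `collections` field while newer ones switch to the ID field.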

tokee commented 5 years ago

So it's not a question of field type, but a question of semantics for the fields? collection is free-form, while collection_id is controlled? Wouldn't you normally have one or the other and thereby just use a single field, whatever it is named? I guess I don't understand the use case fully.

anjackson commented 5 years ago

Two better ideas appear (in OH-SOS):