Specify provenance data for Document

niklasl commented 7 years ago

We need to nail down the intended usage (i.e. "meaning") of collection, changedIn and changedBy in Document & LDDB. Do they represent the original datasource and/or an LDP container, or something entirely internal? (Also, can the changed* and dependent logic be removed, if the Voyager two-way sync mechanism is to be removed?)

If they are not internal "scratch" data, they need to be modeled and put into the record description. (And become links to proper (collection) resources. See e.g. id, created and updated in the Document class for how record description data is used by the system itself.)

niklasl commented 7 years ago

Proposal: separate collection into datasource and marcCategorization (which is a function of the data). Questions:

Does this satisfy the needs of the OAI-PMH export and APIX export mechanics (which will need to be adapted to the change)?
Same thing with Voyager import...
Is changedIn now the same as datasource?
Update the LDDB versions table with currently missing columns (to avoid losing information)?
Should we control changes in datasource from one version to another (and if so, how)?
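
To illustrate what "a function of the data" could mean for marcCategorization, here is a minimal sketch. The type names and the mapping itself are assumptions made up for illustration, not the actual vocabulary or Document logic; only the auth, bib and hold category names come from the Voyager subsets discussed in this issue.

```python
from typing import Optional

# Hypothetical mapping from the base type of the described resource to a
# MARC category. The type names are illustrative assumptions only.
MARC_CATEGORY_BY_TYPE = {
    "Instance": "bib",
    "Item": "hold",
    "Concept": "auth",
    "Agent": "auth",
}


def marc_categorization(description: dict) -> Optional[str]:
    """Derive the MARC category from the description content alone.

    Returns None when the type has no known MARC category, so callers
    can decide how to treat resources outside the MARC subsets.
    """
    return MARC_CATEGORY_BY_TYPE.get(description.get("@type"))
```

Since the value is computed from the description, it would not need to be stored or kept in sync with datasource.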

niklasl commented 7 years ago

Ignore the notion of LDP containers until a real (client/import) need arises?

niklasl commented 7 years ago

Requirements:

changedIn/datasource can change over time (when data sources change from import processes to editing through interfaces, and sometimes "back" if resources are to be batch-processed).

Examples of sources: the current Voyager MARC subsets (auth, bib and hold), the 7 or so sets of resources in the definitions repository, import batches or other systems (bibdb, smdb?).

There can only be one datasource at a time for a record.
collection or similar is to be logical and generally fixed, based on the (base) type and/or categorization of the described resources within.

There is a common correlation between that and "source", and sometimes their "nature" is bound to their source (e.g. "LCSH", "SAO" and "SAB"). It can still be computed (as in derived from e.g. type or origin), and may be more of a tag mechanism for systems to simplify processing of richer data. (Cf. LDP indirect containers)

It is conceivable that a record belongs to more than one collection.

Given the varying stability requirements, the differences need to be considered, and each need has to depend on the right notion: a logically fixed one, based on description content, is commonly needed for e.g. MARC exports, whereas varying maintenance mechanisms (batch imports and deletions) require more dynamic lookup and changes (based on description management).
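
The requirements above could be sketched roughly as follows. This is only an assumption-laden illustration: the class, its field names and the source names ("voyager-bib", "editor") are hypothetical and do not reflect the current Document class or the LDDB schema.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass
class RecordDescription:
    """Hypothetical record description; all names are illustrative."""
    record_id: str
    # Exactly one datasource at any given time, but it may change.
    datasource: str
    # Logical and generally fixed; a record may belong to several.
    collections: FrozenSet[str] = frozenset()
    # Earlier datasources, oldest first, so changes are not lost.
    previous_datasources: List[str] = field(default_factory=list)

    def change_datasource(self, new_source: str) -> None:
        """Switch datasource, remembering the one being replaced."""
        if new_source == self.datasource:
            return
        self.previous_datasources.append(self.datasource)
        self.datasource = new_source


record = RecordDescription(
    "rec1", datasource="voyager-bib", collections=frozenset({"bib"})
)
record.change_datasource("editor")       # edited through an interface
record.change_datasource("voyager-bib")  # moved "back" for batch processing
```

Note that collections stays untouched while datasource varies, which is the split between the stable and the dynamic notion described above.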

(Also, a paged collection can be dynamically defined based on simple searches, as is done for e.g. items of an instance held by a specific library (find?itemOf.@id=<instance>&heldBy.@id=<library>), though, alas, this is done differently in the web view layer and the OAI-PMH view layer. This can include the above notions as well, if they are part of the actual data about the record (and not just internal strings in the lower data layer).)
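
As a small sketch of such a search-defined collection, the query from the example can be assembled like this. The function name and the example URIs are hypothetical; only the find path and the itemOf.@id and heldBy.@id parameters are taken from the example above, and the percent-encoding of the values is an assumption about what the endpoint accepts.

```python
from urllib.parse import urlencode


def items_held_by_query(instance_id: str, library_id: str) -> str:
    """Build a find query for the items of an instance held by a library."""
    params = [
        ("itemOf.@id", instance_id),
        ("heldBy.@id", library_id),
    ]
    # Keep "@" readable in the parameter names; everything else that is
    # not URL-safe in the values gets percent-encoded.
    return "find?" + urlencode(params, safe="@")


query = items_held_by_query(
    "https://example.org/instance1",  # hypothetical instance URI
    "https://example.org/library/S",  # hypothetical library URI
)
```

If datasource or collection were real properties in the record data, they could presumably be added as further parameters in the same way.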

We need to avoid letting our internal mechanisms proliferate when a few core notions can satisfy a wide range of related needs. The collection and changedIn properties can be formalized and managed as RDF data about a record (a "gbox"), and thus utilized for the current and conceivable scenarios ahead. By defining these as proper resources, we can also state provenance data such as date, responsibility and documentation of e.g. purpose.
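
A rough sketch of what that could look like as JSON-LD-style data, built here as plain Python dicts. Every property name other than collection is an assumption for illustration (datasource follows the proposal earlier in this issue, the provenance keys are made up), and the URIs are placeholders.

```python
import json


def datasource_resource(source_id: str, created: str,
                        responsible_id: str, purpose: str) -> dict:
    """Describe a datasource as a resource carrying its own provenance."""
    return {
        "@id": source_id,
        "created": created,                         # date (assumed key)
        "responsibleAgent": {"@id": responsible_id},  # assumed key
        "purpose": purpose,                         # assumed key
    }


def record_description(record_id: str, source_id: str,
                       collection_ids: list) -> dict:
    """Link a record to its datasource and collections by reference."""
    return {
        "@id": record_id,
        "datasource": {"@id": source_id},
        "collection": [{"@id": c} for c in collection_ids],
    }


source = datasource_resource(
    "https://example.org/datasource/batch1",  # placeholder URI
    "2017-01-01",                             # placeholder date
    "https://example.org/agent/importer",     # placeholder URI
    "Illustrative import batch",
)
record = record_description(
    "https://example.org/record/1", source["@id"],
    ["https://example.org/collection/bib"],
)
print(json.dumps(record, indent=2))
```

The point is only that the record holds links, while date, responsibility and purpose live on the linked resource.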

mxtthias commented 7 years ago

Just a note to remind us that whatever solution we implement for collection, the concept occurs in a whole bunch of places, like the importers app in librisxl and Document in whelk-core.

libris / whelk-core

Specify provenance data for Document #2