Open tiborsimko opened 9 years ago
@glouppe: here is a detailed description of how we are going to store relations across records.
@tiborsimko:
- an API to list all depending authority records;
- an API to enhance record JSON by dereferencing JSON `$ref` pointers among authority/bibliographic records;
- an API to update all dependent bibliographic records (in Elasticsearch) following an authority record change;
- an API to change all dependent bibliographic records (in PostgreSQL) to point from one authority record to another;
- an API to remove authority record and update its dependencies (in PostgreSQL);
These APIs are going to be the same also in the context of non-authority records. E.g. the same needs arise for dealing with `cited-by`, with `superseded-by`, etc. I think these APIs would more naturally belong to `invenio-records`.
- an API to change all dependent bibliographic records (in PostgreSQL) to point from one authority record to another;
- an API to remove authority record and update its dependencies (in PostgreSQL);
Note: these 2 would imply also re-sending updates to Elasticsearch.
These APIs are going to be the same also in the context of non authority records.
Indeed, to a large extent only the look-up of related records would differ. (Touching also JSON Hyper-Schema.)
Note: these 2 would imply also re-sending updates to Elasticsearch.
Yes, that's happening by default with any update done to any record in PostgreSQL.
Note: these 2 would imply also re-sending updates to Elasticsearch.
Yes, that's happening by default with any update done to any record in PostgreSQL.
Yes, but in the case of PostgreSQL, with JSONB, we could in principle update all references across all records with one query. In that case the signal to Elasticsearch does not come for free (whereas it does if you overwrite one record at a time with one full blob, à la bibupload). For performance reasons it would be great if, under the hood, these APIs exploited JSONB when possible.
These APIs are going to be the same also in the context of non authority records.
Indeed, to a large extent only the look-up of related records would differ. (Touching also JSON Hyper-Schema.)
So will these be part of `invenio-records`? At worst, you could call this very module `invenio-relations`, and build `invenio-authorities` on top of it.
`invenio-records` should stay very slim, only executing the orders it is given. It would be more advantageous if an upper layer did the intelligent record merging business (which may or may not require human decisions...).
A few open questions:
Right. On the other hand `invenio-authorities`, as defined above, does two steps too much. Would you recommend using `invenio-authorities` to handle citations? Wouldn't it be strange to call APIs from `invenio-authorities` in order to update all cited records?
With pointers from records to authors, do we also need to add explicitly back reference pointers from authors to records? (and maintain those upon changes?)
Wouldn't they be retrievable with a query? (e.g.: give me all the records that match `authors.__authority__.$ref == "http://inspirehep.net/record/123"`)
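The query idea above can be sketched in plain Python over in-memory records. This is only an illustration: the function name `find_dependent_records`, the sample records and their field layout are assumptions, and a real implementation would push the match down to PostgreSQL or Elasticsearch instead of scanning a list.

```python
def find_dependent_records(records, authority_url):
    """Return the records with at least one author pointing at authority_url.

    Hypothetical helper: mirrors the query
    authors.__authority__.$ref == authority_url on in-memory dicts.
    """
    matches = []
    for record in records:
        for author in record.get("authors", []):
            ref = author.get("__authority__", {}).get("$ref")
            if ref == authority_url:
                matches.append(record)
                break  # one matching author is enough for this record
    return matches


# Illustrative data only; titles and names are made up.
records = [
    {
        "title": "Paper A",
        "authors": [
            {"__authority__": {"$ref": "http://inspirehep.net/record/123"}}
        ],
    },
    {"title": "Paper B", "authors": [{"name": "Doe, J."}]},
]

dependents = find_dependent_records(records, "http://inspirehep.net/record/123")
print([record["title"] for record in dependents])
```

With such a lookup available, explicit back-reference pointers from authors to records would not need to be stored and maintained.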
So @tiborsimko how do we reach a consensus on this WIP?
cc: @jmartinm, @jalavik (as how we amend the data model for relations depends on this)
Background
In Invenio v1, the bibliographic records (in MARC) could point to authority records (in MARC) via the `$0` referencing technique, pointing to a certain authority database with a certain authority ID. For example, the bibliographic record would contain:
and the authority record would contain:
This permitted for example to:
and more.
The behaviour was configurable via `CFG_BIBAUTHORITY_*` variables; see, for example, the BibAuthority Admin Guide.
In Invenio v2, the `invenio-authorities` module should reproduce this functionality for general non-MARC records.
Storage
In Invenio v2, the records are described as JSON snippets that match certain JSON schemas. The reference between a bibliographic record and an authority record can be done via standard JSON reference and JSON pointer techniques.
For example, the bibliographic record may look like:
and the authority record may look like:
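The shape of such a pair can be sketched in Python as two dicts. The field names, the values and the record URL here are illustrative assumptions, not the schema actually adopted:

```python
import json

# Hypothetical bibliographic record: the author entry holds only a pointer
# to the authority record, expressed as a standard JSON reference.
bibliographic_record = {
    "title": "An example paper",
    "authors": [
        {"__authority__": {"$ref": "http://inspirehep.net/record/123"}}
    ],
}

# Hypothetical authority record that the pointer above refers to.
authority_record = {
    "name": "Doe, Jane",
    "affiliation": "Example Institute",
}

# Both are plain JSON documents and serialise as such.
print(json.dumps(bibliographic_record, indent=2))
print(json.dumps(authority_record, indent=2))
```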
One of the advantages of using `$ref` pointers is that one can natively point either to a local authority record or a remote authority record, say CERN Open Data pointing to an author record in INSPIRE.
The "master records" would be stored "as is", including `$ref`. The dereferencing would be done later, see the section about "indexing" below.
Note that master records can use nicer URI schemes, such as:
and this even for Invenio-managed authority sources, as we are progressively dropping (1) the constraint of using the same `/record` namespace for all records and (2) the constraint of using only numeric record IDs.
Finally, note that the internal `__authority__` convention makes it possible to distinguish between a field that is under authority control but has other local values, and a field that purely references another field, such as:
and:
because `$ref` must be the single member of a JSON reference. Hence the usefulness of the `__authority__` convention in this document, though the final implementation may differ. (See also JSON Hyper-Schema.)
Indexing
Records that are stored in JSON in PostgreSQL can be "enriched" or "enhanced" before they are sent to Elasticsearch for information retrieval processing. (E.g. Pythonic tokenisers can add alternative terms before custom Elasticsearch analysers get implemented.) It is at this "enhancing" stage that the authority module would dereference `$ref` pointers.
The dereferencing is complete, i.e. all pointed-to information is included. In case of performance or information separation concerns, the dereferencing can be partial, i.e. only parts of the pointed-to information are included via the selective JSON pointer technique (`/foo/bar`).
If the reference is local, the dereferencing step can be done locally inside PostgreSQL via local JSONB queries. If the reference is remote, the standard JSON dereferencing can be used.
Elasticsearch now has access to the enriched JSON and can therefore include additional information in index configurations.
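The enhancing step could look roughly like the following sketch. The names `enhance`, `resolve_pointer` and `store` are assumptions made for illustration. In particular, `store` is a plain dict standing in for whatever performs the real lookup (a local PostgreSQL query or a remote fetch), and references nested inside the fetched authority record are not followed.

```python
import copy


def resolve_pointer(document, pointer):
    """Follow a JSON pointer such as '/foo/bar'; '' selects the whole document."""
    tokens = pointer.split("/")[1:] if pointer else []
    target = document
    for token in tokens:
        # Undo the JSON pointer escapes for '/' and '~'.
        token = token.replace("~1", "/").replace("~0", "~")
        target = target[int(token)] if isinstance(target, list) else target[token]
    return target


def enhance(node, store):
    """Return a copy of node with every {"$ref": url} replaced by its target.

    A URL fragment (the part after '#') is treated as a JSON pointer, which
    gives the partial dereferencing described above.
    """
    if isinstance(node, dict):
        if set(node) == {"$ref"}:
            url, _, fragment = node["$ref"].partition("#")
            return copy.deepcopy(resolve_pointer(store[url], fragment))
        return {key: enhance(value, store) for key, value in node.items()}
    if isinstance(node, list):
        return [enhance(item, store) for item in node]
    return node


# Illustrative data only.
store = {
    "http://inspirehep.net/record/123": {
        "name": "Doe, Jane",
        "affiliation": "Example Institute",
    }
}
record = {
    "title": "An example paper",
    "authors": [
        {"__authority__": {"$ref": "http://inspirehep.net/record/123"}},
        {"__authority__": {"$ref": "http://inspirehep.net/record/123#/name"}},
    ],
}

enhanced = enhance(record, store)
print(enhanced["authors"][0]["__authority__"])  # complete dereferencing
print(enhanced["authors"][1]["__authority__"])  # partial, via '/name'
```

The master record is left untouched, so the `$ref` pointers remain the stored form and only the copy sent to Elasticsearch carries the dereferenced data.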
Displaying
Since record display is done from the enriched JSON coming from Elasticsearch, the output Jinja templates have direct access to the referenced authority information and can therefore directly display the wanted fields.
Management
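The `invenio-authorities` module is a layer above `invenio-records`, which stores the records themselves. `invenio-authorities` offers an API to work with `__authority__`-governed records and fields in both directions:
- an API to list all depending authority records;
- an API to enhance record JSON by dereferencing JSON `$ref` pointers among authority/bibliographic records;
- an API to update all dependent bibliographic records (in Elasticsearch) following an authority record change;
- an API to change all dependent bibliographic records (in PostgreSQL) to point from one authority record to another;
- an API to remove authority record and update its dependencies (in PostgreSQL);
etc., following our usual CRUD principles.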
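As one example of these operations, re-pointing dependent records from one authority record to another might be sketched as below. The function name `repoint` and the in-memory list of records are assumptions. The actual module would operate on PostgreSQL, possibly with a single JSONB query as suggested in the discussion, and then re-send the touched records to Elasticsearch.

```python
def repoint(records, old_url, new_url):
    """Rewrite every $ref equal to old_url in place.

    Returns the records that changed, i.e. the ones that would need to be
    re-sent to the search index. Hypothetical helper for illustration.
    """

    def rewrite(node):
        changed = False
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "$ref" and value == old_url:
                    node[key] = new_url
                    changed = True
                elif rewrite(value):
                    changed = True
        elif isinstance(node, list):
            for item in node:
                if rewrite(item):
                    changed = True
        return changed

    return [record for record in records if rewrite(record)]


# Illustrative data only.
records = [
    {
        "title": "Paper A",
        "authors": [
            {"__authority__": {"$ref": "http://inspirehep.net/record/123"}}
        ],
    },
    {
        "title": "Paper B",
        "authors": [
            {"__authority__": {"$ref": "http://inspirehep.net/record/456"}}
        ],
    },
]

touched = repoint(
    records,
    "http://inspirehep.net/record/123",
    "http://inspirehep.net/record/789",
)
print([record["title"] for record in touched])
```

Because the walk is generic over `$ref` members, the same sketch would apply to other relations such as `cited-by` or `superseded-by`, which is the argument made in the discussion for a relation-agnostic layer.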
Additional notes
Besides references cited above inline, see also: