inveniosoftware-attic / invenio-authorities

Invenio module that adds support for authority records
https://invenio-authorities.readthedocs.io

WIP initial implementation #1

Open tiborsimko opened 9 years ago

tiborsimko commented 9 years ago

Background

In Invenio v1, bibliographic records (in MARC) could point to authority records (in MARC) via the $0 referencing technique, which names a certain authority database and a certain authority ID within it. For example, the bibliographic record would contain:

000000014 100__ $$0AUTHOR|(SzGeCERN)aaa0005$$0INSTITUTE|(SzGeCERN)iii0002$$aEllis, J$$uCERN

and the authority record would contain:

000000118 001__ 118
000000118 035__ $$aAUTHOR|(DLC)n 80141717
000000118 035__ $$aAUTHOR|(VIAF)56770935
000000118 035__ $$aAUTHOR|(SzGeCERN)aaa0005
000000118 100__ $$aEllis, John$$d1946-
000000118 400__ $$aEllis, J.$$d1946-$$q(John),
000000118 400__ $$aEllis, Jonathan Richard$$d1946-

This permitted for example to:

and more.

The behaviour was configurable via CFG_BIBAUTHORITY_* variables; see, for example, the BibAuthority Admin Guide.

In Invenio v2, the invenio-authorities module should reproduce this functionality for general non-MARC records.

Storage

In Invenio v2, the records are described as JSON snippets that match certain JSON schemas. The reference between a bibliographic record and an authority record can be done via standard JSON reference and JSON pointer techniques.

For example, the bibliographic record may look like:

{
   "id": 123,
   "title": "On the foo",
   "authors": [
      {
        "name": "Doe, J",
        "__authority__": { "$ref": "http://localhost/record/118?of=recjson" }
      },
      {
        "name": "Mustermann, E"
      }
    ],
    "keywords": [
       {
           "__authority__": { "$ref": "http://localhost/record/222?of=recjson" }
       },
       {
           "keyword": "physics"
       }
    ]
}

and the authority record may look like:

{
  "id": 118,
  "name": "Doe, John",
  "alternative_names": [
      "Doe, J",
      "Doe, Johnny"
   ],
  "birth year": 2002
}

One of the advantages of using $ref pointers is that one can natively point either to a local authority record or a remote authority record, say CERN Open data pointing to an author record in INSPIRE.

The "master records" would be stored "as is", including $ref. The dereferencing would be done later, see the section about "indexing" below.

Note that master records can use nicer URI schemes, such as:

"__authority__": { "$ref": "http://example.org/grant/project-bar" }

and this holds even for Invenio-managed authority sources, as we are progressively dropping (1) the constraint of using the same /record namespace for all records and (2) the constraint of using only numeric record IDs.

Finally, note that the internal __authority__ convention makes it possible to distinguish between a field that is under authority control but may also carry other local values, and a field that purely references another field, such as:

    "keywords": [
       {
           "__authority__": { "$ref": "http://example.org/keywords/foo" }
       }
    ]

and:

    "keywords": [
       {
          "$ref": "http://example.org/keywords/foo/bar/0"
       }
    ]

because $ref must be the only member of a JSON reference object. Hence the usefulness of the __authority__ convention in this document, though the final implementation may differ. (See also JSON Hyper-Schema.)
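
A minimal sketch of how code could tell the two forms apart, assuming records are plain Python dicts. The function names `classify_entry` and `local_values` are hypothetical and only illustrate the convention described above; they are not part of any Invenio API.

```python
def classify_entry(entry):
    """Say how a single field entry relates to an authority record.

    'pure-reference'       -> the entry itself is a JSON reference
    'authority-controlled' -> the entry carries an __authority__ reference,
                              possibly next to local values
    'local'                -> the entry has no authority link
    """
    if "$ref" in entry:
        return "pure-reference"
    if "$ref" in entry.get("__authority__", {}):
        return "authority-controlled"
    return "local"


def local_values(entry):
    """Return the local values of an entry, leaving out the authority link."""
    return {key: value for key, value in entry.items() if key != "__authority__"}
```

With the bibliographic record shown earlier, the first author would be classified as authority-controlled with the local value "Doe, J", while the second author would be purely local.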

Indexing

Records that are stored in JSON in PostgreSQL can be "enriched" or "enhanced" before they are sent to Elasticsearch for information retrieval processing. (E.g. Pythonic tokenisers can add alternative terms before custom Elasticsearch analysers get implemented.) It is at this "enhancing" stage that the authority module would dereference $ref pointers.

The dereferencing is complete, i.e. all the pointed-to information is included. In case of performance or information-separation concerns, the dereferencing can be partial, i.e. only parts of the pointed-to information are included, selected via JSON pointers (/foo/bar).

If the reference is local, the dereferencing step can be done locally inside PostgreSQL via local JSONB queries. If the reference is remote, the standard JSON dereferencing can be used.
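
The enhancing step could be sketched roughly as follows. This is an illustration under stated assumptions, not the module's implementation: `enrich`, `resolve_pointer` and the `fetch` callable (which maps a URI to an authority record, whether by a local lookup or a remote request) are hypothetical names, and the enriched copy simply replaces each `__authority__` reference with the dereferenced content.

```python
import copy


def resolve_pointer(document, pointer):
    """Follow a JSON pointer such as '/alternative_names/0' inside document."""
    target = document
    for token in pointer.split("/")[1:]:
        # Undo JSON pointer escaping: '~1' stands for '/', '~0' for '~'.
        token = token.replace("~1", "/").replace("~0", "~")
        target = target[int(token)] if isinstance(target, list) else target[token]
    return target


def enrich(record, fetch, pointers=None):
    """Return a copy of record with every __authority__ $ref dereferenced.

    fetch    -- hypothetical callable: URI -> authority record (dict)
    pointers -- optional list of JSON pointers; if given, only those parts
                of the authority record are included (partial dereferencing)
    """
    enriched = copy.deepcopy(record)  # the master record stays untouched
    _walk(enriched, fetch, pointers)
    return enriched


def _walk(node, fetch, pointers):
    if isinstance(node, list):
        for item in node:
            _walk(item, fetch, pointers)
    elif isinstance(node, dict):
        authority = node.get("__authority__")
        if isinstance(authority, dict) and "$ref" in authority:
            target = fetch(authority["$ref"])
            if pointers is None:
                node["__authority__"] = copy.deepcopy(target)
            else:
                node["__authority__"] = {
                    pointer: resolve_pointer(target, pointer) for pointer in pointers
                }
        for key, value in node.items():
            if key != "__authority__":
                _walk(value, fetch, pointers)
```

For a quick experiment, `fetch` can be the `__getitem__` method of a dict keyed by URI; passing `pointers=["/name"]` would then include only the authorised name in the enriched record.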

Elasticsearch now has access to the enriched JSON and can therefore include the additional information in its index configurations.

Displaying

Since record display is done from the enriched JSON coming from Elasticsearch, the output Jinja templates have direct access to the referenced authority information and can therefore directly display the wanted fields.

Management

The invenio-authorities module is a layer above invenio-records, which stores the records themselves. invenio-authorities offers an API to work with __authority__-governed records and fields in both directions:

etc, following our usual CRUD principles.

Additional notes

Besides references cited above inline, see also:

kaplun commented 9 years ago

@glouppe: this details how we are going to store relations across records.

@tiborsimko:

  • an API to list all depending authority records;
  • an API to enhance record JSON by dereferencing JSON $ref pointers among authority/bibliographic records;
  • an API to update all dependent bibliographic records (in Elasticsearch) following an authority record change;
  • an API to change all dependent bibliographic records (in PostgreSQL) to point from one authority record to another;
  • an API to remove authority record and update its dependencies (in PostgreSQL);

These APIs are going to be the same in the context of non-authority records too, e.g. the same needs arise when dealing with cited-by, with superseded-by, etc. I think these APIs would more naturally belong to invenio-records.

kaplun commented 9 years ago

  • an API to change all dependent bibliographic records (in PostgreSQL) to point from one authority record to another;
  • an API to remove authority record and update its dependencies (in PostgreSQL);

Note: these 2 would imply also re-sending updates to Elasticsearch.

tiborsimko commented 9 years ago

These APIs are going to be the same also in the context of non authority records.

Indeed, to a large extent only the look-up of related records would differ. (Touching also JSON Hyper-Schema.)

Note: these 2 would imply also re-sending updates to Elasticsearch.

Yes, that's happening by default with any update done to any record in PostgreSQL.

kaplun commented 9 years ago

Note: these 2 would imply also re-sending updates to Elasticsearch.

Yes, that's happening by default with any update done to any record in PostgreSQL.

Yes, but in the case of PostgreSQL, with JSONB, we could in principle update all references across all records with one query. In that case the signal to Elasticsearch does not come for free (whereas it does if you overwrite one record at a time with one full blob, à la bibupload). For performance reasons it would be great if, under the hood, these APIs exploited JSONB when possible.

kaplun commented 9 years ago

These APIs are going to be the same also in the context of non authority records.

Indeed, to a large extent only the look-up of related records would differ. (Touching also JSON Hyper-Schema.)

So will these be part of invenio-records? At worst, you could call this very module invenio-relations, and build on top of it invenio-authorities.

tiborsimko commented 9 years ago

invenio-records should stay very slim, only executing the orders it is given. It would be more advantageous if an upper layer does the intelligent record merging business. (which may or may not require human decisions...)

glouppe commented 9 years ago

A few open questions:

kaplun commented 9 years ago

Right. On the other hand, invenio-authorities, as defined above, does two steps too many. Would you recommend using invenio-authorities to handle citations? Wouldn't it be strange to call APIs from invenio-authorities in order to update all cited records?

kaplun commented 9 years ago

With pointers from records to authors, do we also need to add explicitly back reference pointers from authors to records? (and maintain those upon changes?)

Wouldn't they be retrievable with a query? (e.g.: give me all the records that match authors.__authority__.$ref == "http://inspirehep.net/record/123")
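
A rough in-memory sketch of such a look-up, assuming records are plain Python dicts. The function name `records_pointing_to` is hypothetical; a real implementation would presumably run the equivalent query in PostgreSQL or Elasticsearch rather than scan records in Python.

```python
def records_pointing_to(records, field, authority_uri):
    """Yield every record in which some entry of `field` references authority_uri."""
    for record in records:
        for entry in record.get(field, []):
            if entry.get("__authority__", {}).get("$ref") == authority_uri:
                yield record
                break  # one matching entry is enough for this record
```

Calling `records_pointing_to(records, "authors", "http://inspirehep.net/record/123")` would then give the back references without storing them explicitly on the author record.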

kaplun commented 9 years ago

So @tiborsimko how do we reach a consensus on this WIP?

kaplun commented 9 years ago

cc: @jmartinm, @jalavik (since how we amend the data model for relations depends on this)