RFC Centralizing record linking

inveniosoftware / invenio

Invenio digital library framework

https://invenio.readthedocs.io

MIT License

625 stars 292 forks source link

RFC Centralizing record linking #2573

Closed kaplun closed 9 years ago

kaplun commented 9 years ago

Current model in Invenio master

Linking among records exist by virtue of dedicated MARC subfields.

Citations

E.g. in Invenio master out of the box, this is a reference (as generated by refextract):

001329723 999C5 $$0647131$$hC. Quesne and V. M. Tkachuk$$o2$$sJ.Phys.,A37,4267$$y2004

$$0 contains the official citation from record 1329723 to record 647131.

This has the following characteristics:

This implementation exposes the internal record IDs.
In case a there is no $0 the other subfields are passed to search engine (by BibRank and by BibFormat) in order to try to find at least one matching record (which in case would be stored subsequently in $0)
In order to know which records are cited by the current one, it's possible to use the search engine with 999C50:1329723 or use the citation dicts wich are cached by BibRank.
When a new record is added in the system which happens to be the one referred by a given citation through the other subfields, the new record has to be searched in 999C5s (using its 773) in order to see if some potential citation has to updated, in this case:

000647131 773__ $$c4267-4281$$pJ.Phys.$$vA37$$y2004

Authors

Authors in 100__ and 700__ have their exact name used to identify the specific MARC field, which is then connected in the aidPERSONIDPAPERS table to an author profile.

These author profiles are connected to authority MARC records via e.g. an internal ID such as ORCID.

Affiliation

In the INSPIRE use case affiliations, next to authors in a paper, are literally connected with a corresponding institute record (via exact string matching).

Photos/Album

In CDS photos are connected to the parent album in a way that photos metadata are agumented with album metadata at indexing time.

Issues with the current model

There is not 1 model but several. Each type of link had to implement a dedicated daemon to enforce consistency, each type of link is manipulated in a dedicated way and indexed/cached in a dedicated way
By explicitly writing in the metadata the recid of the linked record, in case the latter is merged into another one, the latter has to be updated
It's not possible to express linking with a given level of confidence (e.g. imagine when a link is not made in an authoritative way but only guessed).
Proposal

Centralize the control of links by having a new table (or SQLAlchemy Model) with the form:

uuid	from	link type	to	confidence
1234-5678-1234-5678	123	refersto	456	0.9
2345-1234-1243-1245	234	authoredby	567	1.0
2345-1234-1243-1245	234	affiliatedat	678	0.5

The uuid would be associated with a given recjson field upon ingestion by the uploader workflow, for all those fields that have an outgoing link (for example each author field within a record would receive its own uuid, the same would happen for each reference).
Ideally, only relations from records representing static entities (e.g. papers) to dynamic one (e.g. people or institutions) should be stored, while the opposite direction should be calculated (and cached) at runtime. In this way we would avoid inconsistencies and continuous swarm of updates.
The confidence level could be used to store an authoritative link (1.0) or a weaker one (e.g. when it was guessed by a machine learning algorithm or by a fuzzy search). Ideally storing 0 would also have the value of ground truth.
This table could be used to populate calculated fields in recjson, where the recid is stored at runtime when the record is sent to be indexed to ElasticSearch.
When a relation is coming at input time (e.g. when an incoming paper, states that the author foo is really the one of record 123) this relation will be interpreted by a given recjson imported and written into the above relation table (a uuid is generated on the fly).
When a record is modified the uuid should be preserved, when the modification is not about changing a given link.
A new generic widget could be used that curators directly edit specific subset of this table (e.g. all the relations of type refersto for record 123. Such a widget could let the curator assign external records for not resolved relations by allowing for autocompletion etc.
The new bibrank for citation would simply look at this table
The new bibauthorid would manipulate this table
The new indexer based on elastic search would consume this table trough recjson
The new formatter would use this table to create records link

This table might grow over time but, thanks to SQLAlchemy transparent sharding it could be sharded WRT the link_type property, thus mapping to several specific tables.

Side note

Connected with this RFC is the question whether we should still identify records by integers or rather switch to use more generic uuid and namespace them e.g. to use the same tool to handle not only bibliographic records but also annotations, comments, documents, library items... This should be the subject of a new RFC.

glouppe commented 9 years ago

The uuid would be associated with a given recjson field upon ingestion by the uploader workflow, for all those fields that have an outgoing link.

This should be stressed. Uuid are not only for records, but also for fields inside records (like signatures, references, etc).

kaplun commented 9 years ago

Thanks for the feedback Gilles. I have just updated the RFC with examples.

aw-bib commented 9 years ago

Some comments

Authors

It seems not possible to use any of those authorprofile things in a "usual" installation, ie. if one is not Inspire. Simple cause in usual circumstances you never have a (almost) complete database of a field like inspire so all these fancy authormatching tools are in vain and you need to do it by hand. Thus in this case $0 refers to an authority record directly by means of an ID. (Names are quite useless.)

An additional complexity might arrise as the same author may have a bunch of ids. E.g. (from our real world) ORCiD, Inspire-ID, Researcher-Id and Scopus-ID and PubmedID. Upon ingestion of foreighn records you may get either of them. It has to be considered how to treat this. One could in principle rewrite the ID to the one used locally upon ingestion. This may run into trouble, however, if at some later point the prefered Id changes. (As one needs to rewrite all recs.)

Affiliations

Inspire uses repeatable $u and string matching. In HEP this may work as only a limited number of institutions are in the game. In general one can easily run out of unique names.

Therefore a model similar to authors linking via an unique id to an authority record is preferable. In case of affiliation to people linking within record level I suggest to take inner-record linking into account. Ie. linking from one datafield to another in the very same record. join2 uses this to link affiliations between 1001/7001 and 9101. Each 9101 is of the logical quality of your authors usecase (containing Id and name forms for an institution) while the linking from 1001/7001 -> 9101_ is done by means of subfiled $b (author number in the list). This also handles the case where not all authors have a unique id.

I think, that this model would constitute one of the most general use cases.

Side note

Connected with this RFC is the question whether we should still identify records by integers or rather switch to use more generic uuid and namespace

I'd strongly encourage a more abstract entity. The current recid aproach is fine cause it's fast but it's really troublesome if you interchange records between several instances. Thus we avoid this altogether in join2 and refer to other ids only. (E.g. journals, authors, institutes, statistics keys, ... all have unique ids, the prefered ones in 035_a all possible ones in 0247) This allows us to exchange records by means of OAI and still keep things consistent even if of course all record ids change between all the systems in our network. IMHO this switch is strongly advisable even if you want to handle only bibliographic records.

kaplun commented 9 years ago

Hi @aw-bib ,

Simple cause in usual circumstances you never have a (almost) complete database of a field like inspire so all these fancy authormatching tools are in vain and you need to do it by hand. Thus in this case $0 refers to an authority record directly by means of an ID. (Names are quite useless.)

I am not fully understanding what you are trying to explain here: For a given author in a paper, either we have a record representing the author, or we don't have it. What is being proposed here is, in the former scenario, to always have a uuid for the given authorship in the paper, and to link it (either manually in an authoritative form, or automatically in a fancy authormatching way) to the corresponding author(ity) record. If such a record does not exist (in the latter scenario), no realtion is going to be written in the proposed table. At the time of indexing and exporting the paper record, the uuid that have been associated to existing person records are resolved, and an appropriate $0 (or equivalent) is populated on the fly.

An additional complexity might arise as the same author may have a bunch of ids. E.g. (from our real world) ORCiD, Inspire-ID, Researcher-Id and Scopus-ID and PubmedID. Upon ingestion of foreighn records you may get either of them. It has to be considered how to treat this. One could in principle rewrite the ID to the one used locally upon ingestion. This may run into trouble, however, if at some later point the prefered Id changes. (As one needs to rewrite all recs.)

Indeed upon ingestion, the corresponding workflow should be able to interpret all this incoming IDs resolve them to a linked record and write the appropriate relation in the table. Then upon exporting the same behavior as above mentioned could be followed (i.e. to re-use the relation table to expose $0s computed at the end).

One question that arise here though is: what shall we do with the incoming IDs? Shall we write them in the master record, beside resolving UUIDs with the above technique? Probably yes, because in this way, if an incoming ID has not been resolved it can be preserved for the future, in order to resolve it later when the corresponding linked record is created. At this point, IMHO, I think the relation table should become the master source of information to declare what record is linked with what other record and the original IDs should no longer be looked at for the regular internal tasks.

Affiliations Inspire uses repeatable $u and string matching. In HEP this may work as only a limited number of institutions are in the game. In general one can easily run out of unique names.

Exactly, that's why the proposal to use UUID rather than strings.

Therefore a model similar to authors linking via an unique id to an authority record is preferable. In case of affiliation to people linking within record level I suggest to take inner-record linking into account. Ie. linking from one datafield to another in the very same record. join2 uses this to link affiliations between 1001/7001 and 9101. Each 9101 is of the logical quality of your authors usecase (containing Id and name forms for an institution) while the linking from 1001/7001 -> 9101_ is done by means of subfiled $b (author number in the list). This also handles the case where not all authors have a unique id. I think, that this model would constitute one of the most general use cases.

Can you provide with a concrete example to make it more clear? I fear linking intra-fields within a record is not supported in recjson, but since we are using JSON and we are not restricted to the 2-level restriction of MARC, we are free to nest more information if needed. So instead of linking the corresponding 9101 of your case, one would include a copy of it within the tree structure of the 1001. (For the non MARC users, I fear this sounds totally unintelligible :smile:)

Connected with this RFC is the question whether we should still identify records by integers or rather switch to use more generic uuid and namespace I'd strongly encourage a more abstract entity. The current recid aproach is fine cause it's fast but it's really troublesome if you interchange records between several instances. Thus we avoid this altogether in join2 and refer to other ids only. (E.g. journals, authors, institutes, statistics keys, ... all have unique ids, the prefered ones in 035_a all possible ones in 0247) This allows us to exchange records by means of OAI and still keep things consistent even if of course all record ids change between all the systems in our network. IMHO this switch is strongly advisable even if you want to handle only bibliographic records.

Thanks for the feedback, this show the importance to discuss also this topic then.

aw-bib commented 9 years ago

Hi @kaplun,

Simple cause in usual circumstances you never have a (almost) complete database of a field like inspire so all these fancy authormatching tools are in vain and you need to do it by hand. Thus in this case $0 refers to an authority record directly by means of an ID. (Names are quite useless.)

I am not fully understanding what you are trying to explain here:

I understood your author sample as based on the automatic author clustering done in Inspire. I just wanted to note, that automatic author clustering will not work for any other institution than Inspire. Probably, I mixed up the authorid stuff?

For a given author in a paper, either we have a record representing the author, or we don't have it.

Agree. Common is especially "we don't have it". E.g. in our repos "we don't have it" coincides with "we don't (==can not) care as she is not a member of our institution".

What is being proposed here is, in the former scenario, to always have a uuid for the given authorship in the paper, and to link it (either manually in an authoritative form, or automatically in a fancy authormatching way) to the corresponding author(ity) record.

Probably, uuid was miss leading. It does not refer to a persons authority record id, right? It can be something temporary/internally, right? I understand that in case you get 10 papers from a Joe Doe without any external ids your proposal is to generate 10 different uuids unless some other process, whatever that is says "ok, for records 5,6,7 and 9 the uuid associated with string Joe Doe should be 12345, drop the temporary ones".

If such a record does not exist (in the latter scenario), no relation is going to be written in the proposed table. At the time of indexing and exporting the paper record, the uuid that have been associated to existing person records are resolved, and an appropriate $0 (or equivalent) is populated on the fly.

I wonder if it is not better to populate the name from the authority and base all this stuff on the ids instead. This allows you to fix spelling errors easily. OTOH one might have cases where you want to preserve (even if it's wrong or different) what's written in the original record. (E.g. name changes of authors.)

But in any case you'll need to tackle things like two authors with the same name on the same paper. (Especially if you only get initials.)

[...]

Indeed upon ingestion, the corresponding workflow should be able to interpret all this incoming IDs resolve them to a linked record and write the appropriate relation in the table.

That is you base matching of Ids to ingestion and have sort of a "master id", right? (That's what we do in join2 repos. E.g. we get inspire ids, match them and store our ids.)

One question that arise here though is: what shall we do with the incoming IDs? Shall we write them in the master record, beside resolving UUIDs with the above technique? Probably yes, because in this way, if an incoming ID has not been resolved it can be preserved for the future, in order to resolve it later when the corresponding linked record is created.

I think it would be clever to store them, yes. Based on my marcish view of bibliographic records one can think of something like master id (ie. authority id) to $0, other id to some other subfield. It may well happen that you sort the ingestion id to a given author just to get notified later on that all papers from a Mr. Lee with id x actually are another Lee than the one they were sorted to. Errors happen and one could probably easier resolve them that way. Similarly, it may happen that one joins ids cause people e.g. didn't understand ORCiD correctly and registered several of them.

At this point, IMHO, I think the relation table should become the master source of information to declare what record is linked with what other record and the original IDs should no longer be looked at for the regular internal tasks.

I'm not sure if the relation table is a good source as master. From a cataloguers point of view you may need to fix some errors and as a cataloguer you live on the bibliographic and authority records. My feeling is that the bibliographic record and the authorities should be the master, as those can be exposed easily to well trained humans for curration.

Affiliations Inspire uses repeatable $u and string matching. In HEP this may work as only a limited number of institutions are in the game. In general one can easily run out of unique names.

Exactly, that's why the proposal to use UUID rather than strings.

I really like your idea aof uuids. And especially the nice little point that they are not record ids. :smile:

Therefore a model similar to authors linking via an unique id to an authority record is preferable. In case of affiliation to people linking within record level I suggest to take inner-record linking into account. Ie. linking from one datafield to another in the very same record. join2 uses this to link affiliations between 1001/7001 and 9101. Each 9101 is of the logical quality of your authors usecase (containing Id and name forms for an institution) while the linking from 1001/7001 -> 9101_ is done by means of subfiled $b (author number in the list). This also handles the case where not all authors have a unique id. I think, that this model would constitute one of the most general use cases. Can you provide with a concrete example to make it more clear? I fear linking intra-fields within a record is not supported in recjson, but since we are using JSON and we are not restricted to the 2-level restriction of MARC, we are free to nest more information if needed.

IMHO your felt 2-level-restriction is not really existent if you take the whole thing. Basically, Marc allows you to link datafields within a given record. (In your jsonish view this is something like a field having a list of dicts as subfields.)

If you e.g. check out a record like this:

https://bib-pubdb1.desy.de/record/191789

we have (the indicators 1_ just specify in which order we write the texts in $a, that's why we don't use __):

1001_ $0P:(DE-H253)PIP1002250
      $aKöhler, Martin
      $b0
      $eCorresponding Author
      $udesy

9101_ $0I:(DE-588b)2008985-5
      $6P:(DE-H253)PIP1002250
      $aDeutsches Elektronen-Synchrotron
      $b0
      $kDESY

This example shows an identified author. You can see his id in $0, he is first author ($b == 0), he is corresponding author ($e), and the $u basically redundant, we just add it due to our Inspire-like history.

If you check 9101_ you see again a $b0 which says, this datafield belongs to a 100 or 700 field with $b thus actually being the linking field. Ie. 1001_b <-> 9101_b. If you want to see it in Inspire speak, our 9101_ is the $u in Inspire. Just quite a bit more structure and richer in contents, and also fed and linked to an authority. 9101_ is also repeatable in case you have more than one affiliation. I agree with you that currently there's only manual bibedit as GUI for handling this in Invenio and that this can be prone to errors due to the potential complexity involved.

Now, one might think that we could also link 1001_$0 <-> 9101_6. This is correct in this case, as we know the author and have an authority record for him. But: we don't have authorities for everyone plus we also have a "pseudoauthority" for "external authors" (authors that are not belonging to our institution). This is a detail needed for our bean counting. Basically it distinguishes between "I know this guy is not from DESY" and "I don't know if she is from DESY or not". (Probably, the next step in our workflow knows it.) This $6$b redundancy is somewhat similar to the one propose above for ingestion ids rewritten to other ids.

BTW: We follow here a logic proposed by LoC for Canadian libraries concerning field values (though they originally thought of bilingual cataloguing)

http://www.loc.gov/marc/bibliographic/bd9xx.html

So instead of linking the corresponding 9101 of your case, one would include a copy of it within the tree structure of the 1001. (For the non MARC users, I fear this sounds totally unintelligible :smile:)

Well, not to many cataloguers who don't use Marc these days. ;)

But if I understand you correctly, your internal stuff can be exposed directly for cataloguers as inner record linking, which then gives a nice interface to whatever fancy things happen behind the scenes.

kaplun commented 9 years ago

If I manage to convince @aw-bib it means that the model is robust enough... :smile: Let's see...

I understood your author sample as based on the automatic author clustering done in Inspire. I just wanted to note, that automatic author clustering will not work for any other institution than Inspire. Probably, I mixed up the authorid stuff?

The goal with this proposal is to satisfy both contexts, the institutional one and the community one (a la HEP). From a technical point of view within the model, the only difference between a link to an authority record done in authoritative way, and an automatic guess is going to be in the value written in the confidence column. In institutional context this value will very likely be always 1. In the guessing model it's going to be a value between 0 and 1. To guess the value is completely an optional aspect of the proposed model.

Probably, uuid was miss leading. It does not refer to a persons authority record id, right? It can be something temporary/internally, right?

Exactly, minus I think the uuid should not be temporary rather a permanent identifier that is supposed to travel with the specific field (e.g. even if a typo is corrected in a field, the uuid should be preserved, if two records are merged UUIDs will have to be merged alongside - what this mean is actually subject to further discussion).

I understand that in case you get 10 papers from a Joe Doe without any external ids your proposal is to generate 10 different uuids unless some other process, whatever that is says "ok, for records 5,6,7 and 9 the uuid associated with string Joe Doe should be 12345, drop the temporary ones".

Exactly, but for the fact that I don't propose to drop the UUID, rather to write in the table that the 4 uuids for the Joe Doe of paper 5, 6, 7, 9 is associated with the corresponding Joe Doe record (with a given confidence level - 1 in case of institutional archive and no guessing).

The relation would then be reconstructed (with the UUIDs being replaced by more canonical IDs) upon indexing or upon exporting the data (e.g. via OAI-PMH or other APIs)

I'm not sure if the relation table is a good source as master. From a catalogers point of view you may need to fix some errors and as a cataloger you live on the bibliographic and authority records. My feeling is that the bibliographic record and the authorities should be the master, as those can be exposed easily to well trained humans for curation.

I see your concern. But that can be resolved by providing the right and usable tools to the cataloger to handle this piece of information. This proposal tries to bring under a same pattern the various facets of linking. If well designed it should be possible to develop only once a generic type of widget that would let the cataloger manipulate the relations in a consistent and predictable way. Imagine an interface that can let the cataloger edit the different fields of a JSON-based record, and that - when it comes to a field having an external relation - would present a widget where the cataloger could exploit auto-completion to an authority collection, and popping up the creation of an authority record in case of need...

The cataloger does not need to know the relation table, the important thing is that the various tools synchronize on its usage.

Well, not to many catalogers who don't use Marc these days. ;)

I have to defend the species of Homo Developer that is fresh from having studied Databases and Data Structures at university and starts to loose hairs when he/she deals with MARC :smile:.

But if I understand you correctly, your internal stuff can be exposed directly for catalogers as inner record linking, which then gives a nice interface to whatever fancy things happen behind the scenes.

Exactly, if we manage to design the thing properly, the cataloger should not be exposed to the UUID but should simply think of the actual action of linking records.

aw-bib commented 9 years ago

If I manage to convince @aw-bib it means that the model is robust enough... :smile: Let's see...

Your chances are quite good as I like your approach and also liked your presentation at the dev forum.

[...]

The goal with this proposal is to satisfy both contexts, the institutional one and the community one (a la HEP).

Understood. I think I sorted authorid mentioned in the wrong place. If the authorid refers to some sort of authority record ("author pages") all seems well.

From a technical point of view within the model, the only difference between a link to an authority record done in authoritative way

Understood this point and even like the confidence level. It was really that I attached authorid to this turtle/rabbit-thingies that only work in Inspire like contexts.

Probably, uuid was miss leading. It does not refer to a persons authority record id, right? It can be something temporary/internally, right? Exactly, minus I think the uuid should not be temporary rather a permanent identifier that is supposed to travel with the specific field

Ok. One can do that of course.

(e.g. even if a typo is corrected in a field, the uuid should be preserved, if two records are merged UUIDs will have to be merged alongside - what this mean is actually subject to further discussion).

Merging is an issue, sure. A valid approach in my marcish world is collecting them in 0247_ of the authors authority record. One might consider another field probably as if I get your approach correctly one will end up with many ids. (Each paper ingested without one will generate an new id, right?)

I understand that in case you get 10 papers from a Joe Doe without any external ids your proposal is to generate 10 different uuids unless some other process, whatever that is says "ok, for records 5,6,7 and 9 the uuid associated with string Joe Doe should be 12345, drop the temporary ones". Exactly, but for the fact that I don't propose to drop the UUID, rather to write in the table that the 4 uuids for the Joe Doe of paper 5, 6, 7, 9 is associated with the corresponding Joe Doe record (with a given confidence level - 1 in case of institutional archive and no guessing).

One can do that. However, it will store quite a lot of UUIDs. I'm not thinking about the machine here but the bio processor in front of it, usually called "cataloguer" who has to cope with them, as far as I understand it.

The relation would then be reconstructed (with the UUIDs being replaced by more canonical IDs) upon indexing or upon exporting the data (e.g. via OAI-PMH or other APIs)

Hm, I think this canonical ID should really get stored in it's $0.

I'm not sure if the relation table is a good source as master. From a catalogers point of view you may need to fix some errors and as a cataloger you live on the bibliographic and authority records. My feeling is that the bibliographic record and the authorities should be the master, as those can be exposed easily to well trained humans for curation. I see your concern. But that can be resolved by providing the right and usable tools to the cataloger

I have the strong feeling that this right tool should be bibedit and that bibedit should work well with those curration tasks. At least I as a cataloguer (and I did a "bit" of cataloguing myself) would want to have it there and only there.

to handle this piece of information. This proposal tries to bring under a same pattern the various facets of linking.

Understood.

If well designed it should be possible to develop only once a generic type of widget that would let the cataloger manipulate the relations in a consistent and predictable way. Imagine an interface that can let the cataloger edit the different fields of a JSON-based record,

Just expose them in Marc and every cataloguer will be happy. (%s/Marc/<whatever inern format you use, your cataloguer will know it>/g)

and that - when it comes to a field having an external relation - would present a widget where the cataloger could exploit auto-completion to an authority collection, and popping up the creation of an authority record in case of need...

In principle I, agree. However, don't overestimate autocomplete if you have a cataloguer. These guys often really know what they're doing and if they have to wait for your autocomplete to point an click, I as a cataloguer would label it as PITA. :smile: I think a better GUI would be to allow keying in a value and have some hotkey that triggers lookup only if I need it at all. (If I curate a bunch of records from one author I just know her Id, believe it or not.)

The cataloger does not need to know the relation table, the important thing is that the various tools synchronize on its usage.

Perfectly agree. And she doesn't want yet another tool beyond the Marc that's open already in bibedit anyway.

Well, not to many catalogers who don't use Marc these days. ;) I have to defend the species of Homo Developer that is fresh from having studied Databases and Data Structures at university and starts to loose hairs when he/she deals with MARC :smile:.

I understand your point. But I fear if you're developing software for users the developers are just hm, to be polite, "not so relevant". Anyway, as a cataloguer I don't care at all how your API internally looks like. I don't see it anyway and I don't want to see it.

But if I understand you correctly, your internal stuff can be exposed directly for catalogers as inner record linking, which then gives a nice interface to whatever fancy things happen behind the scenes. Exactly, if we manage to design the thing properly, the cataloger should not be exposed to the UUID but should simply think of the actual action of linking records.

Form a cataloguers point of view I'm not sure that entirely hiding all UUIDs is clever as they can help me a lot in unification. If it's a meaningful ID and not just some machine thingy, I mean. Probably this point is the only one yet a bit unclear: what is this UUID meant to be at the end.

kaplun commented 9 years ago

From a technical point of view within the model, the only difference between a link to an authority record done in authoritative way

Understood this point and even like the confidence level. It was really that I attached authorid to this turtle/rabbit-thingies that only work in Inspire like contexts.

If we don't focus too much on authors, you'll see that guessing is today applied also to other attributes, such as references. So this is something that might still have a value in an institutional context too. Anyway we are not both understanding each other on this point :smile:

Merging is an issue, sure. A valid approach in my marcish world is collecting them in 0247_ of the authors authority record. One might consider another field probably as if I get your approach correctly one will end up with many ids. (Each paper ingested without one will generate an new id, right?)

OK let's reclarify: the proposed UUIDs are going to be automatically created and added next to every field in a record that might point/is pointing to another record. The UUID has no meaning, until there is going to be an entry in the above mentioned table which says that the UUID in record X links to record Y. The UUIDs are not going to be stored on the authority record, and they are not be considered as identifiers in the authority record. So probably, in case of merging of two duplicate records, that have e.g. both the same authors (but each with different UUIDs, because UUIDs are universally unique) one of each author should be simply removed from the merged record and the corresponding UUID will disappear from the final record. At that point some garbage collection should be triggered to correctly update the relation table.

One can do that. However, it will store quite a lot of UUIDs. I'm not thinking about the machine here but the bio processor in front of it, usually called "cataloguer" who has to cope with them, as far as I understand it.

So hopefully cataloguer should not see UUIDs and we are happy. It all depends on which interfaces the cataloguer will have in front, of course.

Just expose them in Marc and every cataloguer will be happy. (%s/Marc//g)

This model should work beyond MARC, E.g. in case MARC is not the master format. It all depends on what is going to be the master format of a given Invenio installation. Ideally we should have a web-based tool for the cataloguer to manipulate all the fields described in the recjson configuration, and if these map one-to-one with MARC then the tools should behave as in BibEdit. Mmh, this is really vage and subject to another RFC.

However, don't overestimate autocomplete if you have a cataloguer. These guys often really know what they're doing and if they have to wait for your autocomplete to point an click, I as a cataloguer would label it as PITA. I think a better GUI would be to allow keying in a value and have some hotkey that triggers lookup only if I need it at all. (If I curate a bunch of records from one author I just know her Id, believe it or not.)

Don't understimate autocomplete :stuck_out_tongue_winking_eye: Have you noticed how Google nowadays autocomplete your searches? It doesn't force you to point and click.

Perfectly agree. And she doesn't want yet another tool beyond the Marc that's open already in bibedit anyway.

Yep, that's why I´d like to propose soon an RFC on a holistic editor that would let a cataloger edit all the fields that have been modeled in BibField configuration, with smart behaviors such as being aware of record links mentioned in this RFC.

I understand your point. But I fear if you're developing software for users the developers are just hm, to be polite, "not so relevant"

Given the developers are necessary for a service to exist, I believe the user should not be responsible for "designing" the code, rather to seek together with the developer the real needed functionalities and objectives (say make records retrievable, say being fast in editing all the properties of a record),

Anyway, as a cataloguer I don't care at all how your API internally looks like. I don't see it anyway and I don't want to see it.

Yep.

Form a cataloguers point of view I'm not sure that entirely hiding all UUIDs is clever as they can help me a lot in unification. If it's a meaningful ID and not just some machine thingy, I mean. Probably this point is the only one yet a bit unclear: what is this UUID meant to be at the end.

Yes this is the point that requires clarification. I hope I managed to explain better myself above. The UUIDs are meaningless and just a way to be able to create a link from a field within a record to another record. It's this link that carries the meaning. This can be made explicit in a cataloger record editor by letting the cataloguer build this link. If the given field (e.g. author) refers to an authority record through one of the authority record identifiers that's is part of the metadata and is going to be stored as all the other properties of a given field (the name of the author, etc.)

Let's make an example: The paper "On the foo and the bar" was authored by John Doe who is having an authority record at CERN (he's the user 1234)

{
  "title": "On the foo and the bar",
  "authors": [
    {
      "affiliation": "Atlantis Institute", 
      "first_name": "Doe", 
      "last_name": "John", 
      "cern_id": "1234"
    }
  ]
}

Upon ingestion of this record, the uploader attributes on the fly a UUID to each author field:

{
  "title": "On the foo and the bar",
  "authors": [
    {
      "affiliation": "Atlantis Institute", 
      "first_name": "Doe", 
      "last_name": "John", 
      "cern_id": "1234",
      "uuid": "9876-1234-6789-1234"
    }
  ]
}

If the authority record corresponding to John Doe does exist (say with ID 123), and we are not interested in using an automatic disambiguation process (because we always have IDs, right?), the uploading workflow (as you know in pu bibupload is being rewritten to be impmented as a configurable workflow), could take care of inserting the new relation in the above table. So say that the incoming record has ID 234,

uuid	from	link type	to	confidence
9876-1234-6789-1234	234	authority	123	1.0

That's it. Upon presenting the paper record, the formatting will query the table to transform the UUID into a link to 234, upon indexing the indexer will query the table to index the paper with metadata from the 234, upon editing the cataloger will be able to edit the original metadata (e.g. the "ccid" if it was wrong), and/or to pick a new referred authority record. Now I see from what I am writing that, if before the cataloger would have edited simply the ID pointing to the authority record now she would be expected to carry over two actions. But then again we could re-delegate to the uploader the action of rebuilding the relation between the UUID and the authority record, following the update the cataloger did on the regular ID. Alternatively we can say that, if a relation exists between an UUID and an authority record, then this relation is the master one and any additional IDs in the paper record (e.g. the above "ccid") should be ignored. Upon indexing/formatting/exporting the relation table should be used. The original "ccid" could simply be preserved as original data.

aw-bib commented 9 years ago

[...]

If we don't focus too much on authors,

Did you miss that we basically agree? ;)

BTW: I don't focus on authors at all. It's just a common example and it's a bit more complex than many others due to the n affiliation stuff.

you'll see that guessing is today applied also to other attributes, such as references. So this is something that might still have a value in an institutional context too.

I fear you're wrong here. Reference formatting is to specific for various communitites, I don't believe there will be any working solution (unless all refs are just a doi or arXiv-Id and drop this useless text clutter.) But that's another issue.

Merging is an issue, sure. A valid approach in my marcish world is collecting them in 0247_ of the authors authority record. One might consider another field probably as if I get your approach correctly one will end up with many ids. (Each paper ingested without one will generate an new id, right?)

OK let's reclarify:

Ok, so I understood it correctly.

One can do that. However, it will store quite a lot of UUIDs. I'm not thinking about the machine here but the bio processor in front of it, usually called "cataloguer" who has to cope with them, as far as I understand it.

So hopefully cataloguer should not see UUIDs and we are happy. It all depends on which interfaces the cataloguer will have in front, of course.

I believe, you trust to much in fancy stuff here. But that is a feeling. :wink:

Just expose them in Marc and every cataloguer will be happy. (%s/Marc//g) This model should work beyond MARC

Understood.

github/markdown stripped the replacement term in the %s stanza as I enclosed it in braces: it would have been "place your favourite internal format here". (%s/from/to/g: substitute everywhere from by to)

However, don't overestimate autocomplete if you have a cataloguer. Don't understimate autocomplete :stuck_out_tongue_winking_eye:

I see your point, but believe it or not if you do a lot of cataloguing it is better if you'll not have to wait for something fancy to happen all the time.

Have you noticed how Google nowadays autocomplete your searches?

I know it, but I usually don't use google. (Really, I rarely google.) What happens here anyway is that google suggests something. If it matches I type to fast for being able to use it or it's plainly wrong (in many cases) so I have to keep on typing anyway. So, if they don't have it I don't miss it.

It doesn't force you to point and click.

This is not the point. I type at some 400 chars/min. This is faster that any autocomplete, I fear. (At least I'm not fast enough to care about suggestions coming up.) So autocomplete is ok if it doesn't hinder me, but I often wish that it just keeps silent. The other point is: you don't get the same statistics as help like google. Anyway, I'd just separate a GUI issue and leave it here.

[...]

Yes this is the point that requires clarification. I hope I managed to explain better myself above. The UUIDs are meaningless and just a way to be able to create a link from a field within a record to another record. It's this link that carries the meaning. This can be made explicit in a cataloger record editor by letting the cataloguer build this link. If the given field (e.g. author) refers to an authority record through one of the authority record identifiers that's is part of the metadata and is going to be stored as all the other properties of a given field (the name of the author, etc.) Let's make an example:

I can follow your example

{
  "title": "On the foo and the bar",
  "authors": [
    {
      "affiliation": "Atlantis Institute", 
      "first_name": "Doe", 
      "last_name": "John", 
      "cern_id": "1234"
    }
  ]
}

Points I wonder about is:

shouldn't get (cern_id, fist_name, lastname) get an UUID
shouldn't get affiliation another UUID

But it could be that this is just a toy model for author. Anyway, it touches this innerrecord linking you were asking about. Basically, I think affiliation needs to be linked to another field. (In my sample the $b would play the role of this UUID though our $b is only unique within one record. Has the advantage that it's shorter and easier to type.)

If the authority record corresponding [...]

Ok. This is all nice. I have no problem with that so far.

That's it. Upon presenting the paper record, the formatting will query the table to transform the UUID into a link to 234, upon indexing the indexer will query the table to index the paper with metadata from the 234.

This is all nice and I think handles linking and authority indexing and stuff cleanly.

upon editing the cataloger will be able to edit the original metadata (e.g. the "ccid" if it was wrong), and/or to pick a new referred authority record.

Here's the only point I wonder. Would you see the UUID here?

Now I see from what I am writing that, if before the cataloger would have edited simply the ID pointing to the authority record now she would be expected to carry over two actions.

Right.

And if she doesn't see the intermediary UUID stuff she's flying partly blind. If you only present me a name, I'm lost. I need more info about this Joe Doe (could be an external source like a phone call: this is my paper) and about the various authority records of the Joe Does around. The guy who called me is the Joe Doe with ID 7890 I know this because in his authority record I see data identifying him. As a cataloguer I usually have to refer to the authority records to select the right one.

BTW: one GUI I know and that worked well for this is built by entering the term to look up, press a key which triggers a search, you select from the search results the full record (in it's marcish presentation) and if it's correct you have a key to come back and insert the links by ids into the original record. (PICA works that way.) Note that they really call up the authority records in their full internal format. Form my experience you need something like this to really disambiguate authors, institutes, keywords, whathaveyou.

[...]

Alternatively we can say that, if a relation exists between an UUID and an authority record, then this relation is the master one and any additional IDs in the paper record (e.g. the above "ccid") should be ignored.

I admit I do not like the idea of a hidden table to overwrite what I can see and curate. Probably, this is my problem here: if I base the game on Ids I can see and curate and use to fix relations I wonder what the UUID does if it's hidden. If I base it on the UUIDs they should in some way end up in my frontend (ie. records). Or more simply they need to be accessible without python/sql/a programmer.

In any way I think one needs a cataloguing access to the intermediary table and I think it should be built from and not the other way round.

kaplun commented 9 years ago

This is not the point. I type at some 400 chars/min. This is faster that any autocomplete, I fear. (At least I'm not fast enough to care about suggestions coming up.)

OK but I wouldn't call you the average cataloger :stuck_out_tongue:

But it could be that this is just a toy model for author.

Yes sorry was considering a very simple example not focusing on inner records. Will have to make a more complex one.

And if she doesn't see the intermediary UUID stuff she's flying partly blind. If you only present me a name, I'm lost.

Here comes in rescue the smart widget. If the UUID was resolved you would see here a brief representation of the authority record. If the UUID has to be resolved you would have here an input field with autocompletion where you would be able to type all sorts of metadata to match potential authority records and in the drop-down autocomplete list you would again see brief representation of potential authority records. (or the raw format if you prefer). Basically as the editor you describe:

BTW: one GUI I know and that worked well for this is built by entering the term to look up, press a key which triggers a search, you select from the search results the full record (in it's marcish presentation) and if it's correct you have a key to come back and insert the links by ids into the original record. (PICA works that way.) Note that they really call up the authority records in their full internal format.

So indeed yes, we would have to implement something as PICA.

I admit I do not like the idea of a hidden table to overwrite what I can see and curate. Probably, this is my problem here: if I base the game on Ids I can see and curate and use to fix relations I wonder what the UUID does if it's hidden. If I base it on the UUIDs they should in some way end up in my frontend (ie. records). Or more simply they need to be accessible without python/sql/a programmer.

Yes I am not sure on this point either. I am open to suggestions. An alternative is that the cataloger could have a quick way from the interface to automatically create the new relation after having modified a given ID. And viceversa, if she selects a given authority record (thus updating the UUID relation in the background), any ID in the field would be updated with the corresponding IDs from the authority record (e.g. automatically or by hotkeys, or button). WDYT?

The thing is that, beside this cons, I find this model having many pros from the point of view of the future design of Invenio, allowing us to build many more consistent functionalities.

kaplun commented 9 years ago

OK I think we kind of agree on the proposed data model, albeit outer layers (in particular WRT cataloging tools) need to be ironed out. @jirikuncar is there anyone in the core team who can help in implementing the SQL model and provide guidance on how to integrate with BibField?

aw-bib commented 9 years ago

@kaplun any thoughts about the inner record stuff mentioned yet? I think it could be done in the same model. (No reason why the linking field's id should not be a uuid, except probably typing.)

kaplun commented 9 years ago

@aw-bib you mean the linking between fields within the same record? That I think is better made by exploiting that JSON let you nest trees within trees...

jirikuncar commented 9 years ago

@kaplun see the initial proposal. When we agree upon:

[ ] module name
[ ] table name
[ ] basic api requirements and naming

I will create WIP pull-request from the branch.

cc @tiborsimko @lnielsen-cern @glouppe @egabancho

MSusik commented 9 years ago

The work has been moved to #2719 . Please also move the discussion there.

jirikuncar commented 9 years ago

Please also move the discussion there.

General discussion should stay here. In PR please discuss only code issues. Thanks

MSusik commented 9 years ago

Every record in Invenio is a node in a relationship graph (obviously many of them might be isolated). Shouldn't the interface of Node class be implemented directly inside Record class?

MSusik commented 9 years ago

Following yesterday's discussion, here are my suggestions:

Neo4j seems to provide us with all the features that we need. The only problem I see are the persistent identifiers. Every node/edge in neo4j graph has a unique id, but this id can be reused by another entity after the given node/edge is deleted. It means we'll have to watch out for any changes and update information in record. In case of using Neo4j, the simpliest approach seems to be using py2neo and flask-neo4j
There are a lot of open source alternatives to Neo4j, and if we want to go this direction, it is worth to discuss all/some of them. Note that Neo4j is the most popular solution. Here are few alternatives that I consider the most interesting ones:
- ArangoDB - this one I find particularly interesting as it perfectly suits our simpliest use-case. It is a mix of NOSQL and graph approach. It might be worth to represent whole records using this engine.
- TinkerPop - it's actually not an engine, but a standard with own definition of graph model, dataflow framework and query language. You can check here supported engines. It should be easy to create a solution where the admin of the site decides which engine to use, and we only use TinkerPop API to call it.
- Titan - various storage backend, quite popular, comparable performance to neo4j and support for elasticsearch (whatever it means).

There are a lof of engines comparisions (performance-wise and featurewise) available, here I list few of them: http://ups.savba.sk/~marek/papers/gdm12-ciglan.pdf http://www.stingergraph.com/data/uploads/papers/ppaa2014.pdf http://euranova.eu/upl_docs/publications/an-empirical-comparison-of-graph-databases.pdf https://docs.google.com/spreadsheet/ccc?key=0AlHPKx74VyC5dERyMHlLQ2lMY3dFQS1JRExYQUNhdVE#gid=0 http://en.wikipedia.org/wiki/Graph_database

One of the arguments raised against our own implementation was that it might be harder to link between different data structures ("models"). As an example the link between record and document was mentioned. What are the different data structures that might use links? How will we represent document?

glouppe commented 9 years ago

One of the arguments raised against our own implementation was that it might be harder to link between different data structures ("models"). As an example the link between record and document was mentioned. What are the different data structures that might use links? How will we represent document?

I was not there yesterday, but for me this also raises the issue: what are the primary citizen objects in Invenio? Do we want to work under the assumption that everything is a record, and therefore that relations should only exists between records? Or do we assume a framework using a more abstract object representation, eg. SQLAlchemy models, in which case relations would happen only between these objects? Or do we want something else?

Regarding neo4j, what would be the advantages of using an external engine for relations rather than having them as SQLAlchemy objects? I am not concinved the added complexity is worth it.

In any case, I believe it is time for all of us to finally agree on these core design decisions if we want to move on.

jirikuncar commented 9 years ago

Short Summary

Features

[ ] multiple linkable objects and relationships (Record to Record, Record to Document, Document to Document, ...)
[ ] simple API (e.g Relationship.create(Record.get(1), 'citedby', Record.get(2), confidence=0.7, **kwargs))
[ ] helper for calculated JSONAlchemy fields based on relationship type
[ ] integration with forms - order, relation identifier

Implementation

[ ] start with designing clean API
[ ] prepare dummy "in memory" implementation to pass tests
[ ] check possible engines - SQLAlchemy, Neo4J, ...

MSusik commented 9 years ago

multiple linkable objects and relationships (Record to Record, Record to Document, Document to Document, ...)

Does it mean that I should handle every namespace and it will cover every case?

check possible engines - SQLAlchemy, Neo4J, ...

Given how the implementation process looks like in Invenio, I can assume that for a long time we will have only one engine available. Can we agree which one to use? I can prepare some comparision for next monday meeting if you want.

jirikuncar commented 9 years ago

Can we agree which one to use?

Start with simple python dict (in memory) to finalize API.

tiborsimko commented 9 years ago

Here is promised summary of my comments, most of which we discussed IRL on Monday.

WRT use cases, there are some more, e.g. the linking in Periodicals or among Institutes, e.g. several other kinds of inter-record linking in CERN Open Data portal, such as "constituent unit entry", tag 774. See open data JSON MARC configuration.

WRT proposed table, we may need an extra column storing properties of the relation. Say record R1 is a photo that was used as B/W on page 7 of record R2, and as RGB on page 2 of record R3. Where to store this information? It logically belongs with the relation, not with the records themselves. Hence perhaps a new PostgreSQL JSON column representing abstract key-value properties store, where people can store anything they need to describe the properties of the relationship. (Kind of like "more info" store in invenio/master.)

(WRT various types of relationships, looking at all the various subfields existing in the MARC relational tags shows that this property store can become rather big. Hence perhaps a question whether an abstract column would be enough or whether we need some further structure? I guess a generic JSON column might be enough, since it is usually fast to search there, however the plethora of relationship properties may require some deeper thoughts.)

WRT records vs fields as first-class linkable citizens, wouldn't it be nice to use UUIDs for almost everything, and allow linking on any part of JSON nested level. (See @aw-bib's example of inter-field linking inside the same MARC record.) Kind of decorate parts of JSON in a LD style, meaning "factor this tree out for me into a new record in another JSON store and link them together". This would be great for low-level system users.

(If we make also fields linkable, then this mechanism could be used for poor-man's authority record control, e.g. field representing "arXiv subject category" has only ~10 values (hep-th, hep-ph, etc) and so could just as well live as a link to this "JSON based knowledge base", so to speak. When do we use "by-reference linking" and where do we stop and copy values and use "multi-edit" to manage the linking information?)

WRT possibly representing relation information in record's "master format" data, we have several use cases: (1) developers or system librarians are working with JSON all the way as much as possible; (2) cataloguers and curators might use tools the developers write, or might prefer to work with "native aster format" such as MARC or UNIMARC or EAD that is in use on the given concrete Invenio installation; (3) general public will be served any kind of "serialised version" of the record, in read-only manner, so this is less critical. My concern is about the second category, where we may need to offer read-write like access to non-JSON formats.

E.g. imagine a site receiving MARC authority records from a third party; if they want to enrich the information, and send it back, and if they keep receiving and sending updates to these records, both from third party and from authors, say, then do we want to represent the relationship information only in the JSON or also in this "editable native format" that cataloguers are working with? If we speak about an Invenio installation that runs with maybe half an IT person and five cataloguer persons, using some MS Windows EAD editor or something, and constantly receiving MARC authority records from the National Library, the answer would be yes. Hence, if we represent linked information in these "native editable formats", we should make things really lossless, which can go wild. Would cataloguers agree to see and edit UUIDs? Would we hide them semi-transparently somehow?

IOW, would we aim at lossless representation of the relationship information also in the "native editable format" the concrete site may use (or not), which can get quite complex just looking at generating MARC out of the relationship property store? And if we do, would we want to represent it in a nice editable manner, not exposing internal UUIDs too much?

WRT using a graph database, this should be definitely seriously considered...

[raw brain dump]

aw-bib commented 9 years ago

@tiborsimko concerning cataloguers to use UUID:

I think they will accept them easily
but they should not be 40 digits hex codes like git commit ids (IMHO this is not necessary in this usecase)

A real world example where cataloguers handle UUIDs in their native cataloguing interface (bibedit in inveniospeak) would e.g. be GBV (the largest union catalogue in Germany; same is true for all other PICA installations) where this is in production for [place a long time her].

What's done, however, is that the editor offers a simple lookup interface to retrieve the UUID easily as outlined above. Also their UUIDs look more like this: 81250898X. IOW: just don't make them to wild ;-) Having source of ids in some other field to make them exchangeable is usually not a problem as well. (0247_ $2 $a thingies in Marc.) That allows even for quite "lenghty" UUIDs as the final id is formed like strcat($2, $a). In Chris paper guys like author:(DE-588b)141792906 ar of this form.

With UUIDs like this cataloguers to my experience are even happy exchange them phone. It was common practice at my job that I got a number like this to refer to a dataset. Note also that their bibedit takes care of filling in some fields automagically once the UUID is in place. This helps consistency.

MSusik commented 9 years ago

@tiborsimko Thank you for your input!

WRT proposed table, my PR now includes a new column.

WRT records vs fields as first-class linkable citizens. This might drag us into more relational-like approach like the one that is used in master. I'm not sure if we want it. The possibility of linking with a single field/subfield of a record is tempting, but then storing data in JSON seems not to be the right choice. If we started with storing everything in graph db, then it would be much easier to implement and maintain. Still:

"factor this tree out for me into a new record in another JSON store and link them together"

As long as it does not create a physical new record, it is a possibility worth considering.

WRT using a graph database I would suggest discussing it on monday meeting.

kaplun commented 9 years ago

After several discussions, parallel meeting, forums... :smile: we came to the current conclusion that relations are going to be implemented in the most naïf way, i.e. by simply storing in a given subfield within the JSON representation of a record (see #2854) the identifier of an other record.

E.g. paper author:

{
    "recid": 123,
    "authors": [{
        "fullname": "Doe, John",
        "recid": 234}]
}

Resolving the external record has to be implemented on a case by case. E.g. elasticsearch will have dedicated enhancer that will borrow subset of pointed records into the linking one to expand the indexing. Jinja will follow links (possibly by instantiating records on the fly?) and caching them.

When a given record is modified a signal will be triggered with the patch of the modification. Record pointing to the modified record will be consequently amended (if necesseray) by following the signal. This should better be detailed in a new RFC.

It is assumed that simple queries related to relations are going to be performed directly through elasticsearch (see #3232). More complex ones will bind into JSONB and depend on PostgreSQL usage.

aw-bib commented 9 years ago

@kaplun is recid refering to the record id of the instance in question?

This will run into trouble once linked records get interchanged between several instances. (Something we do all the time at join2.)

In your author's use case I may have "Doe, John" as 234 on instance a while it is 12345 on instance b while something more abstract like his orcid would stay the same across instances (as does John Doe).

kaplun commented 9 years ago

In the big picture of things you should not imagine you are going to expose exactly the low-level JSON you store inside Invenio as such. When exposing records via API, when sending them to elasticsearch etc. ideally there is going to be some transformation process that will enrich the exposed records (e.g. by borrowing parts of the referred records inline into the referring one)

inveniosoftware / invenio

RFC Centralizing record linking #2573

Current model in Invenio master

Citations

Authors

Affiliation

Photos/Album

Other links

Issues with the current model

Proposal

Side note

Short Summary

Features

Implementation