inspirehep / inspire-next

The INSPIRE repo.
https://inspirehep.net
GNU General Public License v3.0

Reference matching 101 #686

Open kaplun opened 8 years ago

kaplun commented 8 years ago

Disclaimer

Here I describe the current (and wished-for) state of the art for matching references against records. Ideally this issue should summarize all the tricks scattered around refextract/bibrank/bibformat, and surface all the unwritten rules.

Sources

References are extracted automatically by refextract heuristics and by Grobid machine learning. They are also provided in structured form by publishers.

Reference structure

A reference is composed of the following items

DOIs and arXiv IDs are equivalent in their uniqueness with respect to matching. When len(DOIs) + len(arXiv IDs) > 1, they should all match the same unique record. Otherwise a ticket should be raised for a cataloguer to check. In fact, the DOI of an erratum should be stored on the same record as the paper it corrects.
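The rule above can be sketched as follows. This is only an illustration, not INSPIRE code: the function name, the `index` mapping (normalized identifier to a set of record IDs) and the sample identifiers are all assumptions. Returning "no ticket" when nothing matches at all is also an assumption, since the text only covers the conflicting case.

```python
def match_by_identifiers(identifiers, index):
    """Return (recid, needs_ticket) for the DOIs and arXiv IDs of one reference.

    `index` is a hypothetical mapping: normalized identifier -> set of recids.
    """
    per_id = [index.get(identifier, frozenset()) for identifier in identifiers]
    if not per_id:
        return None, False
    all_recids = set().union(*per_id)
    # Every identifier must resolve to exactly one record, and it must be
    # the same record for all of them.
    if len(all_recids) == 1 and all(len(recids) == 1 for recids in per_id):
        return next(iter(all_recids)), False
    if not all_recids:
        # Assumption: nothing matched, so fall through to other strategies.
        return None, False
    # Conflicting or partial match: a cataloguer should check.
    return None, True


# Hypothetical index; the erratum DOI lives on the record of the paper it corrects.
index = {
    "10.1000/paper": {1},
    "10.1000/paper-erratum": {1},
    "arXiv:0000.00001": {1},
    "10.1000/other": {2},
}
print(match_by_identifiers(["10.1000/paper", "arXiv:0000.00001"], index))
print(match_by_identifiers(["10.1000/paper", "10.1000/other"], index))
```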

Note that both DOIs and arXiv IDs need some simple normalization both in the reference and in the referred record. In particular:

Report numbers are a weak way of matching a record. Ideally they should only be used as a last resort, and only if the matched record is unique.

Journal titles

These should be normalized to their shortest form. Currently this is done by checking against a knowledge base, but ideally we should merge it with the journal database and start maintaining all the information there.

Volume, issue, year, page ranges

These pieces of information could come as ranges. However, to perform a match, only the starting element of the range is necessary. For this reason, referred records should be indexed both with the full range (when available) and with the start element only (e.g. for volume 53-54 we would index the record under 53-54 and under 53. Absolutely not under 54!). When matching a reference, one would first try to match using the full range and, in case of no match, retry after stripping the second part of the range.
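A minimal sketch of this indexing and two-step lookup, assuming that a hyphen always separates the two ends of a range (which will not hold for every page or article-ID convention). The function names are hypothetical.

```python
def range_index_keys(value):
    """Keys under which a record is indexed: the full range, then its start."""
    value = value.strip()
    start = value.split("-", 1)[0].strip()
    keys = [value]
    if start and start != value:
        keys.append(start)
    return keys


def matches_range(ref_value, indexed_keys):
    """Try the full range first, then fall back to the start element only."""
    ref_value = ref_value.strip()
    if ref_value in indexed_keys:
        return True
    start = ref_value.split("-", 1)[0].strip()
    return start in indexed_keys


keys = set(range_index_keys("53-54"))
print(sorted(keys))
print(matches_range("53", keys), matches_range("54", keys))
```

The end of the range is never used as a key, so a reference to volume 54 alone does not match a record indexed as 53-54.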

Pubnotes

If not via DOIs and arXiv IDs, most records are matchable by a pubnote. A pubnote is in fact the minimal piece of information, among journal title, volume, year, issue and page, necessary to uniquely identify a paper.

It is proposed that in order to match a paper one would first:

Ideally, if DOI/arXiv already matched a unique document, then:

One can try to match a paper also via ISBN and e.g. the editor information, but YMMV.

kaplun commented 8 years ago

cc: @inspirehep/inspire-dir

tsgit commented 8 years ago

I note that journal lookup from a reference in 999C5s currently uses the journal index, which utilizes a KB for synonyms. The reverse, looking for a pubnote in 999C5s, does an exact string match on the standardized pubnote. This misses variations in abbreviated journal name, journal names with varying amounts of whitespace, alternative spellings of journal names, etc.

So "classic" bibrank is not symmetric between citer and citee, and this leads to citation flip-flop.

kaplun commented 8 years ago

What if, in Labs, we stop using pubnotes and instead store the components separately, always normalizing a journal title to its default name? Then the matching could be done bidirectionally...
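To make the idea concrete, here is a rough sketch of a component-based key that both the citing reference and the cited record would be reduced to. The synonym table stands in for the journal KB and is purely hypothetical, as are the function names.

```python
# Hypothetical stand-in for the journal knowledge base:
# loosely normalized spelling -> default journal name.
JOURNAL_SYNONYMS = {
    "lect notes phys": "Lect.Notes Phys.",
    "lecture notes in physics": "Lect.Notes Phys.",
}


def normalize_journal(title):
    """Map any known spelling of a journal title to its default name."""
    key = " ".join(title.replace(".", " ").lower().split())
    return JOURNAL_SYNONYMS.get(key, title.strip())


def pubnote_key(journal, volume, page_start):
    """Build the same key for a reference and for a record."""
    return (normalize_journal(journal), volume.strip(), page_start.strip())


from_reference = pubnote_key("Lect. Notes  Phys.", "871", "1")
from_record = pubnote_key("Lecture Notes in Physics", " 871", "1")
print(from_reference == from_record)
```

Because both sides go through the same normalization, the lookup no longer depends on which direction the match is computed from.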

fschwenn commented 8 years ago

As I said on stand-up I also started to write something (offline). Most of it is already covered by Sam's description - thanks a lot!

A few additions:

kaplun commented 8 years ago

As I said on stand-up I also started to write something (offline).

Ouch! Right! :flushed: Sorry! I needed to write down all the scattered concepts bubbling in my head and forgot about that.

I will incorporate your comments.

arxivIDs are not unique identifiers within INSPIRE! At least the way we run INSPIRE at the moment.

Are you sure this is not already enforced by BibUpload, which refuses to upload a record having the same arXiv ID as an existing record (via the 035 check)?

They are not surjective: resubmissions of preprints (instead of versioning) should be merged on INSPIRE. I know, it's against the rules of arXiv to do such a thing, but it happens.

This is OK: after the merge the record will have one DOI and 2 arXiv IDs, thus it can be matched by DOI + arXiv ID1 or DOI + arXiv ID2.

As Annette pointed out, some kind of normalization would help (i.e. s/\//-/g).

I assume just at indexing/search time, but we are going to preserve the reportnumber as such in metadata, right?
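A small sketch of what "normalize only at indexing/search time" could look like, using the s/\//-/g rule quoted above. The record layout, the function names and the sample report number are hypothetical; the stored metadata is left untouched.

```python
def reportnumber_search_key(reportnumber):
    """Key used only for indexing and searching: every '/' becomes '-'."""
    return reportnumber.strip().replace("/", "-")


def build_reportnumber_index(records):
    """records: hypothetical mapping recid -> list of report numbers as stored."""
    index = {}
    for recid, reportnumbers in records.items():
        for reportnumber in reportnumbers:
            index.setdefault(reportnumber_search_key(reportnumber), []).append(recid)
    return index


records = {42: ["XYZ-TH/2001/07"]}  # made-up report number
rn_index = build_reportnumber_index(records)
print(rn_index.get(reportnumber_search_key("XYZ-TH-2001/07")))
print(records[42])  # metadata still holds the original form
```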

For citation counting, a report number pointing to two records could count for both of them (report split into two records), but an Asana ticket should be created to check it.

:open_mouth: This is the first time I hear of a reference being allowed to match more than 1 record. That means we would need to amend the Schema to allow a list of recids per reference, rather than a single integer.

One could think of refextract moving a 999C5s of an erratum to the marc field of the original article which would be nice for the display - otherwise a 999C5 pointing to several records could not be displayed as a link to one record but would need to be a link to several records from which the user would need to decide which one was meant.

I am not sure I fully understand this last part. Can you give an example?

BTW, this is actually important: when we display a resolved reference (e.g. in the reference tab), we format the reference using metadata from the referred record. It would be nice to take into consideration here whether the reference was made through an erratum pubnote/DOI rather than the direct pubnote/DOI.

ksachs commented 8 years ago

only the starting element of the range is necessary

Not really true: a book starts on page 1, and so does its first chapter. We artificially introduced pp.1-123 to make page 1 unique. E.g.

001237271 773__ $$cpp.1-624$$pLect.Notes Phys.$$v871$$y2013
001204463 773__ $$c1-11$$pLect.Notes Phys.$$v871$$y2013

kaplun commented 8 years ago

Mmh. In your example, 1204463 is a book chapter of 1237271. In a corresponding reference, either we know the reference points to the Book Vs. the Chapter, or we then have to use the full range (when available). I am not sure the current convention of adding the pp. prefix in 773__c will still be useful in the future, though. Can we decide to remove this prefix, since the information is already present in the type of the record at hand? That would simplify matching...

I have amended the original issue following your note.

ksachs commented 8 years ago

either we know the reference points to the Book Vs. the Chapter

How??

001276834 999C5 $$01237271$$hD. E. Kharzeev, K. Landsteiner, A. Schmitt and H. -U. Yee$$o11$$sLect.Notes Phys.,871,1$$y2013

you only have the authors in addition to the PBN

kaplun commented 8 years ago

(PBN = PubNote?)

Yep, I assume we indeed never know it (from refextract). So in that case the only discriminant is the author list, or alternatively we raise an Asana ticket...
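As a rough illustration of using the author list as the discriminant, the sketch below picks the candidate whose authors overlap most with the reference, and asks for a ticket when there is no clear winner. The surname extraction is deliberately naive, and the candidate labels and their author sets are made up for the example; only the reference string comes from the 999C5 line quoted earlier in this thread.

```python
import re


def surnames(author_string):
    """Naive surname extraction: last token of each comma/'and'-separated name."""
    parts = re.split(r",|\band\b", author_string)
    return {part.split()[-1].lower() for part in parts if part.strip()}


def pick_by_authors(ref_authors, candidates):
    """candidates: hypothetical mapping label -> set of lowercase surnames.

    Returns (label, needs_ticket).
    """
    wanted = surnames(ref_authors)
    scores = {label: len(wanted & names) for label, names in candidates.items()}
    best = max(scores.values(), default=0)
    winners = [label for label, score in scores.items() if score == best]
    if best > 0 and len(winners) == 1:
        return winners[0], False
    return None, True  # ambiguous or no overlap: raise a ticket


ref = "D. E. Kharzeev, K. Landsteiner, A. Schmitt and H. -U. Yee"
candidates = {  # made-up author sets, for illustration only
    "book": {"kharzeev", "landsteiner", "schmitt", "yee"},
    "chapter": {"kharzeev"},
}
print(pick_by_authors(ref, candidates))
```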

fschwenn commented 8 years ago

One could think of refextract moving a 999C5s of an erratum to the marc field of the original article which would be nice for the display - otherwise a 999C5 pointing to several records could not be displayed as a link to one record but would need to be a link to several records from which the user would need to decide which one was meant.

I am not sure I fully understand this last part. Can you give an example?

https://inspirehep.net/record/339178/references

The first two references point to the original article and its erratum, i.e. to the same INSPIRE record. If they were combined in one marc field (or whatever structure), the system would know about it and could display it just once.

kaplun commented 8 years ago

@mihaibivol with your work you are indirectly implementing what is described here IMHO.