Problem
arXiv metadata has no field for collaborations; they are mixed into the author name and affiliation fields.
<author><keyname>:</keyname></author> comes in at least 2 varieties:
- everything before is a list of collaborations (if there are no affiliations)
- the 'author' before is a collaboration (if there are affiliations)
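The two varieties can be told apart with a small check on affiliations. The following is a minimal sketch, not the hepcrawl implementation: the function name `split_collaborations`, the namespace-free `<authors>` input, and the rule "any affiliation anywhere in the author list means the second variety" are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

MARKER = ":"  # keyname of the separator entry described above


def _full_name(keyname, forenames):
    """Join forenames and keyname, skipping empty parts."""
    return " ".join(part for part in (forenames, keyname) if part)


def split_collaborations(authors_xml):
    """Return (collaborations, authors) for a simplified <authors> element.

    Hypothetical helper: assumes namespace-free XML and that the presence of
    any <affiliation> selects the second variety.
    """
    root = ET.fromstring(authors_xml)
    entries = []
    for node in root.findall("author"):
        keyname = (node.findtext("keyname") or "").strip()
        forenames = (node.findtext("forenames") or "").strip()
        has_affiliation = node.find("affiliation") is not None
        entries.append((keyname, forenames, has_affiliation))

    # Position of the first ':' entry, if any.
    positions = [i for i, e in enumerate(entries) if e[0] == MARKER and not e[1]]
    if not positions:
        return [], [_full_name(e[0], e[1]) for e in entries]

    pos = positions[0]
    before, after = entries[:pos], entries[pos + 1:]
    if any(e[2] for e in entries):
        # Variety 2: only the entry right before ':' is a collaboration.
        collaborations, authors = before[-1:], before[:-1] + after
    else:
        # Variety 1: everything before ':' is a collaboration.
        collaborations, authors = before, after

    return (
        [_full_name(e[0], e[1]) for e in collaborations],
        [_full_name(e[0], e[1]) for e in authors],
    )
```

Real arXiv OAI records are namespaced, so a production parser would need to account for that; the sketch only illustrates the branching between the two varieties.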
Current Behavior
The second variety is not parsed properly: all authors before the ':' end up in the collaboration fields.
Expected Behavior
In addition to parsing authors and collaborations properly, it would be nice to:
- add a warning to the cataloger, i.e. as private_note
- tag records having <author><keyname>:</keyname></author> in the arXiv metadata, to check them later
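A rough sketch of the warning and tagging step might look like the following. Everything here is hypothetical: the function `flag_colon_author`, the plain-dict record, and the `private_notes` and `tags` keys are placeholders for illustration and do not reflect the actual hepcrawl item fields or INSPIRE schema.

```python
import re

# Matches the ':' separator entry in the raw arXiv metadata.
MARKER_RE = re.compile(r"<author>\s*<keyname>\s*:\s*</keyname>\s*</author>")

# Hypothetical wording and tag name, chosen only for this sketch.
WARNING = "arXiv author list contains a ':' entry; check authors and collaborations."
TAG = "arxiv-colon-author"


def flag_colon_author(record, raw_metadata):
    """Add a cataloger warning and a tag if the ':' entry is present.

    `record` is a plain dict standing in for the harvested record.
    Returns True when the record was flagged, False otherwise.
    """
    if not MARKER_RE.search(raw_metadata):
        return False

    # Warning for the cataloger, stored as a private note.
    notes = record.setdefault("private_notes", [])
    if {"value": WARNING} not in notes:
        notes.append({"value": WARNING})

    # Tag so that affected records can be found and checked later.
    tags = record.setdefault("tags", [])
    if TAG not in tags:
        tags.append(TAG)

    return True
```

Flagging on the raw metadata rather than on the parsed author list keeps the check independent of whether the parser handled the record correctly.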
Status of https://github.com/inspirehep/hepcrawl/pull/250/commits/f8e36800fd130b665795eac54afc15c1b9041d58
To add a warning to the cataloger and tag problematic records: items.py and tohep.py
hepcrawl/tohep.py is changed similarly to inspire-next/inspirehep/modules/literaturesuggest/tasks.py, which uses private_notes.
Example 1
http://export.arxiv.org/oai2?verb=GetRecord&identifier=oai:arXiv.org:1808.04927&metadataPrefix=arXiv
Example 2
http://export.arxiv.org/oai2?verb=GetRecord&identifier=oai:arXiv.org:1607.01177&metadataPrefix=arXiv