inspirehep / hepcrawl

Scrapy project for feeds into INSPIRE-HEP
http://inspirehep.net

arXiv spider: collaborations #251

Closed ksachs closed 5 years ago

ksachs commented 5 years ago

Problem

In arXiv metadata there is no field for collaborations; they are mixed into the author name and affiliation fields. <author><keyname>:</keyname></author> comes in at least 2 varieties:

Current Behavior

The last variety is not parsed properly: all authors before the ':' end up in the collaboration fields.

Expected Behavior

In addition to parsing authors and collaborations properly it would be nice to

Status of https://github.com/inspirehep/hepcrawl/pull/250/commits/f8e36800fd130b665795eac54afc15c1b9041d58

To add a warning to the cataloger and tag problematic records:

hepcrawl/tohep.py is changed along the lines of inspire-next/inspirehep/modules/literaturesuggest/tasks.py, which uses private_notes.
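A minimal sketch of what such a warning could look like. This is not the code from the commit above: the record is a plain dict, and the `_private_notes` key, the `source`/`value` layout, and the function name `add_cataloger_warning` are all assumptions made for illustration.

```python
def add_cataloger_warning(record, message, source="arXiv"):
    """Attach a warning for the cataloger to a record dict.

    Assumption: notes live in a list under "_private_notes", each note
    being a dict with "source" and "value" keys. The real hepcrawl /
    inspire-next layout may differ.
    """
    notes = record.setdefault("_private_notes", [])
    note = {"source": source, "value": message}
    # Avoid stacking identical warnings if the record is processed twice.
    if note not in notes:
        notes.append(note)
    return record


record = {"titles": [{"title": "Some arXiv record"}]}
add_cataloger_warning(record, "Check authors and collaborations: ':' found in author list")
```

Keeping the warning in a private note, rather than in the author or collaboration fields, means a wrong guess by the parser does not pollute the public metadata.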

Example 1

http://export.arxiv.org/oai2?verb=GetRecord&identifier=oai:arXiv.org:1808.04927&metadataPrefix=arXiv

<author>
  <keyname>Zwaska</keyname>
  <forenames>R.</forenames>
  <affiliation>The NA61/SHINE Collaboration</affiliation>
</author>
<author>
  <keyname>Group</keyname>
  <forenames>The T2K Beam</forenames>
  <affiliation>The NA61/SHINE Collaboration</affiliation>
</author>
<author>
  <keyname>:</keyname>
  <affiliation>The NA61/SHINE Collaboration</affiliation>
</author>
<author>
  <keyname>Berns</keyname>
  <forenames>L.</forenames>
  <affiliation>The NA61/SHINE Collaboration</affiliation>
</author> 
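One possible way to handle this fragment, as a rough sketch only. It assumes that entries before the ':' are collaborations only if their name contains a word from a hypothetical keyword list (`GROUP_WORDS`); anything else before the ':' is kept as an author and the record is flagged for review. The function names and the keyword list are illustrative and not part of hepcrawl.

```python
import xml.etree.ElementTree as ET

# Copy of the Example 1 entries, wrapped in a root element so it parses.
ARXIV_AUTHORS = """
<authors>
  <author><keyname>Zwaska</keyname><forenames>R.</forenames>
    <affiliation>The NA61/SHINE Collaboration</affiliation></author>
  <author><keyname>Group</keyname><forenames>The T2K Beam</forenames>
    <affiliation>The NA61/SHINE Collaboration</affiliation></author>
  <author><keyname>:</keyname>
    <affiliation>The NA61/SHINE Collaboration</affiliation></author>
  <author><keyname>Berns</keyname><forenames>L.</forenames>
    <affiliation>The NA61/SHINE Collaboration</affiliation></author>
</authors>
"""

# Hypothetical keyword list; a real one would need tuning.
GROUP_WORDS = ("collaboration", "collaborations", "group", "consortium")


def is_group(text):
    """True if any whitespace-separated word of text is a group keyword."""
    return any(word in GROUP_WORDS for word in text.lower().split())


def full_name(author):
    """Join forenames and keyname, skipping whichever is missing."""
    parts = [author.findtext("forenames", ""), author.findtext("keyname", "")]
    return " ".join(p.strip() for p in parts if p.strip())


def split_authors(xml_text):
    """Return (authors, collaborations, needs_review) for an <authors> block."""
    entries = ET.fromstring(xml_text).findall("author")
    names = [full_name(a) for a in entries]
    authors, collaborations, needs_review = [], [], False
    if ":" in names:
        cut = names.index(":")
        for name in names[:cut]:
            if is_group(name):
                collaborations.append(name)
            else:
                # A person listed before the ':'; keep as author, flag it.
                authors.append(name)
                needs_review = True
        authors.extend(names[cut + 1:])
    else:
        authors = names
    # Collaborations hidden in the affiliation field.
    for entry in entries:
        for aff in entry.findall("affiliation"):
            text = (aff.text or "").strip()
            if is_group(text) and text not in collaborations:
                collaborations.append(text)
    return authors, collaborations, needs_review


print(split_authors(ARXIV_AUTHORS))
# (['R. Zwaska', 'L. Berns'], ['The T2K Beam Group', 'The NA61/SHINE Collaboration'], True)
```

With this heuristic "R. Zwaska" stays an author instead of ending up in the collaboration fields, and the `needs_review` flag gives the cataloger a hint that the record should be checked by hand.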

Example 2

http://export.arxiv.org/oai2?verb=GetRecord&identifier=oai:arXiv.org:1607.01177&metadataPrefix=arXiv

<authors>
  <author>
    <keyname>Bay</keyname>
    <forenames>Daya</forenames>
  </author>
  <author>
    <keyname>Collaborations</keyname>
    <forenames>MINOS</forenames>
  </author>
  <author>
    <keyname>:</keyname>
  </author>
  <author>
    <keyname>Adamson</keyname>
    <forenames>P.</forenames>
  </author>