inspirehep / hepcrawl

Scrapy project for feeds into INSPIRE-HEP
http://inspirehep.net

Q: arXiv spider: collaborations #248

Closed ksachs closed 5 years ago

ksachs commented 6 years ago

Question

Can I raise a flag, create a ticket, or send an email if something looks fishy while a spider is processing an XML record? That is, chances are everything is OK, but a cataloger should have a closer look. I don't want to crash the spider.

Problem

<author><keyname>:</keyname></author> comes in at least 2 varieties; see Example 1 and Example 2 below.

I can fix the spider to deal with both cases.

But I have no idea whether there are (or will be) other cases. And it's impossible to spot the name of a collaboration amongst several hundred authors if it is misidentified as an author. Therefore I would like to get a warning for records with the author name ":".

Any other good idea is welcome. I even take bad ideas.
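
One possible shape for such a warning, as a minimal sketch rather than actual hepcrawl code: scan the author list for a keyname of ":" and log a warning instead of raising, so the record is still processed. The function names, the logger name, and the SUSPICIOUS_KEYNAMES set are hypothetical, and the sketch assumes XML namespaces have already been stripped, as in the excerpts below.

```python
import logging
import xml.etree.ElementTree as ET

# Hypothetical logger name; a real spider would use its own logger.
logger = logging.getLogger("arxiv_author_check")

# Assumption: only ":" is known to be suspicious so far; extend as new cases appear.
SUSPICIOUS_KEYNAMES = {":"}


def find_suspicious_authors(authors_xml):
    """Return 0-based positions of <author> entries with a suspicious <keyname>."""
    root = ET.fromstring(authors_xml)
    hits = []
    for position, author in enumerate(root.iter("author")):
        keyname = (author.findtext("keyname") or "").strip()
        if keyname in SUSPICIOUS_KEYNAMES:
            hits.append(position)
    return hits


def check_record(identifier, authors_xml):
    """Log a warning (never raise) when a record needs a cataloger's attention."""
    hits = find_suspicious_authors(authors_xml)
    if hits:
        logger.warning(
            "%s: author keyname ':' at position(s) %s; a collaboration name "
            "may have been parsed as authors",
            identifier,
            hits,
        )
    return bool(hits)
```

Calling check_record with the author list from Example 2 would return True and emit a single warning, while processing continues. Turning that warning into a ticket or an email would be a matter of attaching a suitable logging handler, which is left out here.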

Example 1

http://export.arxiv.org/oai2?verb=GetRecord&identifier=oai:arXiv.org:1808.04927&metadataPrefix=arXiv

<author>
  <keyname>Zwaska</keyname>
  <forenames>R.</forenames>
  <affiliation>The NA61/SHINE Collaboration</affiliation>
</author>
<author>
  <keyname>Group</keyname>
  <forenames>The T2K Beam</forenames>
  <affiliation>The NA61/SHINE Collaboration</affiliation>
</author>
<author>
  <keyname>:</keyname>
  <affiliation>The NA61/SHINE Collaboration</affiliation>
</author>
<author>
  <keyname>Berns</keyname>
  <forenames>L.</forenames>
  <affiliation>The NA61/SHINE Collaboration</affiliation>
</author> 

Example 2

http://export.arxiv.org/oai2?verb=GetRecord&identifier=oai:arXiv.org:1607.01177&metadataPrefix=arXiv

<authors>
  <author>
    <keyname>Bay</keyname>
    <forenames>Daya</forenames>
  </author>
  <author>
    <keyname>Collaborations</keyname>
    <forenames>MINOS</forenames>
  </author>
  <author>
    <keyname>:</keyname>
  </author>
  <author>
    <keyname>Adamson</keyname>
    <forenames>P.</forenames>
  </author>

ksachs commented 5 years ago

Replaced by https://github.com/inspirehep/hepcrawl/issues/251