arxiv: improve author name parsing ("and")

bnewbold commented 4 years ago

Our arxiv harvester receives author metadata as a single string, with individual author names separated by commas and "and".

Here is the function: https://github.com/internetarchive/fatcat/blob/master/python/fatcat_tools/importers/arxiv.py#L24
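
For illustration only (this is not the linked function), here is a minimal sketch of splitting such a string on commas and the word "and". The helper name `split_author_string` and the sample inputs are assumptions; real arXivRaw strings may contain patterns this does not cover.

```python
import re

# Separator is either a comma (optionally followed by "and"),
# or the standalone word "and" surrounded by whitespace.
_SEPARATOR = re.compile(r"\s*,\s*(?:and\s+)?|\s+and\s+")


def split_author_string(raw: str) -> list[str]:
    """Split a combined author string into individual names (hypothetical helper)."""
    # Collapse newlines and repeated spaces so that an "and" which
    # starts a wrapped line is still surrounded by plain whitespace.
    normalized = " ".join(raw.split())
    parts = _SEPARATOR.split(normalized)
    return [part.strip() for part in parts if part.strip()]
```

Requiring whitespace on both sides of "and" keeps surnames such as "Anderson" or "Sandoval" intact, and normalizing whitespace first means a line break before "and" does not hide the separator.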

In some cases, as discovered by Sawood, this doesn't work and all author names come through as a single string. For example:

https://fatcat.wiki/release/c5s6d7f7w5b3himgditfbiu5nq
https://fatcat.wiki/release/f7j4lf4aqfeqlaqfrtayt62rwe

Fixing this will include:

[ ] handling and testing these cases in the parsing function
[ ] sampling for additional author patterns which are not getting parsed correctly
[ ] running a cleanup task over existing release entities
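
As a rough starting point for the sampling and cleanup steps, one could flag contributor names that still look like several authors joined together. This is only a heuristic sketch; `looks_unsplit` is a made-up name, and the pattern will also flag legitimate "Last, First" style names, so hits would need manual review.

```python
import re

# Characters and words that usually separate multiple authors.
_SUSPECT = re.compile(r"[,;]|\s+and\s+|\s+&\s+")


def looks_unsplit(name: str) -> bool:
    """Return True if a single 'name' field may hold several authors (heuristic)."""
    return bool(_SUSPECT.search(name))


def sample_suspects(names: list[str]) -> list[str]:
    """Filter a list of contributor names down to the suspicious ones."""
    return [name for name in names if looks_unsplit(name)]
```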

ibnesayeed commented 4 years ago

I am not sure which arXiv API is being used here, but I can see they are returning a properly structured list of authors.

$ curl -i "http://export.arxiv.org/api/query?id_list=1409.1284"
HTTP/1.1 200 OK
Date: Thu, 17 Sep 2020 18:54:11 GMT
Server: Apache
Access-control-allow-origin: *
Vary: Accept-Encoding,User-Agent
Transfer-Encoding: chunked
Content-Type: application/atom+xml; charset=UTF-8

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="http://arxiv.org/api/query?search_query%3D%26id_list%3D1409.1284%26start%3D0%26max_results%3D10" rel="self" type="application/atom+xml"/>
  <title type="html">ArXiv Query: search_query=&amp;id_list=1409.1284&amp;start=0&amp;max_results=10</title>
  <id>http://arxiv.org/api/4Aogd//oxmUL6yberwGVoBebXq0</id>
  <updated>2020-09-17T00:00:00-04:00</updated>
  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">1</opensearch:totalResults>
  <opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>
  <opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">10</opensearch:itemsPerPage>
  <entry>
    <id>http://arxiv.org/abs/1409.1284v1</id>
    <updated>2014-09-03T23:27:18Z</updated>
    <published>2014-09-03T23:27:18Z</published>
    <title>Improving Accessibility of Archived Raster Dictionaries of Complex
  Script Languages</title>
    <summary>  We propose an approach to index raster images of dictionary pages which in
turn would require very little manual effort to enable direct access to the
appropriate pages of the dictionary for lookup. Accessibility is further
improved by feedback and crowdsourcing that enables highlighting of the
specific location on the page where the lookup word is found, annotation,
digitization, and fielded searching. This approach is equally applicable on
simple scripts as well as complex writing systems. Using our proposed approach,
we have built a Web application called "Dictionary Explorer" which supports
word indexes in various languages and every language can have multiple
dictionaries associated with it. Word lookup gives direct access to appropriate
pages of all the dictionaries of that language simultaneously. The application
has exploration features like searching, pagination, and navigating the word
index through a tree-like interface. The application also supports feedback,
annotation, and digitization features. Apart from the scanned images,
"Dictionary Explorer" aggregates results from various sources and user
contributions in Unicode. We have evaluated the time required for indexing
dictionaries of different sizes and complexities in the Urdu language and
examined various trade-offs in our implementation. Using our approach, a single
person can make a dictionary of 1,000 pages searchable in less than an hour.
</summary>
    <author>
      <name>Sawood Alam</name>
    </author>
    <author>
      <name>Fateh ud din B Mehmood</name>
    </author>
    <author>
      <name>Michael L. Nelson</name>
    </author>
    <arxiv:doi xmlns:arxiv="http://arxiv.org/schemas/atom">10.1145/2756406.2756926</arxiv:doi>
    <link title="doi" href="http://dx.doi.org/10.1145/2756406.2756926" rel="related"/>
    <arxiv:comment xmlns:arxiv="http://arxiv.org/schemas/atom">11 pages, 5 images, 2 codes, 1 table</arxiv:comment>
    <link href="http://arxiv.org/abs/1409.1284v1" rel="alternate" type="text/html"/>
    <link title="pdf" href="http://arxiv.org/pdf/1409.1284v1" rel="related" type="application/pdf"/>
    <arxiv:primary_category xmlns:arxiv="http://arxiv.org/schemas/atom" term="cs.DL" scheme="http://arxiv.org/schemas/atom"/>
    <category term="cs.DL" scheme="http://arxiv.org/schemas/atom"/>
    <category term="cs.IR" scheme="http://arxiv.org/schemas/atom"/>
    <category term="H.3.3" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
</feed>
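
For reference, a minimal sketch of reading the structured author list out of an Atom response like the one above, using only the Python standard library. The sample is a trimmed copy of that response, and the function name `authors_by_entry` is made up for illustration; it is not part of fatcat.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Trimmed copy of the response shown above (only id and authors kept).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/1409.1284v1</id>
    <author><name>Sawood Alam</name></author>
    <author><name>Fateh ud din B Mehmood</name></author>
    <author><name>Michael L. Nelson</name></author>
  </entry>
</feed>"""


def authors_by_entry(atom_xml: str) -> dict[str, list[str]]:
    """Map each entry's <id> to the list of its <author>/<name> values."""
    root = ET.fromstring(atom_xml)
    result: dict[str, list[str]] = {}
    for entry in root.findall("atom:entry", ATOM_NS):
        entry_id = entry.findtext("atom:id", default="", namespaces=ATOM_NS)
        names = [
            name.text.strip()
            for name in entry.findall("atom:author/atom:name", ATOM_NS)
            if name.text
        ]
        result[entry_id] = names
    return result
```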

bnewbold commented 4 years ago

We use the OAI-PMH feed, in the arXivRaw schema, eg: http://export.arxiv.org/oai2?verb=GetRecord&identifier=oai:arXiv.org:1409.1284&metadataPrefix=arXivRaw

If I recall correctly, the reason for this is that the other schemas do not include information at the article-version level. OAI-PMH is also preferred for harvesting because we can pull daily updates. In theory we should be able to pull from multiple API endpoints and merge metadata, but that would be a larger change to the harvest/import pipeline.
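
To make the "pull from multiple endpoints and merge" idea concrete, here is a rough sketch. It only builds the two `GetRecord` URLs and merges already-parsed records; the dict layout (`versions`, `authors` keys) and the function names are assumptions for illustration, not how the fatcat importer is structured.

```python
from urllib.parse import urlencode

OAI_BASE = "http://export.arxiv.org/oai2"


def get_record_url(arxiv_id: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH GetRecord URL for one arXiv identifier."""
    params = {
        "verb": "GetRecord",
        "identifier": f"oai:arXiv.org:{arxiv_id}",
        "metadataPrefix": metadata_prefix,
    }
    return f"{OAI_BASE}?{urlencode(params)}"


def merge_records(raw_record: dict, structured_record: dict) -> dict:
    """Keep everything from the arXivRaw record, but prefer the
    separated author list from the other record when it is present."""
    merged = dict(raw_record)  # shallow copy; input is left untouched
    if structured_record.get("authors"):
        merged["authors"] = list(structured_record["authors"])
    return merged
```

A harvester would fetch `get_record_url(arxiv_id, "arXivRaw")` for version history and `get_record_url(arxiv_id, "arXiv")` for authors, parse both, and pass the results to `merge_records`.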

ibnesayeed commented 4 years ago

This is a strange OAI-PMH API: metadataPrefix=arXivRaw includes version history but consolidates author names into a single string, while metadataPrefix=arXiv separates author names but has no version history.

bnewbold commented 4 years ago

Yes. Also on the subject of arxiv author names, I think they have some sort of "canonical" representation as a string in their database somewhere, but not a unique identifier. They use this string to do author lookups (eg, if you click an author name on arxiv.org, it will try to show all papers by that author, and this might work better than a naive search for the author name string as listed in the PDF). I don't remember if this is documented. ORCID usage is not (yet) widespread enough to use as a true author identifier, but maybe that is changing and folks should require authors to have an ORCID when submitting.

My impression is that there has been extensive work in progress towards a new arxiv.org API, but that it hasn't launched yet. I think the transition from Cornell Libraries to the CS department resulted in a lot of staff turnover. Recent replies on the API discussion mailing list have come from unpaid folks (very appreciated!), not paid staff. If/when a new API is available which includes both granular author metadata and granular version metadata, we would switch to that.

ibnesayeed commented 4 years ago

> I think they have some sort of "canonical" representation as a string in their database somewhere, but not a unique identifier. They use this string to do author lookups (eg, if you click an author name on arxiv.org, it will try to show all papers by that author, and this might work better than a naive search for the author name string as listed in the PDF).

I think they use the last name, then a comma, followed by the initials of the first name and of the middle name, if present. This form of canonicalization returns a lot of false positives. For example, it returns 206 results when clicking on my name, but only 8 results when searching for my full name.
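
A small sketch of why a key of that shape over-matches. This is a guess at the form described above, not arXiv's actual implementation; `canonical_key` is a made-up name, and it naively treats the last whitespace-separated token as the last name.

```python
def canonical_key(full_name: str) -> str:
    """Build a 'Last, F M' style key from a full name (illustrative only)."""
    parts = full_name.split()
    if not parts:
        return ""
    last_name = parts[-1]
    # First letter of each remaining token (first and middle names).
    initials = " ".join(part[0].upper() for part in parts[:-1])
    return f"{last_name}, {initials}" if initials else last_name
```

Any two authors who share a last name and initials collapse to the same key: "Sawood Alam" and a hypothetical "Samir Alam" both become "Alam, S", which is consistent with a name click returning far more results than a full-name search.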

internetarchive / fatcat

arxiv: improve author name parsing ("and") #61