Open bnewbold opened 4 years ago
I am not sure which arXiv API is being used here, but I can see that this one returns a properly structured list of authors.
$ curl -i "http://export.arxiv.org/api/query?id_list=1409.1284"
HTTP/1.1 200 OK
Date: Thu, 17 Sep 2020 18:54:11 GMT
Server: Apache
Access-control-allow-origin: *
Vary: Accept-Encoding,User-Agent
Transfer-Encoding: chunked
Content-Type: application/atom+xml; charset=UTF-8
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<link href="http://arxiv.org/api/query?search_query%3D%26id_list%3D1409.1284%26start%3D0%26max_results%3D10" rel="self" type="application/atom+xml"/>
<title type="html">ArXiv Query: search_query=&amp;id_list=1409.1284&amp;start=0&amp;max_results=10</title>
<id>http://arxiv.org/api/4Aogd//oxmUL6yberwGVoBebXq0</id>
<updated>2020-09-17T00:00:00-04:00</updated>
<opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">1</opensearch:totalResults>
<opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>
<opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">10</opensearch:itemsPerPage>
<entry>
<id>http://arxiv.org/abs/1409.1284v1</id>
<updated>2014-09-03T23:27:18Z</updated>
<published>2014-09-03T23:27:18Z</published>
<title>Improving Accessibility of Archived Raster Dictionaries of Complex
Script Languages</title>
<summary> We propose an approach to index raster images of dictionary pages which in
turn would require very little manual effort to enable direct access to the
appropriate pages of the dictionary for lookup. Accessibility is further
improved by feedback and crowdsourcing that enables highlighting of the
specific location on the page where the lookup word is found, annotation,
digitization, and fielded searching. This approach is equally applicable on
simple scripts as well as complex writing systems. Using our proposed approach,
we have built a Web application called "Dictionary Explorer" which supports
word indexes in various languages and every language can have multiple
dictionaries associated with it. Word lookup gives direct access to appropriate
pages of all the dictionaries of that language simultaneously. The application
has exploration features like searching, pagination, and navigating the word
index through a tree-like interface. The application also supports feedback,
annotation, and digitization features. Apart from the scanned images,
"Dictionary Explorer" aggregates results from various sources and user
contributions in Unicode. We have evaluated the time required for indexing
dictionaries of different sizes and complexities in the Urdu language and
examined various trade-offs in our implementation. Using our approach, a single
person can make a dictionary of 1,000 pages searchable in less than an hour.
</summary>
<author>
<name>Sawood Alam</name>
</author>
<author>
<name>Fateh ud din B Mehmood</name>
</author>
<author>
<name>Michael L. Nelson</name>
</author>
<arxiv:doi xmlns:arxiv="http://arxiv.org/schemas/atom">10.1145/2756406.2756926</arxiv:doi>
<link title="doi" href="http://dx.doi.org/10.1145/2756406.2756926" rel="related"/>
<arxiv:comment xmlns:arxiv="http://arxiv.org/schemas/atom">11 pages, 5 images, 2 codes, 1 table</arxiv:comment>
<link href="http://arxiv.org/abs/1409.1284v1" rel="alternate" type="text/html"/>
<link title="pdf" href="http://arxiv.org/pdf/1409.1284v1" rel="related" type="application/pdf"/>
<arxiv:primary_category xmlns:arxiv="http://arxiv.org/schemas/atom" term="cs.DL" scheme="http://arxiv.org/schemas/atom"/>
<category term="cs.DL" scheme="http://arxiv.org/schemas/atom"/>
<category term="cs.IR" scheme="http://arxiv.org/schemas/atom"/>
<category term="H.3.3" scheme="http://arxiv.org/schemas/atom"/>
</entry>
</feed>
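For reference, here is a minimal sketch of pulling the per-author `<name>` elements out of an Atom response like the one above. It parses a trimmed copy of that response rather than calling the API; `extract_authors` and `SAMPLE_FEED` are names made up for this example and are not part of any existing harvester.

```python
import xml.etree.ElementTree as ET

# Atom namespace, as declared on the <feed> element in the response above.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Trimmed copy of the response above (hypothetical constant for this sketch).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/1409.1284v1</id>
    <author><name>Sawood Alam</name></author>
    <author><name>Fateh ud din B Mehmood</name></author>
    <author><name>Michael L. Nelson</name></author>
  </entry>
</feed>
"""


def extract_authors(feed_xml):
    """Map each entry <id> to its list of author names, in document order."""
    root = ET.fromstring(feed_xml)
    authors_by_entry = {}
    for entry in root.findall("atom:entry", ATOM_NS):
        entry_id = entry.findtext("atom:id", default="", namespaces=ATOM_NS)
        names = [
            (name.text or "").strip()
            for name in entry.findall("atom:author/atom:name", ATOM_NS)
        ]
        authors_by_entry[entry_id] = names
    return authors_by_entry


if __name__ == "__main__":
    for entry_id, names in extract_authors(SAMPLE_FEED).items():
        print(entry_id, names)
```

Each author arrives as its own element, so no string splitting is needed on this endpoint.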
We use the OAI-PMH feed, in the arXivRaw schema, e.g.: http://export.arxiv.org/oai2?verb=GetRecord&identifier=oai:arXiv.org:1409.1284&metadataPrefix=arXivRaw
If I recall correctly, the reason for this is that the other schemas do not include information at the article-version level. OAI-PMH is also preferred for harvesting because we can pull daily updates. In theory we should be able to pull from multiple API endpoints and merge metadata, but that would be a larger change to the harvest/import pipeline.
This is a strange OAI-PMH API: metadataPrefix=arXivRaw includes version history but consolidates the author names into a single string, while metadataPrefix=arXiv separates the author names but has no version history.
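To compare the two schemas side by side, a small sketch like the following builds the GetRecord URL for each metadataPrefix. The base URL, verb, and identifier format are copied from the URL quoted earlier in this thread; the helper name `oai_get_record_url` is made up for this example.

```python
from urllib.parse import urlencode

# Base endpoint taken from the URL quoted earlier in the thread.
OAI_BASE = "http://export.arxiv.org/oai2"


def oai_get_record_url(arxiv_id, metadata_prefix):
    """Build an OAI-PMH GetRecord URL for one arXiv id and one schema."""
    query = {
        "verb": "GetRecord",
        "identifier": f"oai:arXiv.org:{arxiv_id}",
        "metadataPrefix": metadata_prefix,
    }
    # Keep ':' unescaped so the result matches the URL form used in the thread.
    return f"{OAI_BASE}?{urlencode(query, safe=':')}"


if __name__ == "__main__":
    # arXivRaw: version history, authors as one string.
    # arXiv: separated authors, no version history.
    for prefix in ("arXivRaw", "arXiv"):
        print(oai_get_record_url("1409.1284", prefix))
```

Merging would then mean fetching both records for the same identifier and taking versions from one and authors from the other, which is the pipeline change mentioned above.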
Yes. Also on the subject of arxiv author names, I think they have some sort of "canonical" representation as a string in their database somewhere, but not a unique identifier. They use this string to do author lookups (eg, if you click an author name on arxiv.org, it will try to show all papers by that author, and this might work better than a naive search for the author name string as listed in the PDF). I don't remember if this is documented. ORCID usage is not (yet) widespread enough to use as a true author identifier, but maybe that is changing and folks should require authors to have an ORCID when submitting.
My impression is that there has been extensive work in progress towards a new arxiv.org API, but that it hasn't launched yet. I think the transition from Cornell Libraries to the CS department resulted in a lot of staff turnover. Recent replies on the API discussion mailing list have come from unpaid folks (very appreciated!), not paid staff. If/when a new API is available which includes both granular author metadata and granular version metadata, we would switch to that.
> I think they have some sort of "canonical" representation as a string in their database somewhere, but not a unique identifier. They use this string to do author lookups (eg, if you click an author name on arxiv.org, it will try to show all papers by that author, and this might work better than a naive search for the author name string as listed in the PDF).
I think they use the last name, then a comma, followed by the initials of the first name and of the middle name, if present. This form of canonicalization returns a lot of false positives. For example, it returns 206 results when clicking on my name, but only 8 results when searching for my full name.
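To illustrate why that kind of key over-matches, here is a rough sketch of the canonicalization as guessed above (last name, comma, initials). This is only a guess at the scheme, not arXiv's documented behavior; the function name, the exact output format, and the second author name are made up for the example.

```python
def canonical_author_key(full_name):
    """Reduce a display name to 'Last, I J' (guessed scheme, not arXiv's spec)."""
    # Treat periods after initials as whitespace, then split into tokens.
    parts = full_name.replace(".", " ").split()
    if not parts:
        return ""
    # Naive assumption: the final token is the last name.
    *given, last = parts
    initials = " ".join(token[0].upper() for token in given)
    return f"{last}, {initials}" if initials else last


if __name__ == "__main__":
    # "Shahid Alam" is a hypothetical second author used only for illustration.
    for name in ("Sawood Alam", "Shahid Alam", "Michael L. Nelson"):
        print(name, "->", canonical_author_key(name))
```

Two different people who share a last name and a first initial collapse to the same key, which would explain the gap between the 206 and 8 results.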
Our arxiv harvester receives author metadata as a single string, with individual author names separated by commas and "and".
Here is the function: https://github.com/internetarchive/fatcat/blob/master/python/fatcat_tools/importers/arxiv.py#L24
In some cases, as discovered by Sawood, this doesn't work and all author names come through as a single string. For example:
https://fatcat.wiki/release/c5s6d7f7w5b3himgditfbiu5nq https://fatcat.wiki/release/f7j4lf4aqfeqlaqfrtayt62rwe
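As a rough illustration of the splitting described above (commas and "and" as separators), here is a minimal sketch. It is not the linked fatcat function; `split_author_string` and the separator pattern are made up for this example, and the semicolon case is only one hypothetical way an input could fall through as a single string, not a diagnosis of the releases linked above.

```python
import re

# Separators: a comma (optionally followed by "and"), or a standalone "and".
_SEPARATOR = re.compile(r"\s*,\s*(?:and\s+)?|\s+and\s+")


def split_author_string(raw):
    """Split a consolidated author string into individual names."""
    # Collapse newlines and repeated whitespace before splitting.
    collapsed = " ".join(raw.split())
    return [name for name in _SEPARATOR.split(collapsed) if name]


if __name__ == "__main__":
    print(split_author_string(
        "Sawood Alam, Fateh ud din B Mehmood and Michael L. Nelson"
    ))
    # An unexpected separator leaves everything in one element.
    print(split_author_string("Sawood Alam; Michael L. Nelson"))
```

Any splitter of this kind depends on the separator conventions holding for every record, which is why separated author elements from another endpoint would be more robust.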
Fixing this will include: