Open ksachs opened 6 years ago
Any solution would need to be general enough to also handle journals, where it might be harder to decide on a time basis. @ksachs maybe you could produce a list of already rejected arXiv IDs/DOIs?
For arXiv: any article that is not in INSPIRE (incl. deleted records). Btw., would a revised version of a deleted record find that record or create a new one? I can give you a list, but I'm not sure that makes sense. @tsgit is there a good way to get all arXiv IDs from arXiv?
For journals we can provide a list of DOIs. But not everything has a DOI. Do you have a scheme for identifying those?
You can use OAI-PMH ListIdentifiers:
http://export.arxiv.org/oai2?verb=ListIdentifiers&metadataPrefix=arXiv
or
http://export.arxiv.org/oai2?verb=ListIdentifiers&metadataPrefix=oai_dc
which comes in chunks of 10k, so it is relatively quick to get all 1.3 million. However, you only get set membership, not exact categories:
<OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd"><responseDate>2018-02-12T16:22:34Z</responseDate><request verb="ListIdentifiers" metadataPrefix="oai_dc">http://export.arxiv.org/oai2</request><ListIdentifiers>
<header>
<identifier>oai:arXiv.org:0704.0001</identifier>
<datestamp>2008-11-26</datestamp>
<setSpec>physics:hep-ph</setSpec>
</header>
<header>
<identifier>oai:arXiv.org:0704.0002</identifier>
<datestamp>2008-12-13</datestamp>
<setSpec>cs</setSpec>
<setSpec>math</setSpec>
</header>
<header>
<identifier>oai:arXiv.org:0704.0003</identifier>
<datestamp>2008-01-13</datestamp>
<setSpec>physics:physics</setSpec>
</header>
....
<resumptionToken cursor="0" completeListSize="1357057">2416028|10001</resumptionToken></ListIdentifiers></OAI-PMH>
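To make the shape of that response concrete, here is a minimal sketch of how one chunk could be parsed offline. The function name `parse_list_identifiers` and the trimmed `SAMPLE` string are illustrative assumptions; the sample re-adds the `xmlns` declaration that real OAI-PMH responses carry but that is missing from the paste above. The returned resumption token is what a client would send back to fetch the next chunk.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Trimmed copy of the response above (assumption: xmlns restored).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
<ListIdentifiers>
<header>
<identifier>oai:arXiv.org:0704.0001</identifier>
<datestamp>2008-11-26</datestamp>
<setSpec>physics:hep-ph</setSpec>
</header>
<header>
<identifier>oai:arXiv.org:0704.0002</identifier>
<datestamp>2008-12-13</datestamp>
<setSpec>cs</setSpec>
<setSpec>math</setSpec>
</header>
<resumptionToken cursor="0" completeListSize="1357057">2416028|10001</resumptionToken>
</ListIdentifiers>
</OAI-PMH>"""


def parse_list_identifiers(xml_text):
    """Return ({identifier: [setSpec, ...]}, resumption_token_or_None)."""
    root = ET.fromstring(xml_text)
    headers = {}
    for header in root.iterfind(".//oai:header", OAI_NS):
        identifier = header.findtext("oai:identifier", namespaces=OAI_NS)
        set_specs = [s.text for s in header.findall("oai:setSpec", OAI_NS)]
        headers[identifier] = set_specs
    # An empty or missing token means this was the last chunk.
    token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    return headers, token or None


if __name__ == "__main__":
    parsed, next_token = parse_list_identifiers(SAMPLE)
    for identifier, set_specs in parsed.items():
        print(identifier, ", ".join(set_specs))
    print("next token:", next_token)
```

Note that only top-level sets such as `cs` or `physics:hep-ph` are available this way, which matches the caveat about set membership versus exact categories.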
@ksachs
$ python ./arxivids.py
harvesting via ListIdentifiers from arXiv between 2018-02-11 and 2018-02-12
oai:arXiv.org:1802.03098 cs
oai:arXiv.org:1802.03094 physics:cond-mat
oai:arXiv.org:1710.03117 cs, math
oai:arXiv.org:1802.03096 math, physics:cond-mat, physics:math-ph
oai:arXiv.org:1802.03097 math
oai:arXiv.org:1802.03090 physics:astro-ph
oai:arXiv.org:1802.03091 physics:nucl-ex
oai:arXiv.org:1802.03093 math
...
#!/usr/bin/env python3
import argparse
import sys
from datetime import date, timedelta

from dateutil.parser import parse
from sickle import Sickle
from sickle.oaiexceptions import OAIError


def get_arxiv_ids(start=None, end=None):
    """
    Get the arXiv records last modified between the start and end dates
    and return a dict mapping each OAI identifier to its setSpecs
    """
    if start is None:
        return {}

    oaiargs = {
        'metadataPrefix': 'arXivRaw',
        'from': start
    }
    if end is not None:
        oaiargs['until'] = end

    sickle = Sickle('http://export.arxiv.org/oai2')
    try:
        records = sickle.ListIdentifiers(**oaiargs)
    except OAIError as e:
        # e.g. noRecordsMatch when nothing was modified in the window
        print(e, file=sys.stderr)
        return {}

    arXivIds = {}
    for rec in records:
        arXivIds[rec.identifier] = rec.setSpecs
    return arXivIds


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="""
    Check Inspire holdings for coverage of arXiv Core categories in specified OAI-PMH date window
    """)
    parser.add_argument('-f', '--from',
                        type=parse,
                        dest='start',
                        default=(date.today() - timedelta(1)).isoformat(),
                        help='from argument to OAI-PMH verbs, ISO 8601 e.g. 2018-01-20')
    parser.add_argument('-u', '--until',
                        type=parse,
                        nargs='?',
                        dest='end',
                        default=date.today().isoformat(),
                        help='until argument to OAI-PMH verbs, ISO 8601 e.g. 2018-01-21')
    args = parser.parse_args()

    start = args.start.date()
    # "-u" given without a value leaves args.end as None: no upper bound
    end = args.end.date() if args.end is not None else None

    print("harvesting via ListIdentifiers from arXiv between %s and %s" %
          (start, end))
    ids = get_arxiv_ids(start, end)
    for arxiv_id, sets in ids.items():
        print("{}\t{}".format(arxiv_id, ', '.join(sets)))
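As a rough sketch of the next step, the harvested identifiers could be compared against INSPIRE holdings to find candidates for the rejected list ("any article that is not in INSPIRE"). The helper names `strip_oai_prefix` and `missing_from_inspire` are hypothetical, and how the INSPIRE side of the comparison is obtained is left open here; the function only assumes an iterable of plain arXiv IDs.

```python
OAI_PREFIX = "oai:arXiv.org:"


def strip_oai_prefix(oai_identifier):
    """Turn 'oai:arXiv.org:1802.03098' into '1802.03098'."""
    if oai_identifier.startswith(OAI_PREFIX):
        return oai_identifier[len(OAI_PREFIX):]
    return oai_identifier


def missing_from_inspire(harvested, inspire_ids):
    """
    Return the sorted arXiv IDs that were harvested but are not held.

    harvested   -- iterable of OAI identifiers (e.g. the dict keys
                   produced by the script above)
    inspire_ids -- iterable of plain arXiv IDs known to INSPIRE
                   (assumption: supplied by the caller)
    """
    known = set(inspire_ids)
    return sorted(
        arxiv_id
        for arxiv_id in map(strip_oai_prefix, harvested)
        if arxiv_id not in known
    )


if __name__ == "__main__":
    harvested = {
        "oai:arXiv.org:1802.03098": ["cs"],
        "oai:arXiv.org:1802.03091": ["physics:nucl-ex"],
    }
    print(missing_from_inspire(harvested, ["1802.03091"]))
```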
@michamos the list of rejected arXiv IDs is in /afs/cern.ch/project/inspire/uploads/arxiv.rejected.gz, not including records of this month (1802).
Maybe this is related: what triggered the harvest of https://inspirehep.net/record/1648878 ?
001648878 035__ $$9arXiv$$aoai:arXiv.org:hep-ph/9607356
001648878 037__ $$9arXiv$$ahep-ph/9607356$$chep-ph
001648878 037__ $$aUDEA-96-52
001648878 100__ $$aRestrepo, D.A.
001648878 245__ $$9arXiv$$aFrom hierarchical radiative quark mass matrices and mixings to FCNC in $SU(3)_c\times SU(2)_L\times U(1)_Y\times U(1)_H$
001648878 269__ $$c1996-07-17
That record was withdrawn and deleted: https://inspirehep.net/record/420876/export/hm
@ksachs That doesn't seem to have anything to do with Labs. If you look at the first version of the record, its 541 field looks like the thing the legacy OAI-harvester would write, not hepcrawl or the workflow.
[1604.08842](https://labs.inspirehep.net/holdingpen/list/?page=1&size=10&q=1604.08842) must have been rejected via the holdingpen. It is not in INSPIRE and we were harvesting via labs in April. Was it deleted because it was in error state or something? Updates coming in now should be auto-rejected.
It's from April 2016, not 2018. Why do you think it must have been harvested via labs?
Because I can't distinguish 6 from 8. Sorry.
Expected Behavior
Old arXiv articles that were rejected via the DESY workflow (i.e. are neither in INSPIRE nor in the holdingpen) should be auto-rejected. E.g. treat everything with an arXiv identifier of 2017 or older that is not in INSPIRE as rejected.
There will be a similar problem when we start journal harvesting.
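The rule proposed above for arXiv could be sketched roughly as follows. This is only an illustration under stated assumptions: `arxiv_year` and `should_autoreject` are hypothetical names, the two boolean flags stand in for lookups against INSPIRE and the holdingpen that are not shown, and the cutoff year is taken from the example in the text. Two-digit years are mapped to a century using the fact that arXiv identifiers start in 1991.

```python
import re

# New-style IDs: YYMM.NNNN or YYMM.NNNNN, optional version suffix.
NEW_STYLE = re.compile(r"^(?P<yy>\d{2})\d{2}\.\d{4,5}(v\d+)?$")
# Old-style IDs: archive[.SC]/YYMMNNN, optional version suffix.
OLD_STYLE = re.compile(r"^[a-z-]+(\.[A-Z]{2})?/(?P<yy>\d{2})\d{5}(v\d+)?$")


def arxiv_year(arxiv_id):
    """Return the four-digit year encoded in an arXiv identifier."""
    if arxiv_id.lower().startswith("arxiv:"):
        arxiv_id = arxiv_id[len("arxiv:"):]
    match = NEW_STYLE.match(arxiv_id) or OLD_STYLE.match(arxiv_id)
    if match is None:
        raise ValueError("not an arXiv identifier: %r" % arxiv_id)
    yy = int(match.group("yy"))
    # arXiv started in 1991, so 91-99 belong to the 1990s.
    return 1900 + yy if yy >= 91 else 2000 + yy


def should_autoreject(arxiv_id, in_inspire, in_holdingpen, cutoff_year=2017):
    """
    Sketch of the proposed rule: an article from cutoff_year or earlier
    that is neither in INSPIRE nor in the holdingpen is treated as
    already rejected.
    """
    if in_inspire or in_holdingpen:
        return False
    return arxiv_year(arxiv_id) <= cutoff_year


if __name__ == "__main__":
    for candidate in ("1705.03462", "hep-ph/9607356", "1802.03098"):
        print(candidate, arxiv_year(candidate),
              should_autoreject(candidate, False, False))
```

A time-based rule like this does not carry over directly to journals, where a list of rejected DOIs or another identifier scheme would be needed instead.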
Current Behavior / Example
https://labs.inspirehep.net/holdingpen/884920 (e-Print: arXiv:1705.03462 [astro-ph.GA], revised version) is not in INSPIRE because it was rejected via the DESY workflow. It is halted for curator approval.
Context
This is esp. nasty because the arXiv ID is not shown in the brief listing and the record looks new.