inspirehep / inspire-next

The INSPIRE repo.
https://inspirehep.net
GNU General Public License v3.0

holdingpen: rejected arXiv records that did not go via holdingpen #3173

Open ksachs opened 6 years ago

ksachs commented 6 years ago

Expected Behavior

Old arXiv articles that were rejected via the DESY workflow (i.e. are neither in INSPIRE nor in the holdingpen) should be autorejected. E.g. treat everything with an arXiv identifier of 2017 or older that is not in INSPIRE as rejected.
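A minimal sketch of the year-based rule proposed above, assuming the year is read off the arXiv identifier itself. The names `arxiv_year` and `should_autoreject` are hypothetical, and the `in_inspire` flag stands in for a lookup against INSPIRE that is not shown here.

```python
import re

# New-style IDs (since April 2007): YYMM.NNNN or YYMM.NNNNN, optional version.
NEW_STYLE = re.compile(r'^(\d{2})(\d{2})\.\d{4,5}(v\d+)?$')
# Old-style IDs (until March 2007): archive[.SC]/YYMMNNN, optional version.
OLD_STYLE = re.compile(r'^[a-z-]+(\.[A-Z]{2})?/(\d{2})(\d{2})\d{3}(v\d+)?$')


def arxiv_year(arxiv_id):
    """Return the four-digit year encoded in an arXiv ID, or None."""
    if arxiv_id.lower().startswith('arxiv:'):
        arxiv_id = arxiv_id[len('arxiv:'):]
    match = NEW_STYLE.match(arxiv_id)
    if match:
        return 2000 + int(match.group(1))
    match = OLD_STYLE.match(arxiv_id)
    if match:
        yy = int(match.group(2))
        # old-style IDs run from 1991 (91) to 2007 (07)
        return 1900 + yy if yy >= 91 else 2000 + yy
    return None


def should_autoreject(arxiv_id, in_inspire, cutoff_year=2017):
    """Hypothetical rule: old enough and not in INSPIRE means rejected."""
    year = arxiv_year(arxiv_id)
    return year is not None and year <= cutoff_year and not in_inspire


print(should_autoreject('arXiv:1705.03462', in_inspire=False))
print(should_autoreject('1802.03098', in_inspire=False))
```

An ID that cannot be parsed yields `None` and is never auto-rejected, so anything unexpected still reaches a curator.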

There will be a similar problem when we start journal harvesting.

Current Behavior / Example

https://labs.inspirehep.net/holdingpen/884920 e-Print: arXiv:1705.03462 [astro-ph.GA] - PDF (revised version) is not in INSPIRE; it was rejected via the DESY workflow, but it is halted for curator approval.

Context

This is esp. nasty because the arXiv ID is not shown in brief listing and the record looks new.

Screenshots (if appropriate):

[screenshot: old_article]

michamos commented 6 years ago

Any solution would need to be general enough to also handle journals, where it might be harder to decide on a time basis. @ksachs maybe you could produce a list of already rejected arXiv IDs/DOIs?

ksachs commented 6 years ago

for arXiv: any article that is not in INSPIRE (incl. deleted records). Btw. would a revised version of a deleted record find that record or create a new one? I can give you a list, but I'm not sure that makes sense. @tsgit is there a good way to get all arXiv IDs from arXiv?

For journals we can provide a list of DOIs. But not everything has a DOI. Do you have a scheme how to identify those?

tsgit commented 6 years ago

you can use OAI-PMH ListIdentifiers http://export.arxiv.org/oai2?verb=ListIdentifiers&metadataPrefix=arXiv or http://export.arxiv.org/oai2?verb=ListIdentifiers&metadataPrefix=oai_dc which comes in chunks of 10k, so it is relatively quick to get all 1.3 million. However, you only get set membership, not exact categories.

<OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2018-02-12T16:22:34Z</responseDate>
<request verb="ListIdentifiers" metadataPrefix="oai_dc">http://export.arxiv.org/oai2</request>
<ListIdentifiers>
<header>
<identifier>oai:arXiv.org:0704.0001</identifier>
<datestamp>2008-11-26</datestamp>
<setSpec>physics:hep-ph</setSpec>
</header>
<header>
<identifier>oai:arXiv.org:0704.0002</identifier>
<datestamp>2008-12-13</datestamp>
<setSpec>cs</setSpec>
<setSpec>math</setSpec>
</header>
<header>
<identifier>oai:arXiv.org:0704.0003</identifier>
<datestamp>2008-01-13</datestamp>
<setSpec>physics:physics</setSpec>
</header>
....
<resumptionToken cursor="0" completeListSize="1357057">2416028|10001</resumptionToken>
</ListIdentifiers>
</OAI-PMH>
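
For reference, a small sketch of how one chunk like the response above could be parsed with the standard library. It assumes the live response declares the OAI-PMH default namespace, which the pasted excerpt omits. The function name `parse_list_identifiers` and the shortened `SAMPLE` are illustrative only.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = '{http://www.openarchives.org/OAI/2.0/}'


def parse_list_identifiers(xml_text):
    """Return ({identifier: [setSpec, ...]}, resumption token or None)."""
    root = ET.fromstring(xml_text)
    ids = {}
    for header in root.iter(OAI_NS + 'header'):
        identifier = header.findtext(OAI_NS + 'identifier')
        ids[identifier] = [s.text for s in header.findall(OAI_NS + 'setSpec')]
    token = root.findtext('.//' + OAI_NS + 'resumptionToken')
    # an empty or missing token means this was the last chunk
    return ids, (token or None)


# Shortened stand-in for one response chunk, with the namespace declared.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
<ListIdentifiers>
<header>
<identifier>oai:arXiv.org:0704.0001</identifier>
<datestamp>2008-11-26</datestamp>
<setSpec>physics:hep-ph</setSpec>
</header>
<header>
<identifier>oai:arXiv.org:0704.0002</identifier>
<datestamp>2008-12-13</datestamp>
<setSpec>cs</setSpec>
<setSpec>math</setSpec>
</header>
<resumptionToken cursor="0" completeListSize="1357057">2416028|10001</resumptionToken>
</ListIdentifiers>
</OAI-PMH>"""

ids, token = parse_list_identifiers(SAMPLE)
print(ids)
print(token)
```

To fetch the next chunk, the token is sent back as `verb=ListIdentifiers&resumptionToken=...` without `metadataPrefix`, since OAI-PMH treats `resumptionToken` as an exclusive argument.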
tsgit commented 6 years ago

@ksachs

$ python ./arxivids.py 
harvesting via ListIdentifiers from arXiv between 2018-02-11 and 2018-02-12
oai:arXiv.org:1802.03098        cs
oai:arXiv.org:1802.03094        physics:cond-mat
oai:arXiv.org:1710.03117        cs, math
oai:arXiv.org:1802.03096        math, physics:cond-mat, physics:math-ph
oai:arXiv.org:1802.03097        math
oai:arXiv.org:1802.03090        physics:astro-ph
oai:arXiv.org:1802.03091        physics:nucl-ex
oai:arXiv.org:1802.03093        math
...

#!/usr/bin/python

import argparse
import sys

from datetime import date, timedelta
from dateutil.parser import parse
from sickle import Sickle

def get_arxiv_ids(start=None, end=None):
    """
    Get the arXiv records last modified between start and end dates
    and return a dict mapping each OAI identifier to its setSpecs
    """
    if start is None:
        return {}

    oaiargs = {
        'metadataPrefix': 'arXivRaw',
        'from': start
    }
    if end is not None:
        oaiargs['until'] = end

    sickle = Sickle('http://export.arxiv.org/oai2')

    try:
        records = sickle.ListIdentifiers(**oaiargs)
    except Exception as e:
        # report the failure and return an empty result instead of
        # falling through to an undefined `records`
        print(e)
        return {}

    arXivIds = {}
    for rec in records:
        arxid = rec.identifier
        asets = rec.setSpecs
        arXivIds[arxid] = asets
    return arXivIds

if __name__ == '__main__':

    parser = argparse.ArgumentParser(description="""
    Check Inspire holdings for coverage of arXiv Core categories in specified OAI-PMH date window
    """)
    parser.add_argument('-f', '--from',
                        type=parse,
                        dest='start',
                        default=(date.today() - timedelta(1)).isoformat(),
                        help='from argument to OAI-PMH verbs, ISO 8601 e.g. 2018-01-20')
    parser.add_argument('-u', '--until',
                        type=parse,
                        nargs='?',
                        dest='end',
                        default=date.today().isoformat(),
                        help='until argument to OAI-PMH verbs, ISO 8601 e.g. 2018-01-21')

    args = parser.parse_args()

    print("harvesting via ListIdentifiers from arXiv between %s and %s" %
          (args.start.date(), args.end.date()))

    ids = get_arxiv_ids(args.start.date(), args.end.date())
    # items() and print() work on both Python 2 and Python 3
    for arxiv_id, sets in ids.items():
        print("{}\t{}".format(arxiv_id, ', '.join(sets)))

ksachs commented 6 years ago

@michamos list of rejected arXiv-IDs is in /afs/cern.ch/project/inspire/uploads/arxiv.rejected.gz not including records of this month 1802
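
A rough sketch of how such a list could be consulted before halting a record. It assumes the gzipped file holds one arXiv ID per line, which the thread does not confirm. The helper names and the demo file are made up for illustration, and the demo content is a stand-in rather than the real list.

```python
import gzip
import os
import re
import tempfile

PREFIX = re.compile(r'^(oai:arXiv\.org:|arXiv:)', re.IGNORECASE)
VERSION = re.compile(r'v\d+$')


def normalise(arxiv_id):
    """Strip an 'oai:arXiv.org:' or 'arXiv:' prefix and a trailing version."""
    return VERSION.sub('', PREFIX.sub('', arxiv_id.strip()))


def load_rejected_ids(path):
    """Read a gzipped text file into a set of IDs (assumed: one ID per line)."""
    rejected = set()
    with gzip.open(path, 'rb') as handle:
        for raw in handle:
            line = raw.decode('utf-8').strip()
            if line:
                rejected.add(normalise(line))
    return rejected


def is_known_rejected(arxiv_id, rejected):
    """True if the normalised ID appears in the rejected set."""
    return normalise(arxiv_id) in rejected


# Demo with a throwaway file; the single entry is a stand-in.
demo_path = os.path.join(tempfile.mkdtemp(), 'rejected.gz')
with gzip.open(demo_path, 'wb') as handle:
    handle.write(b'1705.03462\n')

rejected = load_rejected_ids(demo_path)
print(is_known_rejected('arXiv:1705.03462v2', rejected))
print(is_known_rejected('oai:arXiv.org:1802.03098', rejected))
```

Normalising both sides means a revised version (`v2`, `v3`, ...) of a rejected article matches the original entry, which is the case described in this issue.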

ksachs commented 6 years ago

Maybe this is related: What triggered the harvest of https://inspirehep.net/record/1648878

001648878 035__ $$9arXiv$$aoai:arXiv.org:hep-ph/9607356
001648878 037__ $$9arXiv$$ahep-ph/9607356$$chep-ph
001648878 037__ $$aUDEA-96-52
001648878 100__ $$aRestrepo, D.A.
001648878 245__ $$9arXiv$$aFrom hierarchical radiative quark mass matrices and mixings to FCNC in $SU(3)_c\times SU(2)_L\times U(1)_Y\times U(1)_H$ 
001648878 269__ $$c1996-07-17

That record was withdrawn and deleted https://inspirehep.net/record/420876/export/hm

michamos commented 6 years ago

@ksachs That doesn't seem to have anything to do with Labs. If you look at the first version of the record, its 541 field looks like the thing the legacy OAI-harvester would write, not hepcrawl or the workflow.

ksachs commented 6 years ago

[1604.08842] (https://labs.inspirehep.net/holdingpen/list/?page=1&size=10&q=1604.08842) must have been rejected via the holdingpen. It is not in INSPIRE and we were harvesting via labs in April. Was it deleted because it was in error state or something? Updates coming in now should be auto-rejected.

michamos commented 6 years ago

It's from April 2016, not 2018. Why do you think it must have been harvested via labs?

ksachs commented 6 years ago

because I can't distinguish 6 from 8. Sorry.