inspirehep / hepcrawl

Scrapy project for feeds into INSPIRE-HEP
http://inspirehep.net

hepcrawl: add crawler for OSTI #276

Open tsgit opened 5 years ago

tsgit commented 5 years ago
* use API at OSTI to harvest records associated with SLAC

Signed-off-by: Thorsten Schwander <thorsten.schwander@gmail.com>

Description

This adds a LastRunSpider to crawl OSTI for records with SLAC association. The purpose is to satisfy an institutional mandate of having all SLAC HEP research represented in Inspire. Not all SLAC research output is on arXiv or other customarily harvested channels. OSTI is an additional channel to check.

tsgit commented 5 years ago

very good comments @michamos thanks

tsgit commented 5 years ago

right, I agree that schema_utils shouldn't deal with encoding issues -- which means there will be some sanitizing of random input in the crawler. It's not like the remote end serves stuff in a consistent encoding, it's random crap in the remote metadata -- so the crawler should understand the quirks of the source.
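To make the "sanitizing in the crawler" idea concrete, here is a minimal sketch of a decoding fallback that could live next to the spider. The helper name `to_text` and the list of encodings are assumptions for illustration; nothing here says which encodings OSTI actually serves, so the order would have to be tuned to what the source really sends.

```python
def to_text(raw, encodings=("utf-8", "cp1252")):
    """Decode bytes from a remote source, trying each encoding in order.

    Hypothetical helper; the candidate encodings are a guess, not OSTI facts.
    """
    if not isinstance(raw, bytes):
        return raw  # already decoded text, nothing to do
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a code point, so it cannot fail.
    return raw.decode("latin-1")
```

Keeping this in the crawler means schema_utils only ever sees decoded text, and the source-specific guesswork stays with the code that knows the source.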

On the other hand, you advocate for collaboration splitting and normalization in the utils, but then there is no deduping!? If the input data has "Virgo collaboration; Ligo collaboration; Virgo and Ligo collaborations", then the collaborations end up replicated.

So I think LiteratureBuilder should ensure deduping of lists like collection and collaborations among others. That's beyond this PR, though.
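A rough sketch of the problem and of an order-preserving dedupe, using the input string from above. Both functions are hypothetical stand-ins written for this comment; they are not the actual splitting code in the utils or anything in LiteratureBuilder, and the case-insensitive comparison is an assumption about what should count as a duplicate.

```python
import re


def split_collaborations(raw):
    """Split a semicolon-separated collaboration string into bare names.

    Hypothetical helper for illustration only.
    """
    names = []
    for chunk in raw.split(";"):
        # Drop a trailing "collaboration" / "collaborations" suffix.
        chunk = re.sub(r"\s+collaborations?\s*$", "", chunk.strip(), flags=re.IGNORECASE)
        # "Virgo and Ligo" becomes two separate entries.
        names.extend(part.strip() for part in re.split(r"\s+and\s+", chunk) if part.strip())
    return names


def dedupe(values):
    """Drop case-insensitive duplicates, keeping first-seen order and spelling."""
    seen = set()
    unique = []
    for value in values:
        key = value.lower()
        if key not in seen:
            seen.add(key)
            unique.append(value)
    return unique


raw = "Virgo collaboration; Ligo collaboration; Virgo and Ligo collaborations"
split = split_collaborations(raw)   # ['Virgo', 'Ligo', 'Virgo', 'Ligo']
collaborations = dedupe(split)      # ['Virgo', 'Ligo']
```

Splitting alone yields every name twice; the dedupe step is what a builder would need to apply to list fields such as collaborations.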

I don't feel strongly about __method vs. _method, but I did actually follow advice from some python coding resources online about encapsulation. The one I linked above isn't the one I used, but it's comparable, and I think it makes a decent argument. It'll always be a problem when encapsulation is enforced by naming convention and not by code, though.
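For reference, a small sketch of what the two spellings actually do in Python. The class and method names are made up for illustration and are not taken from this PR.

```python
class BaseSpider(object):
    def _single(self):
        return "base single"

    def __double(self):  # stored on the class as _BaseSpider__double
        return "base double"

    def run(self):
        # self.__double is compiled as self._BaseSpider__double here,
        # so a subclass cannot override it by accident.
        return self._single(), self.__double()


class ChildSpider(BaseSpider):
    def _single(self):  # ordinary override, picked up by run()
        return "child single"

    def __double(self):  # stored as _ChildSpider__double, never seen by run()
        return "child double"


spider = ChildSpider()
result = spider.run()                     # ('child single', 'base double')
mangled = spider._BaseSpider__double()    # 'base double'
```

The double underscore only renames the attribute per class; as the last line shows, it is still reachable from outside, so the encapsulation remains a convention and not something the language enforces.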