Open tsgit opened 5 years ago
very good comments @michamos thanks
right, I agree that schema_utils shouldn't deal with encoding issues -- which means there will be some sanitizing of random input in the crawler. It's not like the remote end serves stuff in a consistent encoding, it's random crap in the remote metadata -- so the crawler should understand the quirks of the source.
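To make "sanitizing in the crawler" concrete, here is a minimal sketch of decoding remote bytes with a fallback chain. The function name `decode_remote_text` and the candidate encodings are assumptions for illustration only; they are not taken from this PR or from schema_utils.

```python
def decode_remote_text(raw, encodings=("utf-8", "cp1252")):
    """Decode bytes from a remote source, trying each candidate encoding in turn.

    Hypothetical helper: the name and the encoding list are assumptions.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            # This candidate does not fit the bytes; try the next one.
            continue
    # Last resort: keep what decodes as UTF-8 and mark the rest with U+FFFD.
    return raw.decode("utf-8", errors="replace")
```

The point is only that source-specific guesses like these live next to the spider that knows the source, so the shared utilities can assume clean text.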
on the other hand, you advocate for collaboration splitting and normalization in the utils, but then there is no deduping!?
if the input data has "Virgo collaboration; Ligo collaboration; Virgo and Ligo collaborations", then after splitting the collaborations end up replicated.
So I think LiteratureBuilder should ensure deduping of lists like collection and collaborations, among others. That's beyond this PR, though.
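For illustration, the kind of split-then-dedupe step I mean could look roughly like this. It is a sketch, not the LiteratureBuilder or schema_utils API: `split_collaborations` is a made-up name, and the splitting rules (on ";" and on the word "and") are assumptions that would break for collaboration names that themselves contain "and".

```python
import re


def split_collaborations(raw):
    """Split a raw collaboration string and return unique names in first-seen order.

    Hypothetical helper: the name and the splitting rules are assumptions.
    """
    seen = set()
    unique = []
    # Split on ";" first, then on the word "and" inside each chunk.
    for chunk in raw.split(";"):
        for part in re.split(r"\s+and\s+", chunk):
            # Drop a trailing "collaboration" or "collaborations" and whitespace.
            name = re.sub(
                r"\s*collaborations?\s*$", "", part.strip(), flags=re.IGNORECASE
            )
            # Compare case-insensitively, but keep the first spelling seen.
            key = name.casefold()
            if name and key not in seen:
                seen.add(key)
                unique.append(name)
    return unique


print(split_collaborations(
    "Virgo collaboration; Ligo collaboration; Virgo and Ligo collaborations"
))
# prints ['Virgo', 'Ligo']
```

Without the `seen` check, the same input would yield Virgo and Ligo twice each, which is the replication described above.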
I don't feel strongly about __method vs. _method, but I did actually follow advice from some python coding resources online about encapsulation. The one I linked above isn't the one I used, but it's comparable, and I think it makes a decent argument. It'll always be a problem when encapsulation is enforced by naming convention and not by code, though.
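For reference, the practical difference between the two spellings is Python's name mangling. A small sketch with made-up classes (`ExampleSpider` and `CustomSpider` are illustrative names, not the spider in this PR):

```python
class ExampleSpider:
    def _parse(self):
        # Single underscore: private by convention only, overridable as usual.
        return "base _parse"

    def __clean(self):
        # Double underscore: stored on the class as _ExampleSpider__clean.
        return "base __clean"

    def run(self):
        # self.__clean is compiled here as self._ExampleSpider__clean.
        return self._parse(), self.__clean()


class CustomSpider(ExampleSpider):
    def _parse(self):
        return "custom _parse"

    def __clean(self):
        # Stored as _CustomSpider__clean, so it does not override the base method.
        return "custom __clean"


print(CustomSpider().run())
# prints ('custom _parse', 'base __clean')
```

So `__method` protects against accidental overrides in subclasses, but the mangled name is still reachable from outside (`obj._ExampleSpider__clean()`), which is why it remains a convention rather than enforcement.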
Signed-off-by: Thorsten Schwander <thorsten.schwander@gmail.com>
Description
This adds a LastRunSpider to crawl OSTI for records with SLAC association. The purpose is to satisfy an institutional mandate of having all SLAC HEP research represented in Inspire. Not all SLAC research output is on arXiv or other customarily harvested channels. OSTI is an additional channel to check.