inspirehep / hepcrawl

Scrapy project for feeds into INSPIRE-HEP
http://inspirehep.net

PoS spider: use OAI-PMH spider #265

Open vbalbp opened 5 years ago

vbalbp commented 5 years ago

Signed-off-by: Victor Balbuena <vbalbp@gmail.com>

vbalbp commented 5 years ago

The spider works fine, both the normal and the single spiders. The tests fail, though, because the new adaptation completely breaks what was there before. Apart from that, the functional tests for cds and arxiv fail because of the removal of this setting:

# Allow duplicate requests
DUPEFILTER_CLASS = "scrapy.dupefilters.BaseDupeFilter"

However, since we harvest the proceedings page as well as the paper, we get the proceedings multiple times in one run: it is fetched once per record, even when every record belongs to the same proceedings (the usual case when harvesting by sets, since sets are conferences). With that line removed, we get the proceedings record only once instead of multiple times.
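
To illustrate the effect, here is a minimal sketch in plain Python. It is not hepcrawl or Scrapy code: `NoFilter`, `FingerprintFilter`, `harvest` and the example.org URLs are hypothetical stand-ins, and the only assumption taken from Scrapy is that `BaseDupeFilter` lets every request through while the default filter drops requests whose fingerprint was already seen.

```python
import hashlib


class NoFilter:
    """Hypothetical stand-in for a filter that lets every request through."""

    def request_seen(self, url):
        return False


class FingerprintFilter:
    """Hypothetical stand-in for a filter that drops already-seen URLs."""

    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, url):
        # Fingerprint the URL and report whether it was seen before.
        fp = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False


def harvest(records, dupefilter):
    """Return the URLs that would actually be fetched, in request order."""
    fetched = []
    for record in records:
        # Each record triggers a request for the paper and for its proceedings.
        for url in (record["paper_url"], record["proceedings_url"]):
            if not dupefilter.request_seen(url):
                fetched.append(url)
    return fetched


# Hypothetical set: three papers that all belong to the same conference.
PROCEEDINGS_URL = "https://example.org/proceedings/1"
RECORDS = [
    {
        "paper_url": "https://example.org/paper/%d" % i,
        "proceedings_url": PROCEEDINGS_URL,
    }
    for i in range(1, 4)
]

without_filtering = harvest(RECORDS, NoFilter())
with_filtering = harvest(RECORDS, FingerprintFilter())
```

With no filtering, the proceedings URL is requested once per record; with fingerprint filtering it is requested once for the whole set. If cds and arxiv really do need duplicate requests, one option might be to pass `dont_filter=True` on those specific Scrapy requests instead of disabling filtering globally, though I have not checked that against the failing tests.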