kermitt2 / article_dataset_builder

Open Access PDF harvester, metadata aggregator and full-text ingester
Apache License 2.0

Issue with harvest_dois #8

Open jameshowison opened 7 months ago

jameshowison commented 7 months ago

I'm running into an issue where harvest_pmcids works but harvest_dois does not. For pmcids the PDFs are gathered, but for harvest_dois they are not.

I first ran into this with arXiv DOIs, and then reproduced it with the DOIs in the test folder of this project.

The symptom is that harvester.diagnostic(full=True) shows "total invalid PDF: 7" when I run with the test DOIs.

Any chance that something is broken in the doi list approach, but not in the pmcids approach?

kermitt2 commented 6 months ago

Hi @jameshowison !

The reason is that arXiv DOIs are not CrossRef DOIs but DataCite DOIs. This module only resolves CrossRef ones, so it ends up with 0 PDFs found. This is a general problem with the multiple new DOI providers and the fact that preprint services now use these free DOIs.

I made something specific for arXiv, https://github.com/kermitt2/arxiv_harvester, for creating a full arXiv mirror, but it is not meant for fetching just a few arXiv PDFs.

jameshowison commented 6 months ago

Hmmm. Two things then,

  1. The DOIs in https://github.com/kermitt2/article_dataset_builder/blob/master/test/dois.txt are also not working for me. Those aren't arXiv DOIs, are they?
  2. Where should the documentation mention the issue with non-CrossRef DOIs? Maybe the method should be renamed harvest_crossref_dois? Is there some way to detect DOIs that the module can't obtain?
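
On the detection question, one possible approach is to ask doi.org which registration agency issued each DOI before harvesting. The sketch below assumes the doi.org `doiRA` lookup endpoint and a JSON response shaped like `[{"DOI": "...", "RA": "DataCite"}]`; that shape, and the exact agency label "Crossref", should be checked against a live response. The helper names (`normalize_doi`, `registration_agency`, `split_by_agency`) are hypothetical and not part of article_dataset_builder.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: doi.org exposes a registration-agency lookup at this path.
DOI_RA_ENDPOINT = "https://doi.org/doiRA/"

# Prefixes people commonly put in front of a bare DOI.
_DOI_PREFIXES = ("doi:", "https://doi.org/", "http://doi.org/")


def normalize_doi(raw):
    """Return the bare DOI, without a leading 'doi:' or doi.org URL prefix."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in _DOI_PREFIXES:
        if lowered.startswith(prefix):
            return doi[len(prefix):].strip()
    return doi


def fetch_ra_payload(doi):
    """Fetch the raw JSON text for one DOI from the lookup endpoint."""
    url = DOI_RA_ENDPOINT + quote(doi, safe="/")
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def registration_agency(raw_doi, fetch=fetch_ra_payload):
    """Return the agency name reported for the DOI, or None if absent."""
    records = json.loads(fetch(normalize_doi(raw_doi)))
    if isinstance(records, list) and records and isinstance(records[0], dict):
        return records[0].get("RA")
    return None


def split_by_agency(dois, fetch=fetch_ra_payload):
    """Split DOIs into (Crossref, everything else), both normalized."""
    crossref, other = [], []
    for raw in dois:
        agency = registration_agency(raw, fetch)
        target = crossref if agency == "Crossref" else other
        target.append(normalize_doi(raw))
    return crossref, other
```

The `fetch` parameter is injectable so the logic can be tried offline with a canned response; DOIs that land in the second list could then be reported to the user instead of silently counting as invalid PDFs.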

Looks like the arXiv DOIs work using arxiv_base from the config.harvester file if you strip off the arxiv. prefix from the front of the DOI suffix. E.g.

doi:10.48550/arxiv.1808.06161

works to get direct PDF via

https://arxiv.org/pdf/1808.06161
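
As a rough sketch of that mapping, assuming arXiv DataCite DOIs always have the form `10.48550/arxiv.<id>` and that arxiv_base is set to `https://arxiv.org/pdf/`: the function name `arxiv_pdf_url` and the constant `ARXIV_BASE` are made up for illustration and are not part of the module.

```python
import re

# Assumed value of arxiv_base; adjust to whatever the config actually holds.
ARXIV_BASE = "https://arxiv.org/pdf/"

# Matches an arXiv DataCite DOI, with or without a leading "doi:",
# and captures the arXiv identifier that follows "arxiv.".
ARXIV_DOI = re.compile(r"^(?:doi:)?10\.48550/arxiv\.(.+)$", re.IGNORECASE)


def arxiv_pdf_url(doi, base=ARXIV_BASE):
    """Return the direct PDF URL for an arXiv DOI, or None for other DOIs."""
    match = ARXIV_DOI.match(doi.strip())
    if match is None:
        return None
    return base.rstrip("/") + "/" + match.group(1)


print(arxiv_pdf_url("doi:10.48550/arxiv.1808.06161"))
```

A non-arXiv DOI returns None, so a caller could fall back to the existing CrossRef-based lookup in that case.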