jannisborn / paperscraper

Tools to scrape publication metadata from pubmed, arxiv, medrxiv and chemrxiv.
MIT License

Graceful handling of connection errors #35

jannisborn closed this 11 months ago

jannisborn commented 11 months ago

Closes #34

When scraping biorxiv/medrxiv, connection errors occasionally occur, as described in #34. This PR handles such errors more gracefully: instead of aborting the dump, the scraper retries the download of the same batch of papers up to max_retries times.
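For readers who want a feel for the retry pattern described above, here is a minimal sketch. It is not the code from this PR. The helper name `call_with_retries`, the `fetch_batch` callable, and the `wait_seconds` parameter are hypothetical; only the `max_retries` name and the log message shape are taken from this thread. It catches Python's built-in `ConnectionError`; a scraper built on `requests` would likely need to catch `requests.exceptions.ConnectionError` instead.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def call_with_retries(
    fetch_batch: Callable[[], T],  # hypothetical: downloads one batch of papers
    max_retries: int = 10,
    wait_seconds: float = 0.0,  # hypothetical: pause between attempts
) -> T:
    """Call fetch_batch, retrying the same batch on connection errors."""
    attempt = 0
    while True:
        try:
            return fetch_batch()
        except ConnectionError as exc:
            attempt += 1
            if attempt > max_retries:
                # All retries used up: surface the original error.
                raise
            logger.error(
                "Connection error: %s. Retrying (%d/%d)", exc, attempt, max_retries
            )
            time.sleep(wait_seconds)
```

Because the helper re-invokes the same callable, a failed batch is requested again rather than skipped, so the resulting dump should not have gaps as long as one of the retries succeeds.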

Version bump to 0.2.8

jannisborn commented 11 months ago

Still downloading but looks like this now:

>>> from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
WARNING:paperscraper.load_dumps: No dump found for biorxiv. Skipping entry.
WARNING:paperscraper.load_dumps: No dump found for chemrxiv. Skipping entry.
WARNING:paperscraper.load_dumps: No dump found for medrxiv. Skipping entry.
WARNING:paperscraper.load_dumps: No dumps found for either biorxiv or medrxiv. Consider using paperscraper.get_dumps.* to fetch the dumps.
>>> medrxiv()
5101it [03:59, 22.87it/s]ERROR:paperscraper.xrxiv.xrxiv_api:Connection error: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')). Retrying (1/10)
26101it [24:57,  5.00it/s]