Closed by paul-tqh-nguyen 5 years ago
Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/d5db32e7d3755aec5aa88347e9f7bfde221c3793
This patch adds several robustness improvements.
URL requests now time out if we wait too long for a response.
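For anyone following along, here is a rough sketch of what a timeout on a URL fetch could look like using only the standard library. This is not the code from the patch; the function name `get_url_text`, the 10-second default, and the injectable `opener` parameter are all assumptions made for illustration.

```python
import socket
import urllib.error
import urllib.request
from typing import Callable, Optional


def get_url_text(
    url: str,
    timeout_seconds: float = 10.0,  # hypothetical default, not taken from the patch
    opener: Callable = urllib.request.urlopen,  # injectable so it can be tested offline
) -> Optional[str]:
    """Return the decoded body of `url`, or None if the request times out or fails."""
    try:
        # urlopen raises if no response arrives within `timeout_seconds`.
        with opener(url, timeout=timeout_seconds) as response:
            return response.read().decode("utf-8", errors="replace")
    except (socket.timeout, urllib.error.URLError):
        # Treat a timeout or connection failure as "no result" so the caller
        # can decide whether to skip or retry.
        return None
```

Returning `None` instead of raising keeps a single slow request from stopping the whole scrape; the caller can then skip that URL or try it again later.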
Sometimes the arXiv server has issues and gives us HTML with garbage like "
There were some typo fixes as well.
We're seeing this when scraping occasionally:
Let's be robust against that.
I'll attach a more complete log below.
error_1.txt
The file below contains the code that produced the above log.
extract_transform_utilities.zip