paul-tqh-nguyen / arxiv_as_a_newspaper

arxiv.org portrayed as if it were a newspaper.

More Robustly Handle arXiv errors #11

Closed paul-tqh-nguyen closed 5 years ago

paul-tqh-nguyen commented 5 years ago

We occasionally see this when scraping:

<dl>
<dt>Error with 1906.02978</dt>
</dl>

Let's be robust against that.

I'll attach a more complete log below.

error_1.txt

The file below contains the code that produced the above log.

extract_transform_utilities.zip

paul-tqh-nguyen commented 5 years ago

Progress Patch: https://github.com/paul-tqh-nguyen/arxiv_as_a_newspaper/commit/d5db32e7d3755aec5aa88347e9f7bfde221c3793

This patch implements several robustness improvements.

We now have a timeout if we're waiting too long to get a result from a URL.
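
For illustration only, here is a minimal sketch of what a fetch with a timeout could look like. It is not the code from the patch: the function name `fetch_url_text`, the 10-second default, and the injectable `opener` parameter are all assumptions made for this example, and it uses only the standard library.

```python
import socket
import urllib.error
import urllib.request
from typing import Callable, Optional


def fetch_url_text(
    url: str,
    timeout_seconds: float = 10.0,  # assumed default, not taken from the patch
    opener: Callable = urllib.request.urlopen,  # injectable so it can be faked in tests
) -> Optional[str]:
    """Return the body at `url` as text, or None if the request times out or fails."""
    try:
        with opener(url, timeout=timeout_seconds) as response:
            return response.read().decode("utf-8", errors="replace")
    except (socket.timeout, urllib.error.URLError):
        # Treat a slow or unreachable server as "no result" instead of crashing.
        return None
```

A caller can then skip or retry any URL for which this returns None rather than letting the scrape die.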

Sometimes the arXiv server has issues and gives us HTML with garbage like "Error with 1906.02978". The ETL used to raise unhandled exceptions in such situations. Those situations are now handled properly.
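
As a rough sketch of how such a response might be detected before parsing (again, not the patch itself): the helper names `find_arxiv_error_id` and `parse_if_valid` are hypothetical, and the regular expression assumes the error text always has the form "Error with" followed by a new-style arXiv identifier, which is only what the log above shows.

```python
import re
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Assumed shape of the error text, based on the single example in this issue.
ARXIV_ERROR_PATTERN = re.compile(r"Error with\s+(\d{4}\.\d{4,5})")


def find_arxiv_error_id(html: str) -> Optional[str]:
    """Return the arXiv id named in an error snippet, or None if there is none."""
    match = ARXIV_ERROR_PATTERN.search(html)
    return match.group(1) if match else None


def parse_if_valid(html: str, parse: Callable[[str], T]) -> Optional[T]:
    """Run `parse` only when the page does not look like an arXiv error response."""
    if find_arxiv_error_id(html) is not None:
        return None  # skip the bad page instead of raising
    return parse(html)
```

With this shape, the ETL can log the offending id and move on to the next entry.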

There were some typo fixes as well.