Closed rafguns closed 1 year ago
The code currently works but we find substantially fewer results. Some notes from comparing the doi_fulltext
tables:
In 350 cases, something went wrong before the fulltext URL was found, so these DOIs are not in that table. To investigate: compare the doi_meta tables.
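A minimal sketch of that comparison, assuming both doi_meta tables can be loaded as lists of dicts with a `doi` column. The function name, the column name, and the sample rows are hypothetical stand-ins, not the actual data or schema:

```python
def missing_dois(old_rows, new_rows, key="doi"):
    """Return the DOIs present in old_rows but absent from new_rows, sorted."""
    old = {row[key] for row in old_rows}
    new = {row[key] for row in new_rows}
    return sorted(old - new)


# Hypothetical stand-ins for the two doi_meta tables (old run vs. new run).
meta_req = [{"doi": "10.1000/a"}, {"doi": "10.1000/b"}, {"doi": "10.1000/c"}]
meta_httpx = [{"doi": "10.1000/a"}]

# DOIs that the old run handled but the new run lost along the way.
lost = missing_dois(meta_req, meta_httpx)
print(lost)
```

The resulting list gives concrete DOIs to re-run by hand, which is usually quicker than reasoning from the totals.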
Some volatility in HTTP errors but nothing that really caught my eye:
status_code_req  status_code_httpx  count
200.0            200.0                591
200.0            403.0                 15
200.0            429.0                  7
401.0            200.0                  1
401.0            401.0                 32
403.0            200.0                 19
403.0            403.0                 51
403.0            429.0                 13
429.0            403.0                  4
429.0            429.0                 23
Errors in general. Again, nothing too suspicious:
error_req                          error2_httpx                                  count
HTTP error                         HTTP error                                      123
HTTP error                         HTTP error: The read operation timed out          1
HTTP error                         none                                             42
SSL error                          none                                              5
Time out or connection error       HTTP error: The read operation timed out          1
Time out or connection error       none                                              1
Time out, URL or connection error  HTTP error: [Errno 11001] getaddrinfo failed      1
Time out, URL or connection error  none                                              5
none                               HTTP error                                       22
none                               HTTP error: The read operation timed out          6
none                               none                                            913
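For reference, pair counts like the ones above can be reproduced with the standard library alone. This is only a sketch: `pair_counts` and the sample rows are hypothetical, and the real comparison may well have been done differently (e.g. with pandas).

```python
from collections import Counter


def pair_counts(rows, left, right):
    """Count how often each (left, right) value pair occurs in rows."""
    return Counter((row.get(left), row.get(right)) for row in rows)


# Hypothetical merged rows: one per DOI, with the outcome from each run.
rows = [
    {"status_code_req": 200, "status_code_httpx": 200},
    {"status_code_req": 200, "status_code_httpx": 200},
    {"status_code_req": 200, "status_code_httpx": 403},
    {"status_code_req": 403, "status_code_httpx": None},
]

counts = pair_counts(rows, "status_code_req", "status_code_httpx")

# Keep only the pairs where the two runs disagree; these are the ones to inspect.
disagreements = [(pair, n) for pair, n in counts.most_common() if pair[0] != pair[1]]
print(disagreements)
```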
OK, one possible cause is that the default User-Agent of httpx is sometimes blocked (e.g. by ScienceDirect). I just checked in code that poses as Chrome.
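A sketch of what posing as Chrome can look like with httpx, which by default identifies itself as `python-httpx/<version>`. The exact User-Agent string, the `make_client` helper, and the timeout value are assumptions for illustration, not necessarily what was committed:

```python
# Assumed Chrome-like User-Agent string; any current desktop Chrome UA would do.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
}


def make_client(timeout=30.0):
    """Build an httpx client that sends the browser-like headers on every request."""
    # Imported here so the header constant can be reused without httpx installed.
    import httpx

    return httpx.Client(
        headers=BROWSER_HEADERS,
        follow_redirects=True,
        timeout=timeout,
    )
```

Usage would be along the lines of `with make_client() as client: response = client.get(url)`. Setting the header on the client rather than per request keeps it consistent across redirects and retries.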
Spun off from #1. I'll make notes here.