Change code over to httpx

rafguns commented 1 year ago

Spun off from #1. I'll make notes here.

rafguns commented 1 year ago

The code currently works, but we find substantially fewer results than with the requests-based version. Some notes from comparing the doi_fulltext tables:

In 350 cases, something went wrong before the fulltext URL was found, so these DOIs are missing from that table. To investigate, compare the doi_meta tables.
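
A minimal sketch of how the missing DOIs could be listed, assuming both tables live in one SQLite database and share a `doi` column. The schema, the sample rows and the helper name `missing_fulltext` are hypothetical, not taken from the repository.

```python
import sqlite3


def missing_fulltext(con: sqlite3.Connection) -> list[str]:
    """Return DOIs present in doi_meta but absent from doi_fulltext."""
    rows = con.execute(
        """
        SELECT m.doi
        FROM doi_meta AS m
        LEFT JOIN doi_fulltext AS f ON f.doi = m.doi
        WHERE f.doi IS NULL   -- no matching fulltext row
        ORDER BY m.doi
        """
    )
    return [doi for (doi,) in rows]


# Toy data standing in for the real tables (assumed columns).
con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE doi_meta (doi TEXT PRIMARY KEY, status_code INTEGER);
    CREATE TABLE doi_fulltext (doi TEXT PRIMARY KEY, url TEXT);
    INSERT INTO doi_meta VALUES ('10.1/a', 200), ('10.1/b', 403), ('10.1/c', 200);
    INSERT INTO doi_fulltext VALUES ('10.1/a', 'https://example.org/a.pdf');
    """
)
missing = missing_fulltext(con)
```

Running the same query against the old and the new database and diffing the two lists would show which DOIs only fail under httpx.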

Some volatility in HTTP errors but nothing that really caught my eye:

status_code_req  status_code_httpx
200.0            200.0                591
                 403.0                 15
                 429.0                  7
401.0            200.0                  1
                 401.0                 32
403.0            200.0                 19
                 403.0                 51
                 429.0                 13
429.0            403.0                  4
                 429.0                 23
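
For reference, a cross-tabulation like the one above can be produced roughly as follows. This is a sketch with made-up rows: it assumes each doi_fulltext table has been loaded into a DataFrame with `doi` and `status_code` columns, and the function name `compare_status_codes` is hypothetical.

```python
import pandas as pd


def compare_status_codes(req: pd.DataFrame, new: pd.DataFrame) -> pd.Series:
    """Count (old status, new status) pairs for DOIs present in both tables."""
    # The shared column gets the suffixes: status_code_req, status_code_httpx.
    merged = req.merge(new, on="doi", suffixes=("_req", "_httpx"))
    return merged.groupby(["status_code_req", "status_code_httpx"]).size()


# Toy data, not the real results.
req = pd.DataFrame(
    {"doi": ["10.1/a", "10.1/b", "10.1/c"], "status_code": [200, 200, 403]}
)
new = pd.DataFrame(
    {"doi": ["10.1/a", "10.1/b", "10.1/c"], "status_code": [200, 403, 403]}
)
counts = compare_status_codes(req, new)
```

Note that `groupby` drops rows with a missing status code by default, so requests that failed before getting a response do not appear in such a table.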

Errors in general. Again, nothing too suspicious:

error_req                          error2_httpx                                
HTTP error                         HTTP error                                      123
                                   HTTP error: The read operation timed out          1
                                   none                                             42
SSL error                          none                                              5
Time out or connection error       HTTP error: The read operation timed out          1
                                   none                                              1
Time out, URL or connection error  HTTP error: [Errno 11001] getaddrinfo failed      1
                                   none                                              5
none                               HTTP error                                       22
                                   HTTP error: The read operation timed out          6
                                   none                                            913

rafguns commented 1 year ago

OK, one possible cause is that the default User-Agent of httpx is sometimes blocked (e.g. by ScienceDirect). I just checked in code to pose as Chrome.

rafguns commented 1 year ago

Fixed in https://github.com/rafguns/doidownloader/commit/61a16079094947136553dd58243cef365ebe2706
