jannisborn / paperscraper

Tools to scrape publication metadata from pubmed, arxiv, medrxiv and chemrxiv.
MIT License

Remote disconnected and didn't download files #34

Closed: bbanzai88 closed this issue 11 months ago

bbanzai88 commented 11 months ago

Hi, very cool project! It looks like I installed it correctly, and I ran this code in a Jupyter notebook:

from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv

medrxiv()   # Takes ~30min and should result in ~35 MB file
biorxiv()   # Takes ~1h and should result in ~350 MB file
chemrxiv()  # Takes ~45min and should result in ~20 MB file

I get this response:

61032it [20:29, 49.63it/s]
106700it [1:45:02, 16.93it/s]

And then I get the mess below. Any ideas on what I can do? Thank you!!

Sincerely,

tom

RemoteDisconnected                        Traceback (most recent call last)
File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:703, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    702 # Make the request on the httplib connection object.
--> 703 httplib_response = self._make_request(
    704     conn,
    705     method,
    706     url,
    707     timeout=timeout_obj,
    708     body=body,
    709     headers=headers,
    710     chunked=chunked,
    711 )
    713 # If we're going to release the connection in ``finally:``, then
    714 # the response doesn't need to know about the connection. Otherwise
    715 # it will also try to release it and we'll have a double-release
    716 # mess.

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:449, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    445         except BaseException as e:
    446             # Remove the TypeError from the exception chain in
    447             # Python 3 (including for exceptions like SystemExit).
    448             # Otherwise it looks like a bug in the code.
--> 449             six.raise_from(e, None)
    450 except (SocketTimeout, BaseSSLError, SocketError) as e:

File <string>:3, in raise_from(value, from_value)

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:444, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    443 try:
--> 444     httplib_response = conn.getresponse()
    445 except BaseException as e:
    446     # Remove the TypeError from the exception chain in
    447     # Python 3 (including for exceptions like SystemExit).
    448     # Otherwise it looks like a bug in the code.

File ~\anaconda3\lib\http\client.py:1377, in HTTPConnection.getresponse(self)
   1376 try:
-> 1377     response.begin()
   1378 except ConnectionError:

File ~\anaconda3\lib\http\client.py:320, in HTTPResponse.begin(self)
    319 while True:
--> 320     version, status, reason = self._read_status()
    321     if status != CONTINUE:

File ~\anaconda3\lib\http\client.py:289, in HTTPResponse._read_status(self)
    286 if not line:
    287     # Presumably, the server closed the connection before
    288     # sending a valid response.
--> 289     raise RemoteDisconnected("Remote end closed connection without"
    290                              " response")
    291 try:

RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

ProtocolError                             Traceback (most recent call last)
File ~\anaconda3\lib\site-packages\requests\adapters.py:486, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    485 try:
--> 486     resp = conn.urlopen(
    487         method=request.method,
    488         url=url,
    489         body=request.body,
    490         headers=request.headers,
    491         redirect=False,
    492         assert_same_host=False,
    493         preload_content=False,
    494         decode_content=False,
    495         retries=self.max_retries,
    496         timeout=timeout,
    497         chunked=chunked,
    498     )
    500 except (ProtocolError, OSError) as err:

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:785, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    783     e = ProtocolError("Connection aborted.", e)
--> 785 retries = retries.increment(
    786     method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
    787 )
    788 retries.sleep()

File ~\anaconda3\lib\site-packages\urllib3\util\retry.py:550, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
    549 if read is False or not self._is_method_retryable(method):
--> 550     raise six.reraise(type(error), error, _stacktrace)
    551 elif read is not None:

File ~\anaconda3\lib\site-packages\urllib3\packages\six.py:769, in reraise(tp, value, tb)
    768 if value.__traceback__ is not tb:
--> 769     raise value.with_traceback(tb)
    770 raise value

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:703, in HTTPConnectionPool.urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    702 # Make the request on the httplib connection object.
--> 703 httplib_response = self._make_request(
    704     conn,
    705     method,
    706     url,
    707     timeout=timeout_obj,
    708     body=body,
    709     headers=headers,
    710     chunked=chunked,
    711 )
    713 # If we're going to release the connection in ``finally:``, then
    714 # the response doesn't need to know about the connection. Otherwise
    715 # it will also try to release it and we'll have a double-release
    716 # mess.

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:449, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    445         except BaseException as e:
    446             # Remove the TypeError from the exception chain in
    447             # Python 3 (including for exceptions like SystemExit).
    448             # Otherwise it looks like a bug in the code.
--> 449             six.raise_from(e, None)
    450 except (SocketTimeout, BaseSSLError, SocketError) as e:

File <string>:3, in raise_from(value, from_value)

File ~\anaconda3\lib\site-packages\urllib3\connectionpool.py:444, in HTTPConnectionPool._make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    443 try:
--> 444     httplib_response = conn.getresponse()
    445 except BaseException as e:
    446     # Remove the TypeError from the exception chain in
    447     # Python 3 (including for exceptions like SystemExit).
    448     # Otherwise it looks like a bug in the code.

File ~\anaconda3\lib\http\client.py:1377, in HTTPConnection.getresponse(self)
   1376 try:
-> 1377     response.begin()
   1378 except ConnectionError:

File ~\anaconda3\lib\http\client.py:320, in HTTPResponse.begin(self)
    319 while True:
--> 320     version, status, reason = self._read_status()
    321     if status != CONTINUE:

File ~\anaconda3\lib\http\client.py:289, in HTTPResponse._read_status(self)
    286 if not line:
    287     # Presumably, the server closed the connection before
    288     # sending a valid response.
--> 289     raise RemoteDisconnected("Remote end closed connection without"
    290                              " response")
    291 try:

ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
File ~\anaconda3\lib\site-packages\paperscraper\xrxiv\xrxiv_api.py:71, in XRXivApi.get_papers(self, begin_date, end_date, fields)
     70 while do_loop:
---> 71     json_response = requests.get(
     72         self.get_papers_url.format(
     73             begin_date=begin_date, end_date=end_date, cursor=cursor
     74         )
     75     ).json()
     76     do_loop = json_response["messages"][0]["status"] == "ok"

File ~\anaconda3\lib\site-packages\requests\api.py:73, in get(url, params, **kwargs)
     63 r"""Sends a GET request.
     64 
     65 :param url: URL for the new :class:`Request` object.
   (...)
     70 :rtype: requests.Response
     71 """
---> 73 return request("get", url, params=params, **kwargs)

File ~\anaconda3\lib\site-packages\requests\api.py:59, in request(method, url, **kwargs)
     58 with sessions.Session() as session:
---> 59     return session.request(method=method, url=url, **kwargs)

File ~\anaconda3\lib\site-packages\requests\sessions.py:589, in Session.request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    588 send_kwargs.update(settings)
--> 589 resp = self.send(prep, **send_kwargs)
    591 return resp

File ~\anaconda3\lib\site-packages\requests\sessions.py:703, in Session.send(self, request, **kwargs)
    702 # Send the request
--> 703 r = adapter.send(request, **kwargs)
    705 # Total elapsed time of the request (approximately)

File ~\anaconda3\lib\site-packages\requests\adapters.py:501, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
    500 except (ProtocolError, OSError) as err:
--> 501     raise ConnectionError(err, request=request)
    503 except MaxRetryError as e:

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
Input In [2], in <cell line: 3>()
      1 from paperscraper.get_dumps import biorxiv, medrxiv, chemrxiv
      2 medrxiv()  #  Takes ~30min and should result in ~35 MB file
----> 3 biorxiv()  # Takes ~1h and should result in ~350 MB file
      4 chemrxiv()

File ~\anaconda3\lib\site-packages\paperscraper\get_dumps\biorxiv.py:42, in biorxiv(begin_date, end_date, save_path)
     40 # dump all papers
     41 with open(save_path, "w") as fp:
---> 42     for index, paper in enumerate(
     43         tqdm(api.get_papers(begin_date=begin_date, end_date=end_date))
     44     ):
     45         if index > 0:
     46             fp.write(os.linesep)

File ~\anaconda3\lib\site-packages\tqdm\std.py:1195, in tqdm.__iter__(self)
   1192 time = self._time
   1194 try:
-> 1195     for obj in iterable:
   1196         yield obj
   1197         # Update and possibly print the progressbar.
   1198         # Note: does not call self.update(1) for speed optimisation.

File ~\anaconda3\lib\site-packages\paperscraper\xrxiv\xrxiv_api.py:85, in XRXivApi.get_papers(self, begin_date, end_date, fields)
     83                 yield processed_paper
     84 except Exception as exc:
---> 85     raise RuntimeError(
     86         "Failed getting papers: {} - {}".format(exc.__class__.__name__, exc)
     87     )

RuntimeError: Failed getting papers: ConnectionError - ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
jannisborn commented 11 months ago

Hi @bbanzai88, thanks for your interest in the library. I can reproduce this ConnectionError locally when calling medrxiv().

This is a problem with the medrxiv API that is hard to mitigate directly in paperscraper. One intermediate solution would be to handle the try/except more gracefully here and add a time.sleep pause before retrying whenever the API does not respond.
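For reference, a minimal sketch of that workaround (the helper name and parameters here are hypothetical, not part of paperscraper's API):

import time

import requests

def get_json_with_retry(url, max_retries=5, base_delay=10):
    """Hypothetical helper: retry a GET request, pausing between
    attempts, when the server drops the connection."""
    for attempt in range(max_retries):
        try:
            return requests.get(url, timeout=60).json()
        except requests.exceptions.ConnectionError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (attempt + 1))  # linear backoff

Something like this could wrap the requests.get call in xrxiv_api.py so that a single dropped connection does not abort the whole dump.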

The best long-term solution is to solve #33; that would mean users no longer have to download the data to local disk and could directly query the dumps from a server where I host them. That would require more time, though.

memray commented 6 months ago

Hi @jannisborn, I'm facing the same ConnectionError after scraping a few hundred records. Could you share the dump somewhere so we don't have to scrape it repeatedly?

Thanks,

jannisborn commented 6 months ago

Hi @memray,

Which version of paperscraper are you using?

jannisborn commented 6 months ago

There is no update yet on #33, but if you have time to prepare a PR, that would be great and I'm happy to assist.

memray commented 6 months ago

Hi @jannisborn,

I created a PR to resolve the timeout issue; please review it. It's working pretty well for now (apart from lots of ConnectionError printouts).
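Side note on the noise: if those printouts come from urllib3's retry logging (an assumption, the PR may emit them differently), they can be quieted via the standard logging module:

import logging

# Silence urllib3's connection/retry warnings; real errors still raise.
logging.getLogger("urllib3").setLevel(logging.ERROR)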