saketkc / pysradb

Package for fetching metadata and downloading data from SRA/ENA/GEO
https://saketkc.github.io/pysradb
BSD 3-Clause "New" or "Revised" License

Data download is interrupted after a few minutes #195

Open sert23 opened 1 year ago

sert23 commented 1 year ago

Describe the bug

I'm not sure what's happening, but for the last few days I have been struggling to download data using pysradb. This worked without problems a couple of weeks ago. Here is the error I get:

File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/urllib3/response.py", line 444, in _error_catcher
yield
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/urllib3/response.py", line 567, in read
data = self._fp_read(amt) if not fp_closed else b""
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/urllib3/response.py", line 533, in _fp_read
return self._fp.read(amt) if amt is not None else self._fp.read()
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 460, in read
return self._read_chunked(amt)
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 583, in _read_chunked
chunk_left = self._get_chunk_left()
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 566, in _get_chunk_left
chunk_left = self._read_next_chunk_size()
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 526, in _read_next_chunk_size
line = self.fp.readline(_MAXLINE + 1)
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/socket.py", line 705, in readinto
return self._sock.recv_into(b)
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/ssl.py", line 1274, in recv_into
return self.read(nbytes, buffer)
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/ssl.py", line 1130, in read
return self._sslobj.read(len, buffer)
TimeoutError: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/eap/miRexpress/updates/code/run_update.py", line 200, in <module>
generate_raw_tsv("miRNA-seq", os.path.join(raw_folder, "miRNA-seq.tsv"))
File "/home/eap/miRexpress/updates/code/run_update.py", line 36, in generate_raw_tsv
instance.search()
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/pysradb/search.py", line 793, in search
self._format_response(r.raw)
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/pysradb/search.py", line 861, in _format_response
for event, elem in Et.iterparse(content):
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/xml/etree/ElementTree.py", line 1255, in iterator
data = source.read(16 * 1024)
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/urllib3/response.py", line 566, in read
with self._error_catcher():
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/urllib3/response.py", line 449, in _error_catcher
raise ReadTimeoutError(self._pool, None, "Read timed out.")

It seems like it's getting disconnected after some minutes. Is there a parameter I can change to make it retry or something similar? Are they blocking my IP? Is this a widespread recent issue?

To Reproduce

This now happens on every attempt, at a random point after a few minutes. In this example I'm trying to download information about all miRNA-seq samples in SRA:

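from pysradb.search import SraSearch

library_type = "miRNA-seq"  # defined elsewhere in the full script; set here so the snippet runs
instance = SraSearch(2, 1000000, strategy="miRNA-seq")  # verbosity=2, return_max=1000000
print("Downloading samples for " + library_type)
instance.search()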

Thanks a lot for writing this software and the support!!
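Regarding the retry question above: nothing in this thread confirms a built-in retry parameter in pysradb, so one option is to retry from the calling script. The following is a minimal sketch, not pysradb's own mechanism. The helper name `call_with_retries` and its parameters are hypothetical, and the commented usage assumes `SraSearch.get_df()` and the urllib3 exception classes seen in the traceback. Each retry restarts the whole query, so this can only help if the disconnects are intermittent.

```python
import time


def call_with_retries(make_call, retry_on=(Exception,), attempts=3,
                      base_delay=1.0, sleep=time.sleep):
    """Call make_call() until it succeeds or the attempts run out.

    Hypothetical helper: waits base_delay, 2*base_delay, 4*base_delay, ...
    between failed attempts and re-raises the last error.
    """
    for attempt in range(1, attempts + 1):
        try:
            return make_call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** (attempt - 1))


# Possible usage with pysradb (assumed, not run here):
#
# from pysradb.search import SraSearch
# from urllib3.exceptions import ProtocolError, ReadTimeoutError
#
# def run_search():
#     instance = SraSearch(2, 1000000, strategy="miRNA-seq")
#     instance.search()
#     return instance.get_df()
#
# df = call_with_retries(run_search,
#                        retry_on=(ProtocolError, ReadTimeoutError),
#                        attempts=5, base_delay=30.0)
```

The `sleep` argument is injectable only so the helper can be exercised without real waiting.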

sert23 commented 11 months ago

I am currently trying the same script again (previously working) and a different error happened this time.

Traceback (most recent call last):
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 566, in _get_chunk_left
chunk_left = self._read_next_chunk_size()
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 533, in _read_next_chunk_size
return int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 583, in _read_chunked
chunk_left = self._get_chunk_left()
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 568, in _get_chunk_left
raise IncompleteRead(b'')
http.client.IncompleteRead: IncompleteRead(0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/urllib3/response.py", line 444, in _error_catcher
yield
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/urllib3/response.py", line 567, in read
data = self._fp_read(amt) if not fp_closed else b""
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/urllib3/response.py", line 533, in _fp_read
return self._fp.read(amt) if amt is not None else self._fp.read()
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 460, in read
return self._read_chunked(amt)
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 598, in _read_chunked
raise IncompleteRead(b''.join(value))
http.client.IncompleteRead: IncompleteRead(4336 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/eap/miRexpress/updates/code/run_update.py", line 211, in <module>
generate_raw_tsv("miRNA-seq", os.path.join(raw_folder, "miRNA-seq.tsv"))
File "/home/eap/miRexpress/updates/code/run_update.py", line 38, in generate_raw_tsv
instance.search()
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/pysradb/search.py", line 793, in search
self._format_response(r.raw)
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/pysradb/search.py", line 861, in _format_response
for event, elem in Et.iterparse(content):
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/xml/etree/ElementTree.py", line 1255, in iterator
data = source.read(16 * 1024)
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/urllib3/response.py", line 566, in read
with self._error_catcher():
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/urllib3/response.py", line 461, in _error_catcher
raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(4336 bytes read)', IncompleteRead(4336 bytes read))
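The first error in the traceback above can be reproduced without any network access. As the traceback shows, `http.client` parses the size line that precedes each chunk of a chunked HTTP response with `int(line, 16)`. An empty line, which is what a read returns once the peer has closed the connection, produces the same `ValueError`. The function name `parse_chunk_size` below is hypothetical and is a simplified stand-in for the standard-library code. The sketch suggests, but does not prove, that the server ended the stream early.

```python
def parse_chunk_size(line: bytes) -> int:
    """Parse the hexadecimal size line that precedes each HTTP chunk."""
    # Drop optional chunk extensions after ';' before converting.
    size_field = line.split(b";", 1)[0]
    return int(size_field, 16)


# A well-formed size line parses to an integer byte count.
print(parse_chunk_size(b"1a\r\n"))

# An empty read (connection closed mid-response) fails like the traceback.
try:
    parse_chunk_size(b"")
except ValueError as exc:
    print(exc)
```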

saketkc commented 11 months ago

My recommendation is to use an external tool for downloading for now: https://github.com/saketkc/pysradb/issues/201#issuecomment-1843076201

sert23 commented 11 months ago

Sorry, I think my explanation was not clear. I'm only trying to download metadata.

saketkc commented 11 months ago

Is this what you are running? It seems okay at my end:

>>> instance = SraSearch(2, 1000000, strategy="miRNA-seq")
>>> df = instance.search()
  4%|█▍                                 | 5400/130053 [03:13<1:19:26, 26.15it/s]

sert23 commented 11 months ago

Yep, it starts running but it spits out this error after some minutes...

Traceback (most recent call last):
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 566, in _get_chunk_left
chunk_left = self._read_next_chunk_size()
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 533, in _read_next_chunk_size
return int(line, 16)
ValueError: invalid literal for int() with base 16: b''

I'm guessing something is not formatted properly on the SRA side (this happened to me before when parsing something else from SRA in Python). They include a '\b' somewhere in the description fields, and Python tries to parse it as some kind of binary string...

As a workaround, I'm trying to run the same query on GEO to see if this is parsed differently by them. Alternatively, is there a way to do a SraSearch query but only request the summary fields? (SRX and SRP). This could work for me.

Thanks for your help!

saketkc commented 11 months ago

You could try with verbosity=1

sert23 commented 11 months ago

Thank you, I will try that as a last resort. The problem is that I'm interested in all SRPs, and since verbosity=1 only returns experiment accessions, I would then have to query sample by sample to retrieve them.
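Another way to keep verbosity=2 while shortening each request might be to split the query into smaller time windows and concatenate the results. This is only a sketch under assumptions not confirmed in this thread: that `SraSearch` accepts a `publication_date` range of the form `dd-mm-yyyy:dd-mm-yyyy`, and that `get_df()` returns the results as a DataFrame. The helper name `yearly_windows` and the year range in the usage comment are hypothetical.

```python
from datetime import date


def yearly_windows(first_year, last_year):
    """Return one 'dd-mm-yyyy:dd-mm-yyyy' range string per calendar year."""
    windows = []
    for year in range(first_year, last_year + 1):
        start = date(year, 1, 1)
        end = date(year, 12, 31)
        windows.append(f"{start:%d-%m-%Y}:{end:%d-%m-%Y}")
    return windows


# Possible usage with pysradb (assumed, not run here):
#
# import pandas as pd
# from pysradb.search import SraSearch
#
# frames = []
# for window in yearly_windows(2008, 2023):
#     instance = SraSearch(2, 1000000, strategy="miRNA-seq",
#                          publication_date=window)
#     instance.search()
#     frames.append(instance.get_df())
# df = pd.concat(frames, ignore_index=True)
```

Smaller windows mean shorter responses, so a dropped connection costs one window rather than the whole download.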