To add to this, running the script again caught a different exception:
Traceback (most recent call last):
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
httplib_response = self._make_request(
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 426, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 421, in _make_request
httplib_response = conn.getresponse()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 1332, in getresponse
response.begin()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 303, in begin
version, status, reason = self._read_status()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 272, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
resp = conn.urlopen(
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 724, in urlopen
retries = retries.increment(
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/util/retry.py", line 403, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/packages/six.py", line 734, in reraise
raise value.with_traceback(tb)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
httplib_response = self._make_request(
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 426, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 421, in _make_request
httplib_response = conn.getresponse()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 1332, in getresponse
response.begin()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 303, in begin
version, status, reason = self._read_status()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 272, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 431, in _process_worker
r = call_item()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 285, in __call__
return self.fn(*self.args, **self.kwargs)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 595, in __call__
return self.func(*args, **kwargs)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/parallel.py", line 252, in __call__
return [func(*args, **kwargs)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/parallel.py", line 252, in <listcomp>
return [func(*args, **kwargs)
File "/projects/test/parallel_download_pysradb.py", line 8, in single_download
db.download(df=df_single, skip_confirmation=True)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/pysradb/sradb.py", line 1318, in download
file_sizes = df.apply(get_file_size, axis=1)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/pandas/core/frame.py", line 6878, in apply
return op.get_result()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/pandas/core/apply.py", line 186, in get_result
return self.apply_standard()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/pandas/core/apply.py", line 295, in apply_standard
result = libreduction.compute_reduction(
File "pandas/_libs/reduction.pyx", line 618, in pandas._libs.reduction.compute_reduction
File "pandas/_libs/reduction.pyx", line 128, in pandas._libs.reduction.Reducer.get_result
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/pysradb/download.py", line 54, in get_file_size
return float(requests.head(url).headers["content-length"])
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/api.py", line 104, in head
return request('head', url, **kwargs)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/sessions.py", line 643, in send
r = adapter.send(request, **kwargs)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/projects/test/parallel_download_pysradb.py", line 34, in <module>
Parallel(n_jobs=jobs)(
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/parallel.py", line 1042, in __call__
self.retrieve()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/parallel.py", line 921, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
raise self._exception
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Appreciate your help with this. Thanks, Zohaib
If I run the script again, does it in any way check which ones are downloaded already and skip them? Or, let's say, resume from where it left off?

It resumes downloads as long as the x.sra.part file exists. In the error you posted it seems it doesn't. Unfortunately, I am not able to replicate this at my end, so I am not sure how to help.

Do you have any opinion on using the example mentioned here on a SunGridEngine-based job queue system?

You will have to write a custom script to take an SRP, split it into subsets, and submit each subset dataframe to pysradb download. For example, if you want to download only one SRR:
(1) pysradb metadata SRR12100406 --detailed | pysradb download
You can get a list of SRRs using:
pysradb srp-to-srr SRP251618 --saveto SRP251618.tsv && cut -f 2 SRP251618.tsv
This list of SRRs can then be passed on to (1). You can use snakemake to take care of parallelization and to ensure jobs are rerun if they fail (because of network issues).
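As a rough illustration of that custom-script idea, here is a minimal sketch that fetches the metadata for an SRP, splits it into one-row DataFrames per run, and downloads each subset, retrying on the connection errors shown above. It is not an official pysradb recipe: the on-disk layout, the column names, the retry count, and the sleep interval are assumptions inferred from the traceback and error paths in this issue.

```python
# Sketch only: split an SRP into per-run subsets and download them one by one,
# retrying transient connection errors. The column names and the on-disk layout
# (<out_dir>/<SRP>/<SRX>/<SRR>.sra) are assumptions based on the paths in the
# traceback above, not guaranteed pysradb behaviour.
import time
from pathlib import Path

import requests
from pysradb.sradb import SRAweb

SRP = "SRP251618"
OUT_DIR = Path("pysradb_downloads")  # assumed default output directory
MAX_RETRIES = 3

db = SRAweb()
df = db.sra_metadata(SRP, detailed=True)

for srr, df_single in df.groupby("run_accession"):
    srx = df_single["experiment_accession"].iloc[0]
    final_file = OUT_DIR / SRP / srx / f"{srr}.sra"
    if final_file.exists():
        print(f"{srr} already downloaded, skipping")
        continue
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            db.download(df=df_single, skip_confirmation=True)
            break
        except requests.exceptions.ConnectionError as exc:
            print(f"{srr}: connection dropped ({exc}), retry {attempt}/{MAX_RETRIES}")
            time.sleep(30)
```

A job-queue submission script (for example on SunGridEngine) could call this once per SRR instead of looping, so the scheduler handles parallelization and reruns.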
Any updates on this?
I tried to optimize this outside of nextflow first, but it seems to take significantly more time to download compared to prefetch from sra-toolkit. I did not get time to look into the nextflow option yet but will do so at some point. pysradb has worked well for getting the metadata; if the download option is optimized, it may increase its usability significantly.
Thanks for reporting back. I haven't done any benchmarking against prefetch. Until now I was using a wget-like approach for downloading .sra files.
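For context, a wget-like download here essentially means issuing a HEAD request for the file size and then streaming the body to disk in chunks. Below is a minimal sketch of that pattern; the chunk size, the .part naming, and the rename-on-completion step are illustrative assumptions, not pysradb's exact implementation (only the content-length HEAD call matches pysradb's get_file_size seen in the traceback).

```python
# Minimal sketch of a wget-like download: HEAD for the size, then stream to a
# .part file and rename once the expected number of bytes has arrived.
from pathlib import Path

import requests


def download_file(url: str, dest: Path, chunk_size: int = 1024 * 1024) -> None:
    # Same call pysradb's get_file_size uses to determine the expected size.
    size = float(requests.head(url).headers["content-length"])
    part = dest.with_suffix(dest.suffix + ".part")
    with requests.get(url, stream=True) as r, open(part, "wb") as fh:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=chunk_size):
            fh.write(chunk)
    if part.stat().st_size == size:
        part.rename(dest)
```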
With the latest commit on the master branch, pysradb supports multithreaded downloads. This works both for downloading .sra files and for directly downloading .fastq.gz files. Feel free to give it a try and let me know if you have any comments.
Example notebook here: https://colab.research.google.com/drive/1rpQ00uUdaa6evB9QjLxOCUzITcckNTjN?usp=sharing
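For a quick try without opening the notebook, the usage is roughly along these lines; the threads keyword shown here is an assumption about the new API, so check the linked notebook or pysradb download --help for the exact parameter or flag name on your version.

```python
# Rough usage sketch for the multithreaded download; `threads` is an assumed
# parameter name -- see the linked notebook for the exact interface.
from pysradb.sradb import SRAweb

db = SRAweb()
df = db.sra_metadata("SRP251618", detailed=True)
db.download(df=df, skip_confirmation=True, threads=8)  # threads=8: assumed keyword
```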
pysradb version: 0.10.4
Python version: 3.8.3
OS: CentOS Linux
Description
This is a follow-up to #46, where I started downloading a batch of .sra files for the fetched metadata in a pandas DataFrame. I used the example mentioned here in the ipynb. I am running this script as a job on a Sun GridEngine based cluster, and the script ended with an error.
Error
self.retrieve()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/parallel.py", line 921, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
raise self._exception
FileNotFoundError: [Errno 2] No such file or directory: '/projects/NCBI_seqdata/pysradb_downloads/SRP251618/SRX8624823/SRR12100406.sra.part'
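For readers without access to the notebook, the failing script followed roughly this pattern, reconstructed from the traceback above: each joblib worker receives a one-row metadata DataFrame and calls db.download on it. The variable names (jobs, df_list) and the SRAweb instantiation inside the worker are guesses, not the exact code.

```python
# Rough reconstruction of parallel_download_pysradb.py from the traceback above.
from joblib import Parallel, delayed
from pysradb.sradb import SRAweb


def single_download(df_single):
    # Each worker opens its own connection and downloads one run's rows.
    db = SRAweb()
    db.download(df=df_single, skip_confirmation=True)


if __name__ == "__main__":
    jobs = 4  # number of parallel workers (guess)
    db = SRAweb()
    df = db.sra_metadata("SRP251618", detailed=True)
    # Split the metadata into one DataFrame per run accession.
    df_list = [group for _, group in df.groupby("run_accession")]
    Parallel(n_jobs=jobs)(delayed(single_download)(d) for d in df_list)
```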
Discussion from #46
As you mentioned
I have checked that SRR12100406.sra wasn't created yet. I am not sure how to use parallel efficiently in this case. I have two questions:
Thanks, Zohaib