saketkc / pysradb

Package for fetching metadata and downloading data from SRA/ENA/GEO
https://saketkc.github.io/pysradb
BSD 3-Clause "New" or "Revised" License

Error during batch downloading SRA files using SRAweb() #48

Closed anwarMZ closed 4 years ago

anwarMZ commented 4 years ago

Description

This is a follow-up to #46, where I started downloading a batch of SRA files for metadata fetched into a pandas DataFrame. I used the example mentioned here in the ipynb notebook. I am running the script as a job on a Sun Grid Engine based cluster, and the script ended with an error.

Error

    self.retrieve()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/parallel.py", line 921, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
FileNotFoundError: [Errno 2] No such file or directory: '/projects/NCBI_seqdata/pysradb_downloads/SRP251618/SRX8624823/SRR12100406.sra.part'

Discussion from #46

The download method first downloads to a temporary location which in this case is pysradb_downloads/SRP251618/SRX8624823/SRR12100406.sra.part: notice the .part. Downloads are resumable by default. Once a download finishes, the .part extension is removed to mark it complete.

In this case, the error you get most likely arises because the parallel module is confused about whether this particular file has already been downloaded (it thinks it hasn't been, but its download is probably already complete).

You should have SRR12100406.sra. Please feel free to open a new issue otherwise.
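The `.part` convention described above can be sketched in a few lines of Python. This is a minimal illustration and not pysradb's actual code: the helper names `download_state` and `mark_complete` are hypothetical, and only the suffix handling comes from the discussion.

```python
import os
from pathlib import Path


def _part_path(final_path):
    """Temporary path used while a download is in progress (assumed layout)."""
    final_path = Path(final_path)
    return final_path.with_name(final_path.name + ".part")


def download_state(final_path):
    """Classify a target as 'complete', 'partial', or 'missing'.

    Returns (state, bytes_on_disk). Hypothetical helper for illustration.
    """
    final_path = Path(final_path)
    part_path = _part_path(final_path)
    if final_path.exists():
        return "complete", final_path.stat().st_size
    if part_path.exists():
        # A resumed download would continue from this byte offset.
        return "partial", part_path.stat().st_size
    return "missing", 0


def mark_complete(final_path):
    """Drop the '.part' suffix once every byte has been written."""
    final_path = Path(final_path)
    # os.replace raises FileNotFoundError if the .part file is absent.
    os.replace(_part_path(final_path), final_path)
    return final_path
```

Calling `download_state("pysradb_downloads/SRP251618/SRX8624823/SRR12100406.sra")` before rerunning a batch would show whether that run is finished, resumable, or not yet started.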

As you mentioned

The error you get most likely arises because the parallel module is confused about whether this particular file has already been downloaded

I have checked, and SRR12100406.sra has not been created yet. I am not sure how to use parallel efficiently in this case. I have two questions:

  1. If I run the script again, does it check which files are already downloaded and skip them, or resume from where it stopped?
  2. Do you have any advice on using the example mentioned here on a Sun Grid Engine based job queue system?

Thanks, Zohaib

anwarMZ commented 4 years ago

To add to this, running the script again raised a different exception:

Traceback (most recent call last):
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 1332, in getresponse
    response.begin()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 303, in begin
    version, status, reason = self._read_status()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 272, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 724, in urlopen
    retries = retries.increment(
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/util/retry.py", line 403, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/packages/six.py", line 734, in reraise
    raise value.with_traceback(tb)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 1332, in getresponse
    response.begin()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 303, in begin
    version, status, reason = self._read_status()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/http/client.py", line 272, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 431, in _process_worker
    r = call_item()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 285, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 595, in __call__
    return self.func(*args, **kwargs)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/parallel.py", line 252, in __call__
    return [func(*args, **kwargs)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/parallel.py", line 252, in <listcomp>
    return [func(*args, **kwargs)
  File "/projects/test/parallel_download_pysradb.py", line 8, in single_download
    db.download(df=df_single, skip_confirmation=True)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/pysradb/sradb.py", line 1318, in download
    file_sizes = df.apply(get_file_size, axis=1)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/pandas/core/frame.py", line 6878, in apply
    return op.get_result()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/pandas/core/apply.py", line 186, in get_result
    return self.apply_standard()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/pandas/core/apply.py", line 295, in apply_standard
    result = libreduction.compute_reduction(
  File "pandas/_libs/reduction.pyx", line 618, in pandas._libs.reduction.compute_reduction
  File "pandas/_libs/reduction.pyx", line 128, in pandas._libs.reduction.Reducer.get_result
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/pysradb/download.py", line 54, in get_file_size
    return float(requests.head(url).headers["content-length"])
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/api.py", line 104, in head
    return request('head', url, **kwargs)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/projects/test/parallel_download_pysradb.py", line 34, in <module>
    Parallel(n_jobs=jobs)(
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/parallel.py", line 1042, in __call__
    self.retrieve()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/parallel.py", line 921, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/zohaib/.conda/envs/pysradb/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Appreciate your help with this. Thanks, Zohaib

saketkc commented 4 years ago

If I run the script again, does it check which files are already downloaded and skip them, or resume from where it stopped?

It resumes downloads as long as the x.sra.part file exists. In the error you posted, it seems it doesn't. Unfortunately, I am not able to replicate this at my end, so I am not sure how to help.

Do you have any advice on using the example mentioned here on a Sun Grid Engine based job queue system?

You will have to write a custom script that takes an SRP, splits it into subsets, and submits each subset dataframe to pysradb download. For example, if you want to download only one SRR:

pysradb metadata SRR12100406 --detailed | pysradb download (1)

You can get a list of SRRs using:

pysradb srp-to-srr SRP251618 --saveto SRP251618.tsv && cut -f 2 SRP251618.tsv

This list of SRRs can then be passed to (1). You can use snakemake to take care of parallelization and to ensure jobs are rerun if they fail (because of network issues).
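For anyone not using a workflow manager, the same idea can be sketched in plain Python: read the run accessions from the second column of the TSV, build command (1) for each run, and retry a failed job with a growing delay. The helper names and the accession pattern are assumptions made for illustration, not part of pysradb.

```python
import re
import time

# Assumed pattern for run accessions (SRR/ERR/DRR followed by digits).
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def read_run_accessions(tsv_text):
    """Return run accessions from the second tab-separated column.

    Mirrors `cut -f 2`; rows whose second field is not a run accession
    (for example a header row) are skipped.
    """
    runs = []
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and RUN_ACCESSION.match(fields[1].strip()):
            runs.append(fields[1].strip())
    return runs


def build_command(run):
    """Shell pipeline (1) from the comment above, for a single run."""
    return f"pysradb metadata {run} --detailed | pysradb download"


def run_with_retries(task, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call task() until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
```

Each command could then be executed with something like `run_with_retries(lambda: subprocess.run(cmd, shell=True, check=True))`, one cluster job per run or per small group of runs.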

saketkc commented 4 years ago

Any updates on this?

anwarMZ commented 4 years ago

I tried to optimize this outside of nextflow first, but it seems to take significantly more time to download than prefetch from sra-toolkit. I have not had time to look into the nextflow option yet, but will do so at some point. pysradb has worked well for getting the metadata; if the download option is optimized, it may increase the usability significantly.

saketkc commented 4 years ago

Thanks for reporting back. I haven't done any benchmarking against prefetch. Until now I was using a wget-like approach for downloading .sra files.

With the latest commit on the master branch, pysradb supports multithreaded downloads. This works both for downloading .sra files and for directly downloading .fastq.gz files. Feel free to give it a try and let me know if you have any comments.

Example notebook here: https://colab.research.google.com/drive/1rpQ00uUdaa6evB9QjLxOCUzITcckNTjN?usp=sharing
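For readers unfamiliar with the pattern, the general shape of a multithreaded batch download can be sketched with the standard library. This is not pysradb's implementation: `download_all` and the `download_one` callable are hypothetical, and the sketch only shows how a thread pool can record per-run failures instead of aborting the whole batch on the first exception.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(runs, download_one, max_workers=4):
    """Run download_one(run) for every run in a thread pool.

    Returns (succeeded, failed), where failed maps each run to the
    exception it raised, so one bad run does not stop the others.
    """
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which run each future belongs to.
        futures = {pool.submit(download_one, run): run for run in runs}
        for future in as_completed(futures):
            run = futures[future]
            try:
                future.result()
                succeeded.append(run)
            except Exception as exc:
                failed[run] = exc
    return sorted(succeeded), failed
```

The runs left in `failed` can simply be submitted again later, which fits the resumable `.part` behaviour discussed earlier in the thread.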