saketkc / pysradb

Package for fetching metadata and downloading data from SRA/ENA/GEO
https://saketkc.github.io/pysradb
BSD 3-Clause "New" or "Revised" License

Issue with fetching and downloading metadata #116

Closed sreejata-b closed 3 years ago

sreejata-b commented 3 years ago

Description

I am trying to run the code SRP_list = """PRJNA266662 PRJEB15194 PRJNA494717 PRJNA395393 PRJEB15090""" SRP_list = SRP_list.split('\n') to download the metadata associated with all of these BioProjects at once. I am following this notebook: https://colab.research.google.com/github/saketkc/pysradb/blob/master/notebooks/07.Multiple_SRPs.ipynb.

I have made a new pysradb environment as suggested. However, when I run the code above I get the error bash: syntax error near unexpected token `('. So I changed it to SRP_list = """PRJNA266662 PRJEB15194 PRJNA494717 PRJNA395393 PRJEB15090""" SRP_list = SRP_list.split\('\n'\) and now I get the error SRP_list: command not found.

I think I am writing the code wrong, and I would appreciate any help.

I am also having a similar issue when trying to download SRA files for multiple projects. After running pip install joblib (requirement already satisfied, version 1.0.0), typing from joblib import Parallel, delayed at the bash prompt gives -bash: from: command not found.

What I Did

(pysradb) (base) -bash-4.2$ SRP_list = """PRJNA266662 PRJEB15194 PRJNA494717 PRJNA395393 PRJEB15090""" SRP_list = SRP_list.split\('\n'\)
-bash: SRP_list: command not found

(pysradb) (base) -bash-4.2$ pip install joblib
Requirement already satisfied: joblib in /mnt/ufs18/rs-033/ShadeLab/WorkingSpace/Bandopadhyay_WorkingSpace/metaanalysis_doi/pysradb/lib/python3.7/site-packages (1.0.0)
(pysradb) (base) -bash-4.2$ from joblib import Parallel, delayed
-bash: from: command not found

saketkc commented 3 years ago

Hi @sreejata-b,

The notebooks demonstrate the Python API, which is what you would use if you wanted to work inside Python. Your code suggests you are trying to use the command line. In that case, you would do something like this:

$ pysradb metadata --detailed PRJNA266662 PRJEB15194 PRJNA494717 PRJNA395393 PRJEB15090

This would print a lot of data on the terminal, which you can save to a file using --saveto metadata.tsv:

$ pysradb metadata --detailed --saveto metadata.tsv PRJNA266662 PRJEB15194 PRJNA494717 PRJNA395393 PRJEB15090 

If you want to see other examples of how to use the command line to retrieve metadata or download data, I would recommend reading this page. If you would like more control and want to do it inside Python, this page might be more useful. Finally, this page shows how to use pysradb in bash and Python side by side.
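
As a side note on the errors in the original post: statements such as SRP_list = ... or from joblib import ... are Python, so bash cannot run them directly. Below is a minimal sketch of handing the same snippet to a Python interpreter from bash. It assumes python3 is on the PATH of the activated environment, and the function name print_project_ids is made up for illustration. It also uses split() with no argument so the IDs are separated on any whitespace.

```shell
#!/usr/bin/env bash
# Python statements are not bash commands; they must be passed to a
# Python interpreter. Assumption: python3 is available on the PATH.
# print_project_ids is a hypothetical helper name, not part of pysradb.
print_project_ids() {
  # "python3 -" reads the program from stdin, supplied here by a heredoc.
  python3 - <<'EOF'
SRP_list = """PRJNA266662
PRJEB15194
PRJNA494717
PRJNA395393
PRJEB15090"""
# split() with no argument splits on any whitespace, including newlines
for project_id in SRP_list.split():
    print(project_id)
EOF
}
```

Calling print_project_ids prints one project ID per line. The same statements can also be typed directly after starting python3, or run in a notebook.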

joblib is required only if you want to achieve parallel downloads. In most cases the download is I/O bound, so parallelizing might not lead to a speedup.

I hope that helps. Let me know if you have any other questions.

sreejata-b commented 3 years ago

Hi Saket, thanks for that. This is helpful and the code worked. However, I realized that it puts all the metadata in the same metadata.tsv file. Is it possible to make a folder for each BioProject and save its metadata there? That would help me keep things separate. Also, I have many studies (~200). In that case, what is the best way to automate this, so that I can, for instance, make a .csv file with all the BioProject IDs, feed it into the code, and have it fetch them one by one into separate folders? Is that possible?

saketkc commented 3 years ago

You can specify one project instead of multiple projects and use a bash loop to loop over all projects one by one.
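
A rough sketch of such a loop, under a few assumptions: the project IDs are in a plain-text file with one accession per line (the file name bioprojects.txt and the function name fetch_metadata_per_project are made up for illustration), and each project gets a folder named after its ID. The pysradb flags are the same ones used earlier in this thread.

```shell
#!/usr/bin/env bash
# Fetch detailed metadata for each project into its own folder.
# Assumption: the ID file has one accession per line.
# fetch_metadata_per_project is a hypothetical helper name.
fetch_metadata_per_project() {
  local id_file="$1"
  local project_id
  # "|| [ -n ... ]" keeps the last line even without a trailing newline
  while IFS= read -r project_id || [ -n "$project_id" ]; do
    # Strip a trailing carriage return (files saved on Windows)
    project_id="${project_id%$'\r'}"
    # Skip blank lines
    [ -z "$project_id" ] && continue
    mkdir -p "$project_id"
    # One project per call; stdin is redirected from /dev/null so the
    # command cannot consume the rest of the ID list by accident.
    pysradb metadata --detailed --saveto "$project_id/metadata.tsv" \
      "$project_id" < /dev/null \
      || echo "Failed to fetch $project_id" >&2
  done < "$id_file"
}
```

With the IDs saved in bioprojects.txt, running fetch_metadata_per_project bioprojects.txt would leave files such as PRJNA266662/metadata.tsv in the current directory. A failed project is reported on stderr and the loop moves on to the next one.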

Hope that helps!