saketkc / pysradb

Package for fetching metadata and downloading data from SRA/ENA/GEO
https://saketkc.github.io/pysradb
BSD 3-Clause "New" or "Revised" License

Naming of folders with sra files using pysradb #123

Status: Closed (sreejata-b closed this issue 2 years ago)

sreejata-b commented 3 years ago

Description

I am using a for loop in bash to fetch SRA files and metadata for a list of BioProjects (see the code below). It works and fetches the correct data and metadata. I have set up the code so that each folder containing SRA files is named after the BioProject ID. For some BioProjects, however, I get strange folder names instead of the BioProject ID, whereas for others the names are correct. For example, the code below fetched the SRA and metadata files for PRJNA474716 and put them in a folder named PRJNA474716, but the same loop fetched the SRA files for PRJNA271116 into a folder called PTWEZD~F (which should have been PRJNA271116). Is this a bug?

What I Did

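# forloop.txt lists the BioProject IDs to fetch
for item in $(<forloop.txt)
do
    echo "Starting the fetch"
    # Save detailed metadata, then download the runs into a folder named after the ID
    pysradb metadata --detailed --saveto "${item}.csv" "$item"
    pysradb download -t 20 --out-dir "$item" -p "$item"
    # Move the .sra files from <out-dir>/SRP*/SRX*/ up into <out-dir>
    for j in "$item"/SRP*/SRX*
    do
        mv "${j}"/*.sra "$item"
    done
    # Convert each .sra file to gzipped FASTQ; delete it only if the conversion succeeded
    for i in "$item"/*.sra
    do
        parallel-fastq-dump -s "$i" --split-3 --threads 16 -O "$item" --gzip && rm "$i"
    done
    # Remove the now-empty SRP directories
    rm -r "${item}"/SRP*
done
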
saketkc commented 3 years ago

Thanks for your question!

I am unable to reproduce this. I tried it on Colab here: https://colab.research.google.com/drive/1dkkPjkTsAZEddH06AO3ogSQNm5hYqp7f?usp=sharing

There is an important caveat with --out-dir: it only changes the parent directory in which your project files are written. By default, files are written to pysradb_downloads. In your case, for example with --out-dir PRJNA474716, the parent directory becomes PRJNA474716, and pysradb then creates a directory SRP149820 inside it, corresponding to the SRA project ID. This was designed to store everything at the SRP level, but it might not be the best strategy. For now, I would recommend renaming the SRP directory through bash. I will try to address this in a future release.
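
A minimal sketch of the suggested bash rename, for illustration only. The function name rename_srp_dir and its arguments are hypothetical and not part of pysradb; it assumes the parent directory contains a single SRP* directory and refuses to overwrite an existing target.

```shell
# Hypothetical helper (not part of pysradb): rename the SRP* directory
# found inside PARENT to NEW_NAME.
# Usage: rename_srp_dir PARENT NEW_NAME
rename_srp_dir() {
    parent=$1
    new_name=$2
    for srp in "$parent"/SRP*; do
        # Skip the literal pattern when the glob matched nothing
        [ -d "$srp" ] || continue
        # Never clobber an existing directory (also stops at a second SRP* dir)
        if [ -e "$parent/$new_name" ]; then
            echo "refusing to overwrite $parent/$new_name" >&2
            return 1
        fi
        mv -- "$srp" "$parent/$new_name"
    done
}
```

For example, after a download into the default pysradb_downloads directory, `rename_srp_dir pysradb_downloads PRJNA474716` would rename the SRP directory inside it to PRJNA474716.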

sreejata-b commented 3 years ago

Hi Saket,

Thank you for the response, and sorry for the delay on my part. I am actually able to get all the fastq files inside the parent folder called PRJNA474716, so I am able to avoid the SRP folder (see the code above: I move the SRA files to the parent directory, run parallel-fastq-dump, and then remove the SRP folders). This works. However, when running the for loop, the parent directory is sometimes not named PRJNA#####; instead it gets names like PTWEZD~F. I will try to rename the parent directory in bash, thanks!
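
The thread does not establish why the parent directory was misnamed. One thing that may be worth ruling out (an assumption, not something confirmed above) is invisible characters in forloop.txt, such as a carriage return left by Windows-style line endings, since `$(<forloop.txt)` would pass such a character through as part of the directory name. The sketch below uses hypothetical helper names (clean_id, is_plain_id, clean_ids_file) to strip whitespace from each ID and skip anything that is not purely alphanumeric.

```shell
# Hypothetical helpers (not part of pysradb) for sanitising a list of IDs.

# Strip all whitespace, including carriage returns, from an ID.
clean_id() {
    printf '%s' "$1" | tr -d '[:space:]'
}

# Succeed only if the ID is non-empty and contains only letters and digits.
is_plain_id() {
    case "$1" in
        ''|*[!A-Za-z0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Print the cleaned IDs from FILE, one per line; warn about rejected entries.
clean_ids_file() {
    while IFS= read -r line || [ -n "$line" ]; do
        id=$(clean_id "$line")
        if is_plain_id "$id"; then
            printf '%s\n' "$id"
        else
            echo "skipping unexpected entry in $1" >&2
        fi
    done < "$1"
}
```

The loop could then start with `for item in $(clean_ids_file forloop.txt)`, so that each directory name is built only from a cleaned ID.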