saketkc / pysradb

Package for fetching metadata and downloading data from SRA/ENA/GEO
https://saketkc.github.io/pysradb
BSD 3-Clause "New" or "Revised" License

Python API fails to retrieve metadata in some cases #47

Closed sejmodha closed 4 years ago

sejmodha commented 4 years ago

Description

I am trying to extract metadata using the Python API for a number of BioProjects. It works fine for most BioProject accessions, but in some cases passing detailed=True results in a ValueError.

What I Did

db.sra_metadata('PRJNA389455', expand_sample_attributes=True, detailed=True)

This results in:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-54-a36d224bed27> in <module>
----> 1 db.sra_metadata('PRJNA389455', expand_sample_attributes=True, detailed=True)

~/PhDData/software/miniconda3/envs/dataviz/lib/python3.8/site-packages/pysradb/sraweb.py in sra_metadata(self, srp, sample_attribute, detailed, expand_sample_attributes, output_read_lengths, **kwargs)
    506         metadata_df = metadata_df.drop_duplicates()
    507         metadata_df = metadata_df.replace(r"^\s*$", np.nan, regex=True)
--> 508         ena_results = self.fetch_ena_fastq(srp)
    509         if ena_results.shape[0]:
    510             metadata_df = metadata_df.merge(

~/PhDData/software/miniconda3/envs/dataviz/lib/python3.8/site-packages/pysradb/sraweb.py in fetch_ena_fastq(self, srp)
    149                 srr = srr.split("_")[0]
    150                 if ";" in url1:
--> 151                     url1_1, url1_2 = url1.split(";")
    152                     url1_2 = "http://{}".format(url1_2)
    153                     url2_1, url2_2 = url2.split(";")

ValueError: too many values to unpack (expected 2)

saketkc commented 4 years ago

Thanks for the bug report @sejmodha. I can confirm it fails for this project id. I am looking into this now.
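
For readers following along, the failure can be reproduced in isolation. This is a minimal sketch, not pysradb code: it assumes the ENA field for this run holds all three FASTQ URLs (the ones listed later in this thread) joined by ";", and `unpack_pair` is a hypothetical stand-in for the two-value unpacking shown in the traceback.

```python
# Assumption: ENA reports every FASTQ file for a run in one ";"-separated field.
# The three URLs are the ones quoted in this thread for SRR5681734.
fastq_ftp = ";".join(
    [
        "ftp.sra.ebi.ac.uk/vol1/fastq/SRR568/004/SRR5681734/SRR5681734.fastq.gz",
        "ftp.sra.ebi.ac.uk/vol1/fastq/SRR568/004/SRR5681734/SRR5681734_1.fastq.gz",
        "ftp.sra.ebi.ac.uk/vol1/fastq/SRR568/004/SRR5681734/SRR5681734_2.fastq.gz",
    ]
)


def unpack_pair(field):
    """Hypothetical stand-in for the two-value unpacking in the traceback."""
    first, second = field.split(";")  # raises ValueError when there are 3 parts
    return first, second


try:
    unpack_pair(fastq_ftp)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Prints the "too many values to unpack" message raised by the unpacking above.
print(error_message)
```

Unpacking into exactly two names only works when the field contains exactly one ";", which is why most accessions succeed and this one does not.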

saketkc commented 4 years ago

This is a long shot, but do you know why ENA has multiple files for this SRR:

  1. ftp.sra.ebi.ac.uk/vol1/fastq/SRR568/004/SRR5681734/SRR5681734.fastq.gz
  2. ftp.sra.ebi.ac.uk/vol1/fastq/SRR568/004/SRR5681734/SRR5681734_1.fastq.gz
  3. ftp.sra.ebi.ac.uk/vol1/fastq/SRR568/004/SRR5681734/SRR5681734_2.fastq.gz

I thought file 1 would be a merged version of files 2 and 3 (which would not make sense, but still), but that is not the case:

> zcat SRR5681734_1.fastq.gz | head
@SRR5681734.1 1/1
CTAGCGGATGAGCTGTGGATAGGGGTGAAAGGCTAAACAAACTTGGAAATAGCTGGTTCTCTCCGAAAACTATTTAGGTAGTGCCTCAAGT
+
GHFGJJJIJJIIHIIHGIFHGIJJJ?FHHIIJJJJJIJJJIIJJIIHHGHHHE@DFDFEEEEEEDDDDDCD@CEEEDDCCDDCCCCDDDDE
@SRR5681734.2 2/1
TGTCCGGGACGATAATGACGGTACCGGAAGAATAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGGGCTAGCGTTGCT
+
HBADAGGHIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIHGFDDDDDDCDCD?BDDDCDCDDDBBBB>@BEEDDDDDDDDDDDDDDDDDDDD
@SRR5681734.3 3/1
AATGATGTCGATGCGGGGCAGCAACTCTTCGATGGTCAGGCCCAGCTTGTCTCCGCCGGCCTCTACCGTGTTGAGGAATCCGATCGACAGG
> zcat SRR5681734.fastq.gz | head
@SRR5681734.52646 52646/1
AGAGCTTGCGACGTCGGGCTTGATCCCGGTGGCCGTAATAACGGAGAAACCAATACAGGTTCGAGAGACGATCTGCCCAGGGTAGA
+
FFHHIIGBDHHIEF@@GGIEGEFGEGIIIHFDEECAAB@DCCBBBBBBCCBBBC@C@CCCCCBBCB@@??-8A<((4?CBB53>?@
@SRR5681734.52647 52647/1
CGCGCAGGCTAAAGCGCTTTTTGGGGTGCTTTTTGAGGTGCTCGTAAATCCGTTGTTCTAGCATGATGTCTTCAGAACGAGGCGCTCCTCG
+
FHHFIG?FGC@FG?CGGIIGEDHI>BFBHIIGIF:;?C.6@AA1>?BBECAB?:??5>@3>@:@C:>@B>CCCCCCCC5)9<>B@B9@??B
@SRR5681734.52648 52648/1
CTTACCTCCAGAGCGAAAGCAGCCGCCATCTGACCTCACCCAGCCGCCTCCGCAAATACGCTGCGGAAATTGAATGTATCAAATCCGCCGA
> zcat SRR5681734_2.fastq.gz | head
@SRR5681734.1 1/2
TCTCCCAAGCTGTACTCATCGGTATTCGGAGTTTGCAATGGTTTGGTAAGTCGCCATGACCCCCTAGCCATAACAGTGCTCTACCCCCGAT
+
HHHHJJJIJJJJJBHJJIJJJJGHIIJJJDGIIIIJGEHIIGHIJJHIJJIGGIGGHHEHFFFFDDEDDDDDDDCDDDDDCDCDDDDBDBD
@SRR5681734.2 2/2
GTTGGCCGCCTTCGCCACTGGTGTTCTTGCGAATATCTACGAATTTCACCTCTACACTCGCAGTTCCACCAACCTCTACCAAACTCAAGCC
+
HHHHJJJIJJJJJJJJIJJJGBGIIJJIIIJJJIIJGIJJJJIHHHHGHFFFDFFCE;@ABDDDDDDEDD?BDDDD@CCDCCBDCDDDCDD
@SRR5681734.3 3/2
AACACCATCTCGGCCCAAACGGCCATGAACTCCATCGACATCGATGTCGGGGGGACCTTTACCGATCTCGTGCTGACCCTGGACGGGGAGC

saketkc commented 4 years ago

The layout is single-end on both SRA and ENA, yet ENA also has a paired-end version of the FASTQ files.

https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR5681734
https://www.ebi.ac.uk/ena/browser/view/SRR5681734

saketkc commented 4 years ago

This has been fixed in master. Thanks once again for reporting. I have also contacted ENA.

https://colab.research.google.com/drive/1THLcuzmW7ESWQbw2hnmb4tHxCGdrJy8n?usp=sharing
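
One way to tolerate a run with more than two files is to split the field and group the URLs by suffix instead of unpacking into a fixed number of names. The sketch below only illustrates that idea and is not the change that went into master; `group_fastq_urls` and its return keys are hypothetical names, and the ";"-separated input is an assumption based on the traceback.

```python
import re


def group_fastq_urls(field):
    """Group ';'-separated FASTQ URLs by mate suffix.

    Hypothetical helper, not part of pysradb. Files ending in _1.fastq.gz or
    _2.fastq.gz are treated as mates; anything else is treated as unpaired.
    """
    groups = {"unpaired": [], "mate1": [], "mate2": []}
    for part in field.split(";"):
        url = part.strip()
        if not url:
            continue  # skip empty pieces, e.g. from an empty field
        match = re.search(r"_([12])\.fastq\.gz$", url)
        if match is None:
            groups["unpaired"].append(url)
        else:
            groups["mate" + match.group(1)].append(url)
    return groups


# Assumption: the three URLs from this thread arrive joined by ";".
fastq_ftp = ";".join(
    [
        "ftp.sra.ebi.ac.uk/vol1/fastq/SRR568/004/SRR5681734/SRR5681734.fastq.gz",
        "ftp.sra.ebi.ac.uk/vol1/fastq/SRR568/004/SRR5681734/SRR5681734_1.fastq.gz",
        "ftp.sra.ebi.ac.uk/vol1/fastq/SRR568/004/SRR5681734/SRR5681734_2.fastq.gz",
    ]
)
groups = group_fastq_urls(fastq_ftp)
for kind, urls in groups.items():
    print(kind, len(urls))
```

Grouping by suffix handles one, two, or three files per run without raising, and leaves it to the caller to decide what to do when a run marked single-end also has mate files.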

sejmodha commented 4 years ago

Thanks for your help!