saketkc / pysradb

Package for fetching metadata and downloading data from SRA/ENA/GEO
https://saketkc.github.io/pysradb
BSD 3-Clause "New" or "Revised" License

Skipped SRX entries? #125

Closed: romaingroux closed this issue 3 years ago

romaingroux commented 3 years ago

Hello,

First of all, I would like to thank you for this package, which does a really good job as far as I have used it.

Second, I don't know whether what I am describing here is i) a pysradb bug, ii) something coming from the SRA database itself, or iii) a misunderstanding on my side about the SRA (in which case I apologize in advance).

Describe the bug: I am trying to obtain all the metadata for SRP043609. This project contains two SRX entries: SRX638310 and SRX627421 (according to https://www.ncbi.nlm.nih.gov/sra/?term=SRP043609).

I expect to retrieve metadata for both SRX entries in the tables returned by the various pysradb commands.

I can retrieve all the expected information for SRX638310. However, I cannot get anything for SRX627421.

To Reproduce: I ran all possible commands:

pysradb srp-to-srx SRP043609
pysradb srp-to-srs SRP043609
pysradb srp-to-srr SRP043609
...

None of them returns a single line containing SRX627421. The same is true if you take a run (SRR) that belongs to SRX627421, say SRR1947646, and run:

pysradb srr-to-srp SRR1947646
pysradb srr-to-srs SRR1947646
...
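
For anyone who wants to repeat the check from Python, here is a minimal sketch of what I am testing. The helper `missing_experiments` is hypothetical (it is not part of pysradb), the rows are stand-ins for the table I get back, and the column name `experiment_accession` is an assumption about how the metadata table is labelled.

```python
def missing_experiments(rows, expected, column="experiment_accession"):
    """Return the expected SRX accessions that never appear in `rows`."""
    # Collect every experiment accession present in the returned table.
    found = {row.get(column) for row in rows}
    # Anything expected but not found is a skipped entry.
    return sorted(set(expected) - found)


# Stand-in rows mimicking the result for SRP043609 (other columns omitted).
rows = [
    {"study_accession": "SRP043609", "experiment_accession": "SRX638310"},
]
expected = ["SRX638310", "SRX627421"]

print(missing_experiments(rows, expected))
```

With pysradb installed, the rows could presumably come from something like `SRAweb().sra_metadata("SRP043609").to_dict("records")`, but treat that call as an assumption rather than a confirmed recipe.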

saketkc commented 3 years ago

Hi @romaingroux, thanks for the bug report. I can confirm this is a bug, given that SRA's webpage does return SRX627421, though I am not sure of its origin. Give me a couple of days to address this. Apologies.

romaingroux commented 3 years ago

No problem! Thank you for everything :)

saketkc commented 3 years ago

I looked into this. It seems the 'runs' slot for SRX627421 comes back empty, even though the summary reports total_runs="110":

<Summary><Title>NA12878 P5-C3 Sequencing</Title><Platform instrument_model="PacBio RS">PACBIO_SMRT</Platform><Statistics total_runs="110" total_spots="24522300" total_bases="181073101227" total_size="630384719154" load_done="true" cluster_name="public"/></Summary><Submitter acc="SRA172711" center_name="PacBio" contact_name="Ali Bashir" lab_name=""/><Experiment acc="SRX627421" ver="6" status="public" name="NA12878 P5-C3 Sequencing"/><Study acc="SRP043609" name="Homo sapiens Genome sequencing and assembly"/><Organism taxid="9606" ScientificName="Homo sapiens"/><Sample acc="SRS647947" name=""/><Instrument PACBIO_SMRT="PacBio RS"/><Library_descriptor><LIBRARY_NAME>PB-C3</LIBRARY_NAME><LIBRARY_STRATEGY>WGS</LIBRARY_STRATEGY><LIBRARY_SOURCE>GENOMIC</LIBRARY_SOURCE><LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION><LIBRARY_LAYOUT>                 <SINGLE/>               </LIBRARY_LAYOUT></Library_descriptor><Bioproject>PRJNA253696</Bioproject><Biosample>SAMN02887092</Biosample>  ', 'runs': '',

I am not quite sure why that is the case, but on a second look it does not seem to be a pysradb bug.
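
To make the mismatch concrete, here is a small standard-library sketch. It is not how pysradb parses the response. The XML is a trimmed copy of the summary above; the `expxml` key name, the wrapping `<root>` element, the `describe` helper, and the idea that each run would appear as a `<Run .../>` element inside 'runs' are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the record shown above; 'runs' is empty, as in the response.
# The key name "expxml" is an assumption, only 'runs' is visible in the dump.
record = {
    "expxml": (
        '<Summary><Title>NA12878 P5-C3 Sequencing</Title>'
        '<Statistics total_runs="110" total_spots="24522300"/></Summary>'
        '<Experiment acc="SRX627421" ver="6" status="public"/>'
        '<Study acc="SRP043609"/>'
    ),
    "runs": "",
}


def describe(record):
    """Return (experiment accession, declared run count, listed run count)."""
    # The fragment has several top-level elements, so wrap it in a dummy root.
    root = ET.fromstring("<root>" + record["expxml"] + "</root>")
    acc = root.find("Experiment").get("acc")
    declared = int(root.find("Summary/Statistics").get("total_runs"))
    # Assumption: runs, when present, are serialized as <Run .../> elements.
    runs_root = ET.fromstring("<root>" + record["runs"] + "</root>")
    listed = len(runs_root.findall("Run"))
    return acc, declared, listed


print(describe(record))
```

The summary declares 110 runs while the 'runs' slot lists none, so any code that builds its table from 'runs' would have nothing to emit for this experiment.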

romaingroux commented 3 years ago

Ok no problem. You can close the issue I guess?

Thanks for having a look :)

saketkc commented 3 years ago

Thanks!