saketkc / pysradb

Package for fetching metadata and downloading data from SRA/ENA/GEO
https://saketkc.github.io/pysradb
BSD 3-Clause "New" or "Revised" License

SRAdb and SRAweb don't give the same results #15

Closed dweemx closed 4 years ago

dweemx commented 4 years ago

Hi,

First, I'd like to thank you for this very useful package. I'd love to use SRAweb, but unfortunately its results seem to be wrong compared to SRAdb.

Here are my specs,

Description

I'm trying to get the metadata from a SRA project ID (e.g.: SRP125768).

What I Did

With local SQL db,

from pysradb import SRAdb
db = SRAdb('SRAmetadb.sqlite')
df1 = db.sra_metadata('SRP125768', detailed=True, expand_sample_attributes=True, sample_attribute=True)

W/o local SQL db,

from pysradb.sraweb import SRAweb
db = SRAweb()
df2 = db.sra_metadata(srp="SRP125768", detailed=True, expand_sample_attributes=True, sample_attribute=True)

I haven't checked all the entries, but there is definitely something wrong with df2: it has duplicated rows and missing rows.
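One way to pin down the duplicated and missing rows is to compare the run identifiers of the two results directly. The following is only a minimal sketch: `compare_runs` is a hypothetical helper (not part of pysradb), and it assumes both DataFrames expose a `run_accession` column whose values can be passed in as plain iterables.

```python
from collections import Counter


def compare_runs(local_runs, web_runs):
    """Report how two collections of run accessions differ.

    Hypothetical helper for illustration; not part of pysradb.
    """
    local_set = set(local_runs)
    web_counts = Counter(web_runs)  # run id -> number of rows carrying it
    web_set = set(web_counts)
    return {
        # Runs present locally but absent from the web result.
        "missing_from_web": sorted(local_set - web_set),
        # Runs present in the web result but absent locally.
        "extra_in_web": sorted(web_set - local_set),
        # Runs that occur on more than one row of the web result.
        "duplicated_in_web": sorted(r for r, n in web_counts.items() if n > 1),
    }


# Toy identifiers standing in for df1["run_accession"] and df2["run_accession"].
local_ids = ["SRR0000001", "SRR0000002", "SRR0000003"]
web_ids = ["SRR0000001", "SRR0000001", "SRR0000003"]
report = compare_runs(local_ids, web_ids)
print(report)
```

With the real data, the call would presumably look like `compare_runs(df1["run_accession"], df2["run_accession"])`, assuming that column name is present in both frames.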

I'd be happy to get your feedback and your fix for this :)

saketkc commented 4 years ago

Thanks for the bug report @dweemx. From a first look, I can confirm this is indeed a bug. I will get back to you with a possible solution/explanation shortly.

saketkc commented 4 years ago

Hi @dweemx, it looks like this bug originates in NCBI's search interface. Looking up SRP125768 on https://www.ncbi.nlm.nih.gov/sra shows only 128 hits, while the total should clearly be 136 (the total number of runs). These are the missing run IDs:

'SRR6327103', 'SRR6327106', 'SRR6327114', 'SRR6327120', 'SRR6327118', 'SRR6327122', 'SRR6327135', 'SRR6327116'

I will have to look for a way to ensure such runs are not missed. Thanks once again for reporting this.

dweemx commented 4 years ago

Hi, I contacted the SRA team and they told me that there was an issue with the SRA file pairing system when the data was ported from GEO to the SRA database. This issue should be fixed now.

However, some samples are still missing when I'm using SRAweb: 'SRR6327106', 'SRR6327114', 'SRR6327120', 'SRR6327118', 'SRR6327122', 'SRR6327116'

saketkc commented 4 years ago

Thanks for the update @dweemx. It seems https://www.ncbi.nlm.nih.gov/sra/?term=SRP125768 still returns only 128 results. I will have time to work on a fix in the coming few weeks. Thanks for your patience and sorry for the trouble this has been causing you.
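For anyone who wants to check the hit count programmatically rather than through the web page, a rough sketch using NCBI's E-utilities `esearch` endpoint is shown below. This is an assumption about a convenient way to reproduce the count, not a description of how pysradb queries NCBI; `build_esearch_url` and `hit_count` are hypothetical helper names, and the count returned is the number of Entrez SRA records, which may not correspond one-to-one with runs.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=0):
    """Build an esearch URL against the SRA database (retmax=0 asks for the count only)."""
    params = {"db": "sra", "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def hit_count(term):
    """Return the number of SRA records matching `term` (requires network access)."""
    with urlopen(build_esearch_url(term)) as response:
        payload = json.load(response)
    return int(payload["esearchresult"]["count"])


url = build_esearch_url("SRP125768")
print(url)
# hit_count("SRP125768") would perform the actual request.
```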

saketkc commented 4 years ago

Hi @dweemx, thanks for your patience. I was finally able to fix this in v0.9.9. See this notebook for an example with this ID: https://colab.research.google.com/drive/1C60V-jkcNZiaCra_V5iEyFs318jgVoUR

The web mode's default --detailed output gives all the metadata you see on SRA's run table.

> pysradb metadata SRP125768 --detailed | head
study_accession experiment_accession experiment_title                                                                                    experiment_desc                                                                                     organism_taxid  organism_name            library_strategy library_source  library_selection sample_accession sample_title instrument           total_spots total_size   run_accession run_total_spots run_total_bases run_alias      experiment_alias source_name                                      age        genotype/variation          tissue genotype 
 SRP125768       SRX4084637           GSM3142622: w1118_1d_WholeBrain_Unstranded_RNA-seq; Drosophila melanogaster; RNA-Seq                GSM3142622: w1118_1d_WholeBrain_Unstranded_RNA-seq; Drosophila melanogaster; RNA-Seq                7227            Drosophila melanogaster  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS3301695                    NextSeq 500          3552575     79516196     SRR7166639    3552575         176271295       GSM3142622_r1  GSM3142622       w1118_1d_WholeBrain_Unstranded_RNA-seq           1 Day      W[1118]                     brain  NaN     
 SRP125768       SRX4084636           GSM3142621: w1118_1d_WholeBrain_Stranded_RNA-seq; Drosophila melanogaster; RNA-Seq                  GSM3142621: w1118_1d_WholeBrain_Stranded_RNA-seq; Drosophila melanogaster; RNA-Seq                  7227            Drosophila melanogaster  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS3301693                    NextSeq 500          4513696     100655283    SRR7166638    4513696         220693988       GSM3142621_r1  GSM3142621       w1118_1d_WholeBrain_Stranded_RNA-seq             1 Day      W[1118]                     brain  NaN     
 SRP125768       SRX4084635           GSM3142620: DGRP-551_1d_WholeBrain_Unstranded_RNA-seq; Drosophila melanogaster; RNA-Seq             GSM3142620: DGRP-551_1d_WholeBrain_Unstranded_RNA-seq; Drosophila melanogaster; RNA-Seq             7227            Drosophila melanogaster  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS3301694                    NextSeq 500          19374029    433332434    SRR7166637    19374029        961111968       GSM3142620_r1  GSM3142620       DGRP-551_1d_WholeBrain_Unstranded_RNA-seq        1 Day      DGRP-551                    brain  NaN     
 SRP125768       SRX4084634           GSM3142619: DGRP-551_1d_WholeBrain_Stranded_RNA-seq; Drosophila melanogaster; RNA-Seq               GSM3142619: DGRP-551_1d_WholeBrain_Stranded_RNA-seq; Drosophila melanogaster; RNA-Seq               7227            Drosophila melanogaster  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS3301692                    NextSeq 500          2936449     65552609     SRR7166636    2936449         145074237       GSM3142619_r1  GSM3142619       DGRP-551_1d_WholeBrain_Stranded_RNA-seq          1 Day      DGRP-551                    brain  NaN     
 SRP125768       SRX4084633           GSM3142618: DGRP-551_1d_WholeBrainNuclei_Unstranded_Rep2_RNA-seq; Drosophila melanogaster; RNA-Seq  GSM3142618: DGRP-551_1d_WholeBrainNuclei_Unstranded_Rep2_RNA-seq; Drosophila melanogaster; RNA-Seq  7227            Drosophila melanogaster  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS3301691                    NextSeq 500          24342212    458751469    SRR7166635    24342212        1207043823      GSM3142618_r1  GSM3142618       DGRP-551_1d_WholeBrainNuclei_Unstranded_RNA-seq  1 Day      DGRP-551                    brain  NaN     
 SRP125768       SRX4084632           GSM3142617: DGRP-551_1d_WholeBrainNuclei_Unstranded_Rep1_RNA-seq; Drosophila melanogaster; RNA-Seq  GSM3142617: DGRP-551_1d_WholeBrainNuclei_Unstranded_Rep1_RNA-seq; Drosophila melanogaster; RNA-Seq  7227            Drosophila melanogaster  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS3301696                    Illumina HiSeq 4000  7398351     236600904    SRR7166634    7398351         551705108       GSM3142617_r1  GSM3142617       DGRP-551_1d_WholeBrainNuclei_Unstranded_RNA-seq  1 Day      DGRP-551                    brain  NaN     
 SRP125768       SRX4084631           GSM3142616: Adapted_SMART_seq2_R23E10_Cell_9; Drosophila melanogaster; RNA-Seq                      GSM3142616: Adapted_SMART_seq2_R23E10_Cell_9; Drosophila melanogaster; RNA-Seq                      7227            Drosophila melanogaster  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS3301688                    NextSeq 500          267487      6409898      SRR7166633    267487          13266487        GSM3142616_r1  GSM3142616       Adapted_SMART_seq2_R23E10_Cell                   0-7 Days   R23E10-Gal4 x UAS-CD8::GFP  brain  NaN     
 SRP125768       SRX4084630           GSM3142615: Adapted_SMART_seq2_R23E10_Cell_8; Drosophila melanogaster; RNA-Seq                      GSM3142615: Adapted_SMART_seq2_R23E10_Cell_8; Drosophila melanogaster; RNA-Seq                      7227            Drosophila melanogaster  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS3301690                    NextSeq 500          192550      4678011      SRR7166632    192550          9548043         GSM3142615_r1  GSM3142615       Adapted_SMART_seq2_R23E10_Cell                   0-7 Days   R23E10-Gal4 x UAS-CD8::GFP  brain  NaN     
 SRP125768       SRX4084629           GSM3142614: Adapted_SMART_seq2_R23E10_Cell_7; Drosophila melanogaster; RNA-Seq                      GSM3142614: Adapted_SMART_seq2_R23E10_Cell_7; Drosophila melanogaster; RNA-Seq                      7227            Drosophila melanogaster  RNA-Seq          TRANSCRIPTOMIC  cDNA              SRS3301689                    NextSeq 500          199223      4833365      SRR7166631    199223          9885888         GSM3142614_r1  GSM3142614       Adapted_SMART_seq2_R23E10_Cell                   0-7 Days   R23E10-Gal4 x UAS-CD8::GFP  brain  NaN   

Please let me know if you run into any issues.