saketkc / pysradb

Package for fetching metadata and downloading data from SRA/ENA/GEO
https://saketkc.github.io/pysradb
BSD 3-Clause "New" or "Revised" License

[BUG] gse-to-srp not producing results #186

Closed forrest1988 closed 1 year ago

forrest1988 commented 1 year ago

Hi,

Every now and then, when I am given a "GSE" accession number and want to convert it to an "SRP", I get empty output. For example, this happened to me today for GSE209835:

(pysradb) [xxx@xxx]$ pysradb gse-to-srp GSE209835
(pysradb) [xxx@xxx]$ 

while usually, this command should produce output like this:

(pysradb) [xxx@xxx]$ pysradb gse-to-srp GSE168880
study_alias     study_accession
GSE168880 SRP310566
(pysradb) [xxx@xxx]$ 

The problem is that when only one GSE is converted at a time, it is immediately obvious that the conversion failed. However, when, say, dozens of GSE numbers are converted at once, the failed one is simply absent from the output and can be hard to spot (one would have to check whether the number of accessions in the query matches the number in the output and, if not, work out which one is missing). For example, this command shows the behavior:

(pysradb) [xxx@xxx]$ pysradb gse-to-srp GSE168880 GSE209835
study_alias     study_accession
GSE168880 SRP310566
(pysradb) [xxx@xxx]$ 
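The manual check described above (comparing the queried accessions against the ones that actually came back) can be sketched in a few lines of Python. This is only an illustration that parses the text printed by the CLI; `find_missing_accessions` is a hypothetical helper name, not part of pysradb, and it assumes the whitespace-separated two-column layout shown in the transcript.

```python
def find_missing_accessions(queried, cli_output):
    """Return the queried GSE IDs that have no row in the gse-to-srp output.

    Hypothetical helper: assumes each data row looks like
    "<study_alias> <study_accession>", as in the transcript above.
    """
    found = set()
    for line in cli_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 2 or fields[0] == "study_alias":
            continue
        found.add(fields[0])
    # Preserve the order in which the accessions were queried.
    return [gse for gse in queried if gse not in found]


output = """study_alias     study_accession
GSE168880 SRP310566
"""
print(find_missing_accessions(["GSE168880", "GSE209835"], output))
# prints ['GSE209835']
```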

At the same time, it's not that the metadata are missing from the database altogether: if I look up the SRP number, which for GSE209835 is SRP388275, I can retrieve other types of metadata, e.g.:

(pysradb) [xxx@xxx]$ pysradb search --query 'SRP388275'
100%|██████| 14/14 [00:00<00:00, 15.85it/s]
study_accession experiment_accession    experiment_title        sample_taxon_id sample_scientific_name  experiment_library_strategy     experiment_library_source       experiment_library_selection    sample_accession  sample_alias    experiment_instrument_model     pool_member_spots       run_1_size      run_1_accession run_1_total_spots       run_1_total_bases
SRP388275 SRX16679945 GSM6401820: Reg_C3_HA_U15; Homo sapiens; ATAC-seq 9606 Homo sapiens ATAC-seq GENOMIC other SRS14311220 GSM6401820 NextSeq 2000    80552730 3248389842 SRR20656945 80552730 10471854900
SRP388275 SRX16679944 GSM6401819: Reg_C2_HA_U11; Homo sapiens; ATAC-seq 9606 Homo sapiens ATAC-seq GENOMIC other SRS14311219 GSM6401819 NextSeq 2000    79081253 3179004752 SRR20656946 79081253 10280562890
SRP388275 SRX16679943   GSM6401818: Reg_HA_U7; Homo sapiens; ATAC-seq 9606 Homo sapiens ATAC-seq GENOMIC other SRS14311218 GSM6401818 NextSeq 2000      90884401 3570075486 SRR20656947 90884401 11814972130
...

Moreover, if I instead run pysradb search --query 'SRP388275' --detailed | grep "GSE209835", I can see that GSE209835 is "hidden" in the metadata (at least in the extended fields).

Next steps

If you have any idea how this could be fixed or fine-tuned, that would be awesome! If it cannot be fixed "easily", e.g. because of how the reference database is constructed, then I would suggest updating the display of the results to something like this:

(pysradb) [xxx@xxx]$ pysradb gse-to-srp GSE168880 GSE209835
study_alias     study_accession
GSE168880 SRP310566
GSE209835 NA
(pysradb) [xxx@xxx]$ 

This output format would simplify post-processing and reduce the risk of datasets being silently omitted from bulk analyses.
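Until something like this is available in the tool itself, the NA-padded table can be approximated as a post-processing step. The sketch below is an assumption-laden workaround, not pysradb's behavior: `pad_with_na` is a hypothetical function that re-reads the two-column text printed by gse-to-srp and emits one row per queried GSE, using a placeholder where no SRP was returned.

```python
def pad_with_na(queried, cli_output, placeholder="NA"):
    """Return (study_alias, study_accession) rows covering every queried GSE.

    Hypothetical helper: assumes whitespace-separated rows with the GSE in
    the first column and the SRP in the second.
    """
    mapping = {}
    for line in cli_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 2 or fields[0] == "study_alias":
            continue
        mapping.setdefault(fields[0], []).append(fields[1])

    rows = []
    for gse in queried:
        # Accessions with no result get a single placeholder row.
        for srp in mapping.get(gse, [placeholder]):
            rows.append((gse, srp))
    return rows


output = """study_alias     study_accession
GSE168880 SRP310566
"""
for alias, accession in pad_with_na(["GSE168880", "GSE209835"], output):
    print(alias, accession)
```

Keeping the rows in query order means the result can be joined back to the original accession list line by line.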


All the best, Wojciech

PS. Once again, thank you for your amazing tool; it elevates all kinds of work related to reprocessing public data to another level!

saketkc commented 1 year ago

Thanks for reporting this! The latest commit (above) in develop should address this for the srp-to-gse command. I will have to see if any other commands run into issues because of such edge cases.