saketkc / pysradb

Package for fetching metadata and downloading data from SRA/ENA/GEO
https://saketkc.github.io/pysradb
BSD 3-Clause "New" or "Revised" License

Not all attributes are being exported #210

Open returnOfTheYeti opened 9 months ago

returnOfTheYeti commented 9 months ago

In the SRA db, in the run info, as well as in the XML, one can see variables such as "GISAID_Accession" and "SARS-CoV-2_diagnostic_pcr_Ct_value_1" for certain samples (below).

https://www.ncbi.nlm.nih.gov/sra/?term=SRR15168846

But when I extract the detailed data for this sample using:

pysradb metadata SRR15168847 --detailed | head

the attributes mentioned above are missing from the pysradb output. Is there any way to retrieve ALL of the metadata, or at least specific attributes that are not included in the "detailed" output?

I downloaded pysradb on Feb 12, 2024 via conda.

saketkc commented 9 months ago

Thank you for the suggestion! While it will take me a while to get to this, we always encourage PRs, especially since you already know what is going on in the XML. Let me know if you need any help.

returnOfTheYeti commented 9 months ago

Hello, thanks so much! It is a great tool, given NCBI's limitations on retrieving SRA metadata. One thing I was wondering is whether there is a way to select certain fields. In your docs you demonstrate how to filter using grep, but is there a way to select a specific column of the metadata? What if I just wanted the "run_accession" and the "total_size"? Thanks, RF


arcones commented 9 months ago

If you don't want to use grep, you could write a Python script to select the fields you like, for example:

from pysradb import SRAweb

[...]
srp = 'SRP481544'
raw_pysradb_data_frame = SRAweb().srp_to_srr(srp)
srrs = list(raw_pysradb_data_frame['run_accession'])  # Here you can put the field you want

saketkc commented 9 months ago

hi @returnOfTheYeti, I would go with @arcones' recommendation here.
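
Along the same lines, here is a minimal sketch of column selection that does not need pandas. It assumes the metadata table has already been saved as tab-separated text with a header row (for example, a file written by the CLI); the function name `select_columns` and the sample rows are hypothetical placeholders, not part of pysradb.

```python
import csv
import io


def select_columns(tsv_text, columns):
    """Return one dict per row, keeping only the requested columns.

    Hypothetical helper: assumes tab-separated text with a header row.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Fail early if a requested column is not in the header.
    missing = [c for c in columns if c not in (reader.fieldnames or [])]
    if missing:
        raise KeyError(f"columns not found in metadata: {missing}")
    return [{c: row[c] for c in columns} for row in reader]


# Hand-written stand-in for a saved metadata table; all values are placeholders.
example_tsv = (
    "study_accession\trun_accession\ttotal_size\n"
    "SRP0000001\tSRR0000001\t12345\n"
    "SRP0000001\tSRR0000002\t67890\n"
)

selected = select_columns(example_tsv, ["run_accession", "total_size"])
for row in selected:
    print(row["run_accession"], row["total_size"], sep="\t")
```

With a real file, the text would come from reading that file instead of `example_tsv`. Requesting a column that is absent raises `KeyError` rather than silently returning empty values, which makes missing fields easy to spot.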

returnOfTheYeti commented 9 months ago

Hello, thank you for your response. My initial problem was that multiple fields are missing from the output compared to what is actually listed on SRA. For example, one important field, "gisaid_accession", is listed at the link below:

https://www.ncbi.nlm.nih.gov/Traces/study/?query_key=2&WebEnv=MCID_65df8dedd6fce424fe3cff83&o=acc_s%3Aa

But the headers in pysradb, obtained with list(raw_pysradb_data_frame), are:

['study_accession', 'run_accession', 'study_title', 'experiment_accession', 'experiment_title', 'experiment_desc', 'organism_taxid', 'organism_name', 'library_name', 'library_strategy', 'library_source', 'library_selection', 'library_layout', 'sample_accession', 'sample_title', 'biosample', 'bioproject', 'instrument', 'instrument_model', 'instrument_model_desc', 'total_spots', 'total_size', 'run_total_spots', 'run_total_bases']

This is one missing field, but other samples are missing multiple fields. My question remains: how do I go about retrieving the "gisaid_accession" from a sample? Is it not possible? Thanks again.


saketkc commented 9 months ago

These are not standard fields defined for every project, and hence they are currently not supported.
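
Until such fields are supported, one possible workaround is to read them from the SRA XML yourself. The sketch below is not pysradb functionality. It assumes you already have the sample XML as a string and that custom attributes appear as TAG/VALUE pairs inside SAMPLE_ATTRIBUTE elements; the function name `sample_attributes` is hypothetical, and the XML fragment and its values are hand-written placeholders rather than real records.

```python
import xml.etree.ElementTree as ET


def sample_attributes(xml_text):
    """Collect TAG/VALUE pairs from every SAMPLE_ATTRIBUTE element.

    Hypothetical helper: assumes the SRA-style layout shown in example_xml.
    """
    root = ET.fromstring(xml_text.strip())
    attrs = {}
    for attr in root.iter("SAMPLE_ATTRIBUTE"):
        tag = attr.findtext("TAG")
        if tag is not None:
            # A missing VALUE element is stored as an empty string.
            attrs[tag.strip()] = (attr.findtext("VALUE") or "").strip()
    return attrs


# Hand-written fragment for illustration; accession and values are placeholders.
example_xml = """
<SAMPLE accession="SRS0000000">
  <SAMPLE_ATTRIBUTES>
    <SAMPLE_ATTRIBUTE>
      <TAG>GISAID_Accession</TAG>
      <VALUE>EPI_ISL_0000000</VALUE>
    </SAMPLE_ATTRIBUTE>
    <SAMPLE_ATTRIBUTE>
      <TAG>SARS-CoV-2_diagnostic_pcr_Ct_value_1</TAG>
      <VALUE>0</VALUE>
    </SAMPLE_ATTRIBUTE>
  </SAMPLE_ATTRIBUTES>
</SAMPLE>
"""

attrs = sample_attributes(example_xml)
print(attrs.get("GISAID_Accession"))
```

Because the set of tags differs from project to project, returning a plain dictionary and looking keys up with `.get()` avoids assuming that any particular attribute exists for a given sample.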