Open returnOfTheYeti opened 9 months ago
Thank you for the suggestion! While it will take me a while to get to this, we always encourage PRs especially since you already know what is going on in the xml! Let me know if you need any help.
Hello, thanks so much! It is a great tool, given NCBI's limitations on retrieving SRA metadata. One thing I was wondering is whether there is a way to select certain fields. In your docs you demonstrate how to filter using grep, but is there a way to select a specific column of the metadata? What if I just wanted the "run_accession" and the "total_size"? Thanks, RF
If you don't want to use grep, you could write a Python script that selects the fields you want, for example:
from pysradb import SRAweb
[...]
srp = 'SRP481544'
raw_pysradb_data_frame = SRAweb().srp_to_srr(srp)
srrs = list(raw_pysradb_data_frame['run_accession']) # Here you can put the field you want
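To pick several fields at once rather than one, the same idea extends to a list of column names. The following is a minimal sketch, assuming the object returned by pysradb is a regular pandas DataFrame (as the snippet above suggests). The helper name `select_fields` is hypothetical, and the small table is a made-up placeholder standing in for the real query result so the example runs without network access.

```python
import pandas as pd


def select_fields(df, fields):
    """Return only the requested columns, in the requested order.

    Hypothetical helper, not part of pysradb.
    """
    missing = [name for name in fields if name not in df.columns]
    if missing:
        # Fail loudly instead of silently returning fewer columns.
        raise KeyError(f"columns not present in the metadata: {missing}")
    return df.loc[:, fields]


# Placeholder table standing in for SRAweb().srp_to_srr(srp).
# The accessions and sizes below are invented for illustration only.
metadata = pd.DataFrame(
    {
        "study_accession": ["SRP000001", "SRP000001"],
        "run_accession": ["SRR0000001", "SRR0000002"],
        "total_size": [1024, 2048],
    }
)

subset = select_fields(metadata, ["run_accession", "total_size"])

# Tab-separated output, convenient for piping into other tools.
print(subset.to_csv(sep="\t", index=False))
```

With a real query, `metadata` would be replaced by the DataFrame from the snippet above. Asking for a column that pysradb does not return raises a `KeyError` instead of silently dropping it.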
hi @returnOfTheYeti, I would go with @arcones' recommendation here.
Hello, thank you for your response. My initial problem was that multiple fields are missing from the output, compared to what is actually listed on SRA. For example, one important field, "gisaid_accession", is listed at the link below: https://www.ncbi.nlm.nih.gov/Traces/study/?query_key=2&WebEnv=MCID_65df8dedd6fce424fe3cff83&o=acc_s%3Aa
But when you list the headers in pysradb using the command srrs = list(raw_pysradb_data_frame), you get: ['study_accession', 'run_accession', 'study_title', 'experiment_accession', 'experiment_title', 'experiment_desc', 'organism_taxid', 'organism_name', 'library_name', 'library_strategy', 'library_source', 'library_selection', 'library_layout', 'sample_accession', 'sample_title', 'biosample', 'bioproject', 'instrument', 'instrument_model', 'instrument_model_desc', 'total_spots', 'total_size', 'run_total_spots', 'run_total_bases']
This is one field that is missing, but there are multiple fields that are missing from other samples. My question remains: How do I go about retrieving the "gisaid_accession" from a sample? Is it not possible? Thanks again
These fields are not standard: they are not defined for every project, and hence they are currently not supported.
In the SRA db, in the run info, as well as in the XML, one can see variables such as "GISAID_Accession" and "SARS-CoV-2_diagnostic_pcr_Ct_value_1" for certain samples (below).
https://www.ncbi.nlm.nih.gov/sra/?term=SRR15168846
But when I extract the detailed data for this sample using:
pysradb metadata SRR15168847 --detailed | head
the attributes mentioned above are missing from the pysradb output. Is there any way to retrieve ALL of the metadata? Or at least specific attributes that are not included in the "detailed" setting?
I downloaded pysradb on Feb 12, 2024 via conda
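Since these attributes are visible in the record's XML, one possible workaround is to read them from the XML directly. The sketch below uses only the Python standard library and assumes the XML follows the SRA experiment-package layout, in which each `SAMPLE` element holds `SAMPLE_ATTRIBUTE` entries with a `TAG` and a `VALUE` child. The function name `sample_attributes` is hypothetical. The XML fragment is hand-written with placeholder values and is not real SRA output, and how the XML is downloaded is left out.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment imitating the assumed SRA layout.
# All accessions and values are placeholders, not real records.
EXAMPLE_XML = """
<EXPERIMENT_PACKAGE_SET>
  <EXPERIMENT_PACKAGE>
    <SAMPLE accession="SRS0000000">
      <SAMPLE_ATTRIBUTES>
        <SAMPLE_ATTRIBUTE>
          <TAG>GISAID_Accession</TAG>
          <VALUE>EPI_ISL_0000000</VALUE>
        </SAMPLE_ATTRIBUTE>
        <SAMPLE_ATTRIBUTE>
          <TAG>SARS-CoV-2_diagnostic_pcr_Ct_value_1</TAG>
          <VALUE>00.0</VALUE>
        </SAMPLE_ATTRIBUTE>
      </SAMPLE_ATTRIBUTES>
    </SAMPLE>
  </EXPERIMENT_PACKAGE>
</EXPERIMENT_PACKAGE_SET>
"""


def sample_attributes(xml_text):
    """Map each sample accession to a dict of its TAG -> VALUE attributes."""
    root = ET.fromstring(xml_text)
    result = {}
    for sample in root.iter("SAMPLE"):
        attributes = {}
        for attribute in sample.iter("SAMPLE_ATTRIBUTE"):
            tag = attribute.findtext("TAG")
            value = attribute.findtext("VALUE")
            if tag:  # skip malformed entries that have no TAG
                attributes[tag] = value
        result[sample.get("accession")] = attributes
    return result


attributes_by_sample = sample_attributes(EXAMPLE_XML)
for accession, attributes in attributes_by_sample.items():
    print(accession, attributes.get("GISAID_Accession"))
```

Because the attribute names vary from project to project, the function keeps every tag it finds instead of expecting a fixed set of columns.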