saketkc / pysradb

Package for fetching metadata and downloading data from SRA/ENA/GEO
https://saketkc.github.io/pysradb
BSD 3-Clause "New" or "Revised" License

[ENH] Include data processing steps, reference to which the reads were aligned or if possible lab protocol into the main table #188

Open ajandria opened 1 year ago

ajandria commented 1 year ago

Is your feature request related to a problem? Please describe.

I was wondering whether it is possible to also retrieve the data processing description that is present in the sample records in GEO. See here for an example: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM6005004. There is a lot of information there that we would like to see in the table that pysradb generates:

Status
Title
Sample type
Source name
Organism
Characteristics
Treatment protocol
Growth protocol
Extracted molecule
Extraction protocol
Library strategy
Library source
Library selection
Instrument model
Description
Data processing

Describe the solution you'd like

I like the table that is currently generated using the following: df = db.sra_metadata(df["study_accession"], detailed=True, expand_sample_attributes=True, output_read_lengths=True), although I feel it sometimes misses crucial information that is only included in GEO under the individual sample records. For example, in the record of the sample linked above you can find the following:

Sequenced reads were trimmed for adaptor sequence and low-quality sequence (bbduk; minlength=30, qtrim=rl, trimq=15)
Reads were then mapped to the reference genome of Mus musculus (GRCm38) using STAR aligner version 2.5.3a with parameters --quantMode GeneCounts --runThreadN 4
Assembly: GRCm38

It would be nice to get that into the sra_metadata table too, if possible. For now I could probably just use GEOquery for that and then merge the two tables by GSM sample ID, although I would need to test that. In that case the hassle of including this here might be redundant. Still, it seems like a nice direction to take to expand this :)
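The merge-by-GSM workaround could look roughly like the sketch below. It assumes that the experiment_alias column of the pysradb table holds the GSM accession (as the maintainer's reply suggests) and that the GEO fields have already been collected into a dict keyed by GSM ID. The function name merge_by_gsm, the geo_ column prefix, and the example accessions other than GSM6005004 are hypothetical and only for illustration.

```python
def merge_by_gsm(sra_rows, geo_by_gsm, key="experiment_alias"):
    """Left-join GEO fields onto SRA rows, matching on the GSM accession.

    sra_rows:   list of dicts, one per run (e.g. df.to_dict("records")).
    geo_by_gsm: dict mapping a GSM ID to a dict of GEO fields (assumed shape).
    """
    merged = []
    for row in sra_rows:
        # Rows without a matching GSM record are kept unchanged.
        geo_fields = geo_by_gsm.get(row.get(key), {})
        # Prefix GEO columns so they cannot overwrite existing SRA columns.
        extra = {f"geo_{name}": value for name, value in geo_fields.items()}
        merged.append({**row, **extra})
    return merged


# Illustrative input only: the run accessions below are placeholders.
sra_rows = [
    {"run_accession": "SRR_EXAMPLE_1", "experiment_alias": "GSM6005004"},
    {"run_accession": "SRR_EXAMPLE_2", "experiment_alias": "GSM_WITHOUT_GEO_RECORD"},
]
geo_by_gsm = {"GSM6005004": {"data_processing": "Assembly: GRCm38"}}
merged = merge_by_gsm(sra_rows, geo_by_gsm)
```

If the table is a pandas DataFrame, the rows can be obtained with df.to_dict("records") and the result turned back into a table with pandas.DataFrame(merged).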

Thank you for your work so far!

saketkc commented 1 year ago

Thanks, this is a great suggestion! It is doable: once the experiment_alias is fetched, pysradb would need to make another request for the corresponding detailed GEO metadata. I currently do not have the bandwidth to do this, but pull requests are always welcome!
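A minimal sketch of what that second request might look like, using only the Python standard library. It assumes the sample record is fetched from GEO's acc.cgi endpoint in its plain-text (SOFT) form, where each field appears as a line such as "!Sample_data_processing = ...". The function names and the GEO_ACC_URL constant are hypothetical and not part of pysradb, and the sample text in the usage lines is illustrative rather than a verbatim GEO response.

```python
from collections import defaultdict
from urllib.parse import urlencode
from urllib.request import urlopen

GEO_ACC_URL = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"


def fetch_geo_sample_soft(gsm_id, timeout=30):
    """Download the SOFT-formatted text record for one GSM accession."""
    query = urlencode(
        {"acc": gsm_id, "targ": "self", "form": "text", "view": "quick"}
    )
    with urlopen(f"{GEO_ACC_URL}?{query}", timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_geo_sample_soft(soft_text):
    """Collect '!Sample_<field> = <value>' lines into {field: [values]}.

    Fields such as data_processing repeat once per step, so every
    field maps to a list of values in the order they appear.
    """
    prefix = "!Sample_"
    fields = defaultdict(list)
    for line in soft_text.splitlines():
        if not line.startswith(prefix) or "=" not in line:
            continue
        # Split on the first '=' only: values may contain '=' themselves.
        key, value = line.split("=", 1)
        fields[key.strip()[len(prefix):]].append(value.strip())
    return dict(fields)


# Offline usage with illustrative SOFT-style text (no network call).
example_soft = (
    "^SAMPLE = GSM6005004\n"
    "!Sample_data_processing = Sequenced reads were trimmed "
    "(bbduk; minlength=30, qtrim=rl, trimq=15)\n"
    "!Sample_data_processing = Assembly: GRCm38\n"
)
parsed = parse_geo_sample_soft(example_soft)
```

With a live connection, parse_geo_sample_soft(fetch_geo_sample_soft("GSM6005004")) would be the corresponding call. One such request per experiment_alias adds a network round trip per sample, so batching or caching would probably be needed for large studies.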