saketkc / pysradb

Package for fetching metadata and downloading data from SRA/ENA/GEO
https://saketkc.github.io/pysradb
BSD 3-Clause "New" or "Revised" License

[BUG] Duplicated metadata when querying metadata for single run accession #89

Closed kpj closed 1 year ago

kpj commented 3 years ago

Describe the bug: In some cases, when using SRAweb.sra_metadata with a single run accession, multiple metadata rows are returned. It would seem more sensible to return only the metadata for the requested run accession. This is problematic, for example, when retrieving metadata for a list of samples and expecting the number of rows to equal the number of queried samples.

To Reproduce: Execute the following code:

>>> from pysradb.sraweb import SRAweb

>>> db = SRAweb()
>>> db.sra_metadata('SRR12169246', detailed=True)  # returns metadata for both SRR12169246 and SRR12169247
#   run_accession study_accession experiment_accession  ...                                                                       ena_fastq_ftp ena_fastq_ftp_1 ena_fastq_ftp_2
# 0  SRR12169247   SRP270837       SRX8684079           ...  era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR121/047/SRR12169247/SRR12169247.fastq.gz  N/A             N/A           
# 1  SRR12169246   SRP270837       SRX8684079           ...  era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR121/046/SRR12169246/SRR12169246.fastq.gz  N/A             N/A           

[2 rows x 32 columns]
>>> db.sra_metadata('SRR12169247', detailed=True)  # returns metadata for both SRR12169246 and SRR12169247
#   run_accession study_accession experiment_accession  ...                                                                       ena_fastq_ftp ena_fastq_ftp_1 ena_fastq_ftp_2
# 0  SRR12169247   SRP270837       SRX8684079           ...  era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR121/047/SRR12169247/SRR12169247.fastq.gz  N/A             N/A           
# 1  SRR12169246   SRP270837       SRX8684079           ...  era-fasp@fasp.sra.ebi.ac.uk:vol1/fastq/SRR121/046/SRR12169246/SRR12169246.fastq.gz  N/A             N/A           

[2 rows x 32 columns]

saketkc commented 3 years ago

Thanks for the bug report @kpj! I think two runs are returned because the NCBI SRA website itself returns both when you search for this accession (for example, see https://www.ncbi.nlm.nih.gov/sra/?term=SRR12169246). That said, it can be handled internally, and I will get to it this week.

kpj commented 3 years ago

Thanks! I came across a similar issue when fetching metadata manually and ended up subsetting the dataframe.
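A minimal sketch of that kind of subsetting, assuming the metadata is a pandas DataFrame with a run_accession column as in the output above. The helper name keep_requested_runs is hypothetical and not part of pysradb, and the small frame below is a hand-built stand-in for the result of db.sra_metadata:

```python
import pandas as pd


def keep_requested_runs(metadata, run_accessions):
    """Return only the rows whose run_accession was explicitly requested.

    Hypothetical helper for illustration; not part of pysradb.
    """
    if isinstance(run_accessions, str):
        run_accessions = [run_accessions]
    mask = metadata["run_accession"].isin(run_accessions)
    return metadata[mask].reset_index(drop=True)


# Hand-built stand-in for db.sra_metadata('SRR12169246', detailed=True),
# reduced to the three accession columns shown in the report.
metadata = pd.DataFrame(
    {
        "run_accession": ["SRR12169247", "SRR12169246"],
        "study_accession": ["SRP270837", "SRP270837"],
        "experiment_accession": ["SRX8684079", "SRX8684079"],
    }
)

filtered = keep_requested_runs(metadata, "SRR12169246")
print(filtered)
```

Filtering with isin also covers the list-of-samples case, since every requested run can be passed at once.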

Maybe there's a better way of handling this.

saketkc commented 3 years ago

For now, I would recommend the fix you have in place. It is slightly tricky to deal with this internally, given that the passed-in argument could be anything (SRP/SRR/SRX/GSM etc.). The duplication does not originate in pysradb but in what the NCBI search itself returns (see my comment above).

kpj commented 3 years ago

Is the main issue figuring out which column to detect duplicates in, or which column to select the accessions from? In that case it might be an idea to add a parameter such as duplicate_accession_removal_column, which would be run_accession when input accessions are of the form ERR4413803.

This is certainly not very elegant and maybe there are other issues making this more difficult, so I am happy either way :)
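One way the column could be chosen automatically is sketched below. This is only an illustration: infer_accession_column is a hypothetical name, the prefix-to-column mapping is an assumption limited to the three accession columns visible in the output above, and anything else (for example GSM identifiers) is deliberately left unresolved.

```python
import re

# Assumed mapping from the third letter of an SRA/ENA/DDBJ accession
# (SRR, ERX, DRP, ...) to the metadata column that holds it.
_KIND_TO_COLUMN = {
    "R": "run_accession",
    "X": "experiment_accession",
    "P": "study_accession",
}

# S/E/D = archive of origin, R = "read archive", then the record kind.
_ACCESSION_RE = re.compile(r"^[SED]R([RXP])\d+$")


def infer_accession_column(accession):
    """Guess which metadata column an accession belongs to.

    Returns None when the accession type is not recognised, so the
    caller can fall back to returning the unfiltered result.
    """
    match = _ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        return None
    return _KIND_TO_COLUMN[match.group(1)]


print(infer_accession_column("ERR4413803"))
print(infer_accession_column("SRX8684079"))
```

The returned column name could then drive the filtering step, with None signalling that no safe choice exists.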

fatyang799 commented 1 year ago

I ran into the same issue, and I am confused about the relationship between multiple SRR IDs within a single SRX ID. Are these SRR IDs technical replicates from a shared sequencing library? The NCBI manual left me really confused, so I would appreciate it if you could share your understanding.

saketkc commented 1 year ago

Yes, SRRs for the same SRX are technical replicates. Here are some slides that might help: https://f1000research.com/slides/8-1183
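To check which runs share an experiment in a metadata table, grouping on experiment_accession is usually enough. The following is a rough sketch, assuming a pandas DataFrame with the run_accession and experiment_accession columns shown earlier in this thread; the frame is built by hand here rather than fetched:

```python
import pandas as pd

# Hand-built stand-in for a pysradb metadata table.
metadata = pd.DataFrame(
    {
        "run_accession": ["SRR12169247", "SRR12169246"],
        "experiment_accession": ["SRX8684079", "SRX8684079"],
    }
)

# Collect the runs belonging to each experiment as a sorted list.
runs_per_experiment = (
    metadata.groupby("experiment_accession")["run_accession"]
    .apply(sorted)
    .to_dict()
)

for experiment, runs in runs_per_experiment.items():
    print(experiment, len(runs), runs)
```

Experiments that map to more than one run are the ones where a single-run query can come back with extra rows.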

fatyang799 commented 1 year ago

Many thanks for your quick reply!!

In passing, I would like to raise another problem that I encountered while using pysradb. The metadata I fetch with pysradb metadata --detailed does not include some important information.

For example, I want to get the antibody information for a ChIP-seq experiment ([SRX027872](https://www.ncbi.nlm.nih.gov/sra/SRX027872%5Baccn%5D)). On the NCBI website, I can see the antibody information (in the Experiment attributes section), but there is no related field in the metadata I fetch with pysradb.

saketkc commented 1 year ago

@sheep-liu thanks for bringing it to my attention. I have pushed https://github.com/saketkc/pysradb/commit/7da562f86fe759f737b25f6581a8c44a9437b5b4, which enables fetching the experiment protocol. It will be in the next release (you can install the develop version from GitHub for now).

In the future, please create a new issue. I will close this for now, as I think the original issue is best handled downstream.

fatyang799 commented 1 year ago

Roger! And thanks a lot.