pepkit / geofetch

Builds a PEP from SRA or GEO accessions
https://pep.databio.org/geofetch/
BSD 2-Clause "Simplified" License

Downloading the supplementary files #138

Closed Zethson closed 2 weeks ago

Zethson commented 2 weeks ago

Dear developers,

I've been playing around with the queries for a while now and looked at the tutorials, including https://pep.databio.org/geofetch/code/processed-data-downloading/ and https://pep.databio.org/geofetch/code/raw-data-downloading/ . However, there are two things that I can't seem to figure out:

  1. How can I download the raw counts as specified in the R script in https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE139940 ? I couldn't even find the dataset in the UI, only the script.
  2. How do I download a single specific file from the list of supplementary files of https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE139940 ? In this case GSE139940_180821_lupus_RNA-seq_results.gene.rpkm.xlsx.

Sorry for the user questions.

Thank you very much!

khoroshevskyi commented 2 weeks ago

Hello, thank you for your question.

  1. This is a new feature of GEO. It seems that it was added to GEO a year ago. Unfortunately, after looking closely into this issue, I couldn't find any metadata about the raw counts, or a link to them, in the metadata that the GEO API provides. As a result, geofetch can't handle this specific GEO feature (GEO2R) 😢.

The power of geofetch is in saving all sample metadata in a nicely combined CSV file, along with the downloaded processed or raw (SRA) files. The GEO database consists of two main units: Projects (series, GSE) and Samples (GSM). Both of these units can have files attached, and geofetch can work in a sample-centric (GSM) way as well as in a project-centric (GSE) way. That's why users should specify which data or metadata they are interested in. In your case, it is the project-level files.

Additionally, geofetch has a regex filter available for files, so you can easily download data with specific naming patterns or formats. In your case, the CLI command will look like this:

geofetch -i GSE139940 --processed --data-source series --filter GSE139940_180821_lupus_RNA-seq_results.gene.rpkm.xlsx --geo-folder .
  1. We want to download processed data: --processed
  2. The data source is a Project (series or GSE): --data-source series
  3. Keep only the files whose names match a regex: --filter GSE139940_180821_lupus_RNA-seq_results.gene.rpkm.xlsx
  4. Specify the downloads folder: --geo-folder .
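Since the filter is treated as a regex, the dots in the file name match any character, so the plain file name is a slightly looser pattern than it looks. Here is a small sketch of how a stricter pattern could be built with Python's standard re module. It assumes the filter is applied with search-style matching against the file name, which I have not verified against the geofetch source; the lookalike name and the matches helper are made up for illustration only.

```python
import re

# The file requested in this thread.
filename = "GSE139940_180821_lupus_RNA-seq_results.gene.rpkm.xlsx"
# Hypothetical name (not a real GEO file) that differs only where a "." used to be.
lookalike = "GSE139940_180821_lupus_RNA-seq_resultsXgene.rpkm.xlsx"

# Pattern as passed in the command above: every "." matches any character.
raw_pattern = filename
# Stricter pattern: escape regex metacharacters and anchor at the end of the name.
strict_pattern = re.escape(filename) + "$"


def matches(pattern, name):
    """Return True if the pattern is found anywhere in the name.

    Search-style matching is an assumption about how the filter is applied.
    """
    return re.search(pattern, name) is not None


print(matches(raw_pattern, lookalike))     # the loose pattern accepts the lookalike
print(matches(strict_pattern, lookalike))  # the escaped pattern rejects it
print(strict_pattern)                      # quote this in the shell before passing it to --filter
```

For a single, fully spelled-out file name like this one the difference rarely matters in practice, but escaping helps when a project has many similarly named supplementary files.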

Let me know if this command worked for you and if you have any other questions.

Zethson commented 2 weeks ago

Thank you, this did the trick!