Add code, URLs, screenshots etc to "Simon’s half-baked way of looking for relevant metagenomes in SRA"

turbomam commented 6 months ago

Simon’s half-baked way of looking for relevant metagenomes in SRA

turbomam commented 6 months ago

First character:

S (NCBI SRA)
E (EBI / ENA)
D (DDBJ)

Second character is always R

Third character:

R (run)
P (study)
A (submission)
X (experiment)
S (sample)
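As a rough illustration of the prefix scheme, here is a minimal Python sketch that splits an accession into archive and object type. The function name `describe_accession` and the two lookup tables are made up for this example, and the mappings assume the usual INSDC convention (first letter names the archive, third letter names the object type, with A for submissions).

```python
# Sketch only: decode the three-letter prefix of an SRA-style accession.
# Both lookup tables are assumptions based on the usual INSDC convention.
ARCHIVE = {"S": "NCBI SRA", "E": "EBI ENA", "D": "DDBJ"}
OBJECT_TYPE = {
    "R": "run",
    "P": "study",
    "A": "submission",
    "X": "experiment",
    "S": "sample",
}


def describe_accession(accession):
    """Return (archive, object_type) for an accession such as 'SRR1234567'."""
    prefix = accession.strip().upper()[:3]
    # The second character is expected to be 'R' for every SRA-style accession.
    if len(prefix) < 3 or prefix[1] != "R":
        raise ValueError(f"not an SRA-style accession: {accession!r}")
    try:
        return ARCHIVE[prefix[0]], OBJECT_TYPE[prefix[2]]
    except KeyError:
        raise ValueError(f"unrecognized accession prefix: {prefix!r}") from None
```

For example, `describe_accession("ERX000001")` would be read as an experiment record submitted through EBI.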

turbomam commented 6 months ago

https://www.bioconductor.org/packages/release/bioc/html/SRAdb.html

curl --output SRAmetadb.sqlite.gz https://gbnci.cancer.gov/backup/SRAmetadb.sqlite.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6865M  100 6865M    0     0  20.9M      0  0:05:27  0:05:27 --:--:-- 23.0M

turbomam commented 6 months ago

gunzip SRAmetadb.sqlite.gz
ls -lh SRAmetadb.sqlite

-rw-rw-r-- 1 mark mark 138G Mar 1 12:19 SRAmetadb.sqlite
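With a file this large, it may be worth opening it read-only before poking around, so nothing can modify it by accident. The following is a minimal sketch using Python's standard `sqlite3` module; the helper names `open_read_only` and `list_tables` are hypothetical, and the only assumption about the file is that it is a regular SQLite database.

```python
import sqlite3
from pathlib import Path


def open_read_only(db_path):
    """Open an existing SQLite file in read-only mode via a file: URI."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def list_tables(conn):
    """Return the names of all tables recorded in sqlite_master, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]
```

Calling `list_tables(open_read_only("SRAmetadb.sqlite"))` should then show which tables are available to query.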

turbomam commented 6 months ago

select
    count(run_accession)
from
    run r ;

2,055,694

simroux commented 6 months ago

Looking into this sqlite version of SRA, one thing I'm confused about is the content of "run" vs "sra_ft".

As you show above, the run table has 2,055,694 distinct accessions.
But a similar query on sra_ft, select count(distinct run_accession) from sra_ft, returns:

16,385,539

I am not sure how to explain the difference. I didn't find any obvious, consistent metadata that would characterize the ~2.1M subset found in the run table.
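One way to dig into the mismatch might be to count how many run accessions appear in only one of the two tables. Below is a minimal sketch with Python's `sqlite3`; the table names `run` and `sra_ft` and the column `run_accession` come from the queries in this thread, while the function name `compare_run_accessions` and the returned keys are made up for illustration.

```python
import sqlite3


def compare_run_accessions(conn):
    """Count distinct run accessions per table and those unique to each."""

    def scalar(sql):
        # Run a query that returns a single value and unwrap it.
        return conn.execute(sql).fetchone()[0]

    return {
        "run": scalar("SELECT count(DISTINCT run_accession) FROM run"),
        "sra_ft": scalar("SELECT count(DISTINCT run_accession) FROM sra_ft"),
        # EXCEPT keeps the distinct accessions present on the left side only.
        "only_in_sra_ft": scalar(
            "SELECT count(*) FROM ("
            " SELECT run_accession FROM sra_ft WHERE run_accession IS NOT NULL"
            " EXCEPT SELECT run_accession FROM run)"
        ),
        "only_in_run": scalar(
            "SELECT count(*) FROM ("
            " SELECT run_accession FROM run WHERE run_accession IS NOT NULL"
            " EXCEPT SELECT run_accession FROM sra_ft)"
        ),
    }
```

If `only_in_run` comes back as 0, the run table is a strict subset of sra_ft, and the accessions counted under `only_in_sra_ft` would be the ones to inspect for a distinguishing pattern. On the full file these queries may take a while.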

Another note: the most recent dataset in the sqlite file seems to date from 2023-11-21, so it has been updated relatively recently.

It may be worth reaching out to the lab linked to this file (https://irp.nih.gov/pi/paul-meltzer ?) to ask whether they plan to keep maintaining these files and/or to share the code used to generate them.

turbomam / biosample-xmldb-sqldb

Add code, URLs, screenshots etc to "Simon’s half-baked way of looking for relevant metagenomes in SRA" #10