turbomam / biosample-xmldb-sqldb

Tools for loading NCBI Biosample into an XML database and then transforming that into a SQL database
MIT License

Add code, URLs, screenshots etc to "Simon’s half-baked way of looking for relevant metagenomes in SRA" #10

Open turbomam opened 6 months ago

turbomam commented 6 months ago

Simon’s half-baked way of looking for relevant metagenomes in SRA

turbomam commented 6 months ago

First character:

Second character is always R

Third character:
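A minimal sketch of decoding the three-letter prefix, assuming the usual INSDC accession convention: the first letter names the archive that accepted the submission, the second is always R, and the third names the object type. The two mapping tables and the helper name `decode_prefix` are assumptions for illustration, not taken from this thread.

```python
# Assumed mappings, based on the common INSDC accession convention.
ARCHIVE = {"S": "NCBI SRA", "E": "EMBL-EBI ENA", "D": "DDBJ"}
OBJECT_TYPE = {
    "A": "submission",
    "P": "study",
    "X": "experiment",
    "S": "sample",
    "R": "run",
    "Z": "analysis",
}


def decode_prefix(accession):
    """Return (archive, object_type) for an accession such as 'SRR1234567'.

    Returns None when the prefix does not follow the assumed pattern.
    """
    acc = accession.strip().upper()
    # Need three prefix letters, 'R' in second position, digits after the prefix.
    if len(acc) < 4 or acc[1] != "R" or not acc[3:].isdigit():
        return None
    archive = ARCHIVE.get(acc[0])
    object_type = OBJECT_TYPE.get(acc[2])
    if archive is None or object_type is None:
        return None
    return archive, object_type
```

For example, `decode_prefix("ERX000001")` reports an experiment accessioned at ENA, while a BioProject ID such as `PRJNA12345` is rejected because it does not fit the pattern.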

turbomam commented 6 months ago

https://www.bioconductor.org/packages/release/bioc/html/SRAdb.html

curl --output SRAmetadb.sqlite.gz  https://gbnci.cancer.gov/backup/SRAmetadb.sqlite.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6865M  100 6865M    0     0  20.9M      0  0:05:27  0:05:27 --:--:-- 23.0M

turbomam commented 6 months ago
gunzip SRAmetadb.sqlite.gz
ls -lh SRAmetadb.sqlite

-rw-rw-r-- 1 mark mark 138G Mar 1 12:19 SRAmetadb.sqlite

turbomam commented 6 months ago
select
    count(run_accession)
from
    run r ;

2,055,694

simroux commented 6 months ago

Looking into this sqlite version of SRA, one thing I'm confused about is the content of "run" vs "sra_ft".

I am not sure how to explain the difference. I didn't find any obvious, consistent metadata that would characterize the ~2.5M subset found in the run table.
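One way to put numbers on the "run" vs "sra_ft" difference is to count the distinct run accessions in each table and how many appear on only one side. This is a rough sketch, not a tested recipe for this file: it assumes that "sra_ft" also exposes a `run_accession` column, and `compare_run_coverage` is a hypothetical helper name.

```python
import sqlite3


def compare_run_coverage(conn, left="run", right="sra_ft", key="run_accession"):
    """Count distinct keys per table and the keys present in only one of them."""
    # Table and column names are interpolated into the SQL text,
    # so only pass trusted identifiers here.
    def scalar(sql):
        return conn.execute(sql).fetchone()[0]

    def only_in(a, b):
        # EXCEPT already de-duplicates; NULL keys are excluded explicitly.
        return scalar(
            f"select count(*) from ("
            f"select {key} from {a} where {key} is not null "
            f"except select {key} from {b})"
        )

    return {
        left: scalar(f"select count(distinct {key}) from {left}"),
        right: scalar(f"select count(distinct {key}) from {right}"),
        f"only_in_{left}": only_in(left, right),
        f"only_in_{right}": only_in(right, left),
    }
```

Usage would look like `compare_run_coverage(sqlite3.connect("SRAmetadb.sqlite"))`. On a file of this size the queries are likely to be full scans and may take a while.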

Another note: the most recent dataset in the sqlite file seems to date from 2023-11-21, so it has been updated relatively recently.
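A quick way to re-check the most recent date after a fresh download could look like the sketch below. The comment above does not say which field the date came from, so the table and column are left as parameters. `latest_value` is a hypothetical helper, and it assumes the dates are stored as ISO-style strings, so that the largest string is also the latest date.

```python
import sqlite3


def latest_value(conn, table, column):
    """Return the largest non-empty value in table.column, or None if there is none."""
    # Identifiers are interpolated into the SQL text, so pass only trusted names.
    row = conn.execute(
        f"select max({column}) from {table} "
        f"where {column} is not null and {column} <> ''"
    ).fetchone()
    return row[0]
```

For instance, `latest_value(conn, "run", "run_date")` would apply, assuming the run table has a `run_date` column.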

It may be worth reaching out to the lab linked to this file (possibly https://irp.nih.gov/pi/paul-meltzer) to see whether they plan to maintain this file and/or the code used to generate it.