Open turbomam opened 6 months ago
First character:
Second character is always R
Third character:
https://www.bioconductor.org/packages/release/bioc/html/SRAdb.html
curl --output SRAmetadb.sqlite.gz https://gbnci.cancer.gov/backup/SRAmetadb.sqlite.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6865M  100 6865M    0     0  20.9M      0  0:05:27  0:05:27 --:--:-- 23.0M
gunzip SRAmetadb.sqlite.gz
ls -lh SRAmetadb.sqlite
-rw-rw-r-- 1 mark mark 138G Mar 1 12:19 SRAmetadb.sqlite
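Before querying a file this size, it can help to confirm that the uncompressed database opens and to see which tables it holds. Here is a minimal Python sketch, assuming the file sits in the working directory as SRAmetadb.sqlite. The helper name list_tables is made up for illustration and is not part of SRAdb.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in the database, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Example usage (assumes SRAmetadb.sqlite is in the current directory).
# Opening through a read-only URI keeps the file from being modified by accident:
#
#   conn = sqlite3.connect("file:SRAmetadb.sqlite?mode=ro", uri=True)
#   print(list_tables(conn))
#   conn.close()
```

Full-text-search tables such as sra_ft may show up alongside several internal shadow tables, so the list can be longer than expected.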
select
count(run_accession)
from
run r ;
2,055,694
Looking into this sqlite version of SRA, one thing I'm confused about is the content of "run" vs "sra_ft".
select count(distinct run_accession) from sra_ft
16,385,539
?
I am not sure how to explain the difference. I didn't find any obvious metadata that consistently characterizes the ~2.5M subset found in the run table.
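One way to narrow this down is to check whether the run table is a strict subset of sra_ft, and how many accessions appear on only one side. The sketch below is a rough starting point, not a tested recipe for the real file. It assumes only what the queries above already use (a run_accession column in both run and sra_ft), and the function name compare_run_accessions is hypothetical.

```python
import sqlite3


def compare_run_accessions(conn):
    """Count distinct run_accession values in run and sra_ft, and the
    values that occur in only one of the two tables."""

    def scalar(sql):
        # Run a query that yields a single value and return that value.
        return conn.execute(sql).fetchone()[0]

    return {
        "run": scalar("SELECT count(DISTINCT run_accession) FROM run"),
        "sra_ft": scalar("SELECT count(DISTINCT run_accession) FROM sra_ft"),
        # EXCEPT yields distinct rows of the left query absent from the right.
        # NULLs are filtered out so they are not counted as an accession.
        "only_in_sra_ft": scalar(
            "SELECT count(*) FROM ("
            " SELECT run_accession FROM sra_ft WHERE run_accession IS NOT NULL"
            " EXCEPT SELECT run_accession FROM run)"
        ),
        "only_in_run": scalar(
            "SELECT count(*) FROM ("
            " SELECT run_accession FROM run WHERE run_accession IS NOT NULL"
            " EXCEPT SELECT run_accession FROM sra_ft)"
        ),
    }
```

If only_in_run comes back as 0, the run table is a subset of sra_ft, and the open question becomes what the missing rows have in common. On a 138G file these scans may take a while.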
Another note: the most recent dataset in the sqlite file seems to date from 2023-11-21, so it has been updated relatively recently.
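The freshness check can be repeated with a single max() over a date column, since ISO-formatted dates (YYYY-MM-DD) sort correctly as text. The comment above does not say which table or column the 2023-11-21 date came from, so the sketch below takes both as parameters. The function name latest_value is hypothetical, and the names in the usage comment are placeholders to be checked against the actual schema.

```python
import re
import sqlite3

# Table and column names cannot be bound as SQL parameters, so they are
# validated against a conservative pattern before being put into the query.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def latest_value(conn, table, column):
    """Return the largest non-NULL value of `column` in `table`."""
    for name in (table, column):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    sql = f'SELECT max("{column}") FROM "{table}"'
    return conn.execute(sql).fetchone()[0]


# Example usage; "sra" and "updated_date" are assumed names, not confirmed
# by this thread:
#
#   conn = sqlite3.connect("file:SRAmetadb.sqlite?mode=ro", uri=True)
#   print(latest_value(conn, "sra", "updated_date"))
```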
It may be worth reaching out to the lab linked to this file (https://irp.nih.gov/pi/paul-meltzer ?) to see whether they have plans for maintaining these files and/or the code used to generate them.
Simon’s half-baked way of looking for relevant metagenomes in SRA