sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

terribly large data sets #2765

Open ctb opened 1 year ago

ctb commented 1 year ago

per luiz:

we have a new champion! https://trace.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&acc=SRR21113412&display=metadata is the largest signature in wort, 4.1GB :joy:


this fellow: https://www.ncbi.nlm.nih.gov/Taxonomy/taxi/images/15060

:joy: :sob:

% sourmash sig describe /data/wort/wort-sra/sigs/SRR21113412.sig

== This is sourmash version 4.8.4.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

---
signature filename: /data/wort/wort-sra/sigs/SRR21113412.sig
signature: SRR21113412
source file: -
md5: de258229d7405173c506b72a8a77faae
k=21 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=1
size: 172806396
sum hashes: 277534280
signature license: CC0

---
signature filename: /data/wort/wort-sra/sigs/SRR21113412.sig
signature: SRR21113412
source file: -
md5: 6a3caaba5c8bb75fe08da77ec1831d35
k=31 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=1
size: 247405570
sum hashes: 279277902
signature license: CC0

---
signature filename: /data/wort/wort-sra/sigs/SRR21113412.sig
signature: SRR21113412
source file: -
md5: aa2cb69bcbf138fdc92513a298bde44f
k=51 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=1
size: 275125309
sum hashes: 280660045
signature license: CC0

loaded 3 signatures total, from 1 files

took 25 minutes to run this sig describe :joy:

muuuuuch more manageable:

$ sourmash sig filter -m 2 -o SRR21113412.sig /data/wort/wort-sra/sigs/SRR21113412.sig

== This is sourmash version 4.8.4.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded 3 total that matched ksize & molecule type
extracted 3 signatures from 1 file(s)
$ sourmash sig describe SRR21113412.sig

== This is sourmash version 4.8.4.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

---
signature filename: SRR21113412.sig
signature: SRR21113412
source file: -
md5: 379a21daefdef0c0f150b78a5da7d274
k=21 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=1
size: 26534720
sum hashes: 131262604
signature license: CC0

---
signature filename: SRR21113412.sig
signature: SRR21113412
source file: -
md5: 1f41440f567977b1e860521f7f962549
k=31 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=1
size: 8244215
sum hashes: 40116547
signature license: CC0

---
signature filename: SRR21113412.sig
signature: SRR21113412
source file: -
md5: 08728464c0f7f8d2ddf2c423ecbee351
k=51 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=1
size: 1167072
sum hashes: 6701808
signature license: CC0

loaded 3 signatures total, from 1 files

(26m for filter, 42s for describe)
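For anyone wondering what the `-m 2` above is doing: as far as I understand it, `sig filter` with a minimum abundance of 2 drops every hash whose abundance is below 2 and keeps the rest with their counts. Here is a toy sketch of that idea in plain Python. The function name and the toy hash values are made up for illustration; this is not sourmash's implementation.

```python
def filter_by_min_abundance(hash_abunds, min_abundance=2):
    """Keep only hashes seen at least `min_abundance` times.

    `hash_abunds` maps hash value -> abundance (hypothetical toy structure,
    standing in for an abundance-tracking sketch).
    """
    return {h: a for h, a in hash_abunds.items() if a >= min_abundance}


# Toy sketch with made-up hash values and abundances.
toy = {11: 1, 22: 5, 33: 1, 44: 2, 55: 1}
kept = filter_by_min_abundance(toy, min_abundance=2)

print(len(toy), "->", len(kept), "hashes")
print(sum(toy.values()), "->", sum(kept.values()), "sum of abundances")
```

Every hash removed at `min_abundance=2` has abundance exactly 1, so the number of hashes and the sum of abundances drop by the same amount.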

it's cool how the 31-mer and 51-mer cardinality goes up for unfiltered, but goes down for filtered (when compared to 21-mers)
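To put numbers on that, here is a small calculation using only the `size` and `sum hashes` values from the two `sig describe` outputs above. The variable and function names are mine. It also checks that the drop in `sum hashes` equals the drop in `size`, which is what you would expect if every hash removed by `-m 2` had abundance 1.

```python
# (ksize, unfiltered size, unfiltered sum, filtered size, filtered sum),
# copied from the two `sig describe` outputs.
rows = [
    (21, 172806396, 277534280, 26534720, 131262604),
    (31, 247405570, 279277902, 8244215, 40116547),
    (51, 275125309, 280660045, 1167072, 6701808),
]


def retained_fraction(unfiltered_size, filtered_size):
    """Fraction of distinct hashes that survive the abundance filter."""
    return filtered_size / unfiltered_size


for k, u_size, u_sum, f_size, f_sum in rows:
    dropped = u_size - f_size
    # Each dropped hash should account for exactly one unit of abundance.
    consistent = dropped == (u_sum - f_sum)
    frac = retained_fraction(u_size, f_size)
    print(f"k={k}: kept {frac:.2%} of hashes, dropped {dropped:,} (abundance-1 only: {consistent})")
```

The retained fraction falls from roughly 15% at k=21 to roughly 3.3% at k=31 and 0.4% at k=51, so almost all of the growth in unfiltered cardinality at larger k comes from hashes seen only once.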

no guarantees on the quality of results, but mastiff gather took 19m54s on the filtered one, and has been running for 2h+ on the original one :joy:

wondering how well isolated was the genome before sequencing :upside_down_face:

❯ head SRR21113412.csv
GCF_002222635.1 Sulfitobacter pseudonitzschiae strain=SMR1, ASM222263v1 4648000 0.9335207873066881
GCF_005144905.1 Vibrio cyclitrophicus strain=ECSMB14105, ASM514490v1 4465000 0.9093686354378818
GCA_001562115.1 Alteromonas stellipolaris strain=LMG 21861, ASM156211v1 3835000 0.8013383521539105
GCA_007988745.1 Pseudoalteromonas atlantica strain=NBRC 103033, ASM798874v1 3800000 0.8508535489667565
GCF_002115725.1 Marivita cryptomonadis strain=CL-SK44, ASM211572v1 3349000 0.719412019022914
GCF_000733925.1 Arenibacter algicola strain=TG409, ASM73392v1 3287000 0.6098330241187384
GCA_000831005.1 Marinobacter salarius strain=R9SW1, ASM83100v1 2782000 0.6007785467128027
GCF_002890895.1 Pseudomonas stutzeri strain=4C29, ASM289089v1 2653000 0.5937288517933679
GCF_000014745.1 Maricaulis maris MCS10 strain=MCS10, ASM1474v1 2515000 0.7632776934749621
GCF_001447995.1 Maribacter dokdonensis DSW-8 strain=DSW-8, DSW8_denovo_v1 2388000 0.5383397421397874

luizirber

same index, but with -s 10000 takes 2m53s to run:

❯ head SRR21113412.csv
GCF_002222635.1 Sulfitobacter pseudonitzschiae strain=SMR1, ASM222263v1 4570000 0.9364754098360656
GCF_005144905.1 Vibrio cyclitrophicus strain=ECSMB14105, ASM514490v1 4160000 0.8813559322033898
GCA_001562115.1 Alteromonas stellipolaris strain=LMG 21861, ASM156211v1 3930000 0.7875751503006012
GCA_007988745.1 Pseudoalteromonas atlantica strain=NBRC 103033, ASM798874v1 3880000 0.8308351177730193
GCF_002115725.1 Marivita cryptomonadis strain=CL-SK44, ASM211572v1 3400000 0.7100840336134454
GCF_000733925.1 Arenibacter algicola strain=TG409, ASM73392v1 3230000 0.6003717472118959
GCA_000831005.1 Marinobacter salarius strain=R9SW1, ASM83100v1 2850000 0.6319290465631929
GCF_002890895.1 Pseudomonas stutzeri strain=4C29, ASM289089v1 2790000 0.610989010989011
GCF_000014745.1 Maricaulis maris MCS10 strain=MCS10, ASM1474v1 2510000 0.7652439024390244
GCA_009649675.1 Alphaproteobacteria bacterium HT1-32 strain=HT1-32, ASM964967v1 2480000 0.5210970464135021
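A rough sketch of why going from scaled=1000 to scaled=10000 is cheap and shrinks the query about tenfold: a scaled sketch keeps the hashes that fall below a cutoff of about 2^64 / scaled, so downsampling just applies a lower cutoff to hashes you already have. The code below is a toy illustration under that assumption. The helper names are hypothetical, the hashes are random numbers rather than real k-mer hashes, and sourmash's exact cutoff rounding may differ.

```python
import random

HASH_SPACE = 2**64  # assumption: 64-bit hash values


def max_hash_for_scaled(scaled):
    """Approximate cutoff below which hashes are kept at a given scaled value."""
    return HASH_SPACE // scaled


def downsample(hashes, new_scaled):
    """Keep only the hashes that would also be kept at the larger scaled value."""
    cutoff = max_hash_for_scaled(new_scaled)
    return {h for h in hashes if h < cutoff}


# Toy scaled=1000 sketch: random values drawn below the scaled=1000 cutoff.
rng = random.Random(42)
sketch_1000 = {rng.randrange(max_hash_for_scaled(1000)) for _ in range(10_000)}
sketch_10000 = downsample(sketch_1000, 10_000)

print(len(sketch_1000), "hashes at scaled=1000")
print(len(sketch_10000), "hashes at scaled=10000")
```

Going the other way (scaled=1000 to scaled=100) is not possible from the sketch alone, because the hashes between the two cutoffs were never stored.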

food for thought: can't do this easily without redownloading all the metagenomes and sketching with scaled=100, but an SRA metagenome index at that resolution would be ~10TB (which fits on HDDs/SSDs) and would potentially allow viral queries?

unfiltered/s1000 finished after 3h34m, but I didn't save the output properly and can't compare :joy: unfiltered/s10000 is running now, should finish soon

unfiltered/s10000 took 37m29s, top results:

$ head SRR21113412-10k-unfiltered.csv
GCF_002222635.1 Sulfitobacter pseudonitzschiae strain=SMR1, ASM222263v1 4680000 0.9590163934426229
GCF_005144905.1 Vibrio cyclitrophicus strain=ECSMB14105, ASM514490v1 4280000 0.9067796610169492
GCA_001562115.1 Alteromonas stellipolaris strain=LMG 21861, ASM156211v1 4070000 0.8156312625250501
GCA_007988745.1 Pseudoalteromonas atlantica strain=NBRC 103033, ASM798874v1 4060000 0.8693790149892934
GCF_002115725.1 Marivita cryptomonadis strain=CL-SK44, ASM211572v1 4070000 0.8508403361344538
GCF_001931535.1 Minicystis rosea strain=DSM 24000, ASM193153v1 3700000 0.23270440251572327
GCF_016863635.1 Virgisporangium aurantiacum strain=NBRC 16421, ASM1686363v1 3540000 0.24877020379479972
GCF_000733925.1 Arenibacter algicola strain=TG409, ASM73392v1 3530000 0.6561338289962825
GCF_000418325.1 Sorangium cellulosum So0157-2 strain=So0157-2, ASM41832v1 3580000 0.23509711989283322
GCA_000831005.1 Marinobacter salarius strain=R9SW1, ASM83100v1 3140000 0.6962305986696231
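For a side-by-side view of the four gather timings reported in this thread, here is a quick calculation. It assumes the 2m53s run with `-s 10000` was on the filtered signature, which is how I read "same index" above; the dictionary layout and helper name are mine.

```python
def to_seconds(h=0, m=0, s=0):
    """Convert hours/minutes/seconds to seconds."""
    return h * 3600 + m * 60 + s


# Wall-clock times as reported in the thread.
runs = {
    ("filtered", 1000): to_seconds(m=19, s=54),
    ("unfiltered", 1000): to_seconds(h=3, m=34),
    ("filtered", 10000): to_seconds(m=2, s=53),  # assumed to be the filtered run
    ("unfiltered", 10000): to_seconds(m=37, s=29),
}

for scaled in (1000, 10000):
    ratio = runs[("unfiltered", scaled)] / runs[("filtered", scaled)]
    print(f"scaled={scaled}: abundance filtering is about {ratio:.1f}x faster")

for kind in ("filtered", "unfiltered"):
    ratio = runs[(kind, 1000)] / runs[(kind, 10000)]
    print(f"{kind}: scaled=10000 is about {ratio:.1f}x faster than scaled=1000")
```

These are single runs, so the ratios are only a rough indication.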
luizirber commented 1 year ago

Note: I don't think this is supposed to be a metagenome; this seems to be a genome assembly project! gather won't capture the actual organism because it is a euk and there are no euks in the rs207 reference database I used for gather. But the reads certainly seem to have microbial contamination going on =]

wort computes signatures for euks as long as they are not animal or plant. Since this is algae, it was calculated too.

ctb commented 1 year ago

and also: that big file alone had a total of 23,000 bacterial genomes in the gather file and 12,300 archaea, protozoa, fungi, and viruses in the gather. Total of > 35,000.

from hugo.