sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

explore building database minimum set covers #1852

Open ctb opened 2 years ago

ctb commented 2 years ago

gave a talk at USC yesterday, and Mark Chaisson suggested that we consider preprocessing the highly redundant databases to remove some of the redundancy, and/or adjusting the approach in light of known redundancy. one particular idea was (IIRC) to reduce memory for human microbiome data sets by pre-selecting some sketches and species that tend to show up in them.

ctb commented 2 years ago

making a database minimum cover

implemented some stuff over in https://github.com/ctb/2022-database-covers.

In brief, make-db-cover.py iterates over the sketches in the database and, for each sketch, removes any hashes already seen in a previously processed sketch; sketches left with no hashes at all are dropped.

At the end of the process, you should have a new 'cover' database where the hash content of the database is the same as the original, but there is no redundancy between the sketches.

A simpler version (that would not be nearly as effective) would be to simply remove any sketches that exactly duplicate another sketch.

Either way, you should be able to identify the same number of hashes for a given query. Initial query results are promising but we may need to adjust thresholds for reporting.
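The core of the greedy idea fits in a few lines of Python. This is a minimal sketch, not the actual make-db-cover.py: the `make_cover` name is mine, it flattens abundances, and it assumes sourmash's `SaveSignaturesToLocation` helper for writing the output collection.

```python
import sourmash
from sourmash import SourmashSignature
from sourmash.sourmash_args import SaveSignaturesToLocation

def make_cover(db_path, out_path, ksize=31):
    seen = set()                      # every hash already kept in the cover
    with SaveSignaturesToLocation(out_path) as save_sigs:
        for ss in sourmash.load_file_as_signatures(db_path, ksize=ksize):
            new_hashes = set(ss.minhash.hashes) - seen
            if not new_hashes:
                continue              # sketch is fully redundant: drop it
            seen.update(new_hashes)
            mh = ss.minhash.copy_and_clear()   # empty sketch, same params
            mh.add_many(new_hashes)
            save_sigs.add(SourmashSignature(mh, name=ss.name))
```

Since each hash ends up in exactly one sketch, the union of hashes is preserved while redundant sketches shrink or disappear entirely; that's where the sketch and hash counts in the summary below come from.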

summary

| database | num sketches | total hashes | file size |
| --- | --- | --- | --- |
| gtdb entire | 258,406 | 98,501,934 | 901 MB |
| gtdb cover | 154,949 | 17,237,504 | 214 MB |

So by removing redundant portions of sketches we eliminate 40% of the sketches and 83% of the hashes in the database, and decrease the file size by 76%.
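Those percentages follow directly from the table; a quick check, with the values copied from above:

```python
# values copied from the summary table above
before = {'sketches': 258_406, 'hashes': 98_501_934, 'MB': 901}
after  = {'sketches': 154_949, 'hashes': 17_237_504, 'MB': 214}
for k in before:
    print(f'{k}: {1 - after[k]/before[k]:.0%} reduction')
# sketches: 40% reduction / hashes: 83% reduction / MB: 76% reduction
```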

downsampled GTDB rs202 to 10k:

GTDB entire, at a scaled of 10000:

```
% sourmash sig fileinfo gtdb-rs202.genomic.k31.scaled\=10000.zip

== This is sourmash version 4.2.4.dev27+g0a1b713f. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'gtdb-rs202.genomic.k31.scaled=10000.zip'
path filetype: ZipFileLinearIndex
location: /group/ctbrowngrp2/ctbrown/cover/gtdb-rs202.genomic.k31.scaled=10000.zip
is database? yes
has manifest? yes
num signatures: 258406
** examining manifest...
total hashes: 98501934
summary of sketches:
   258406 sketches with DNA, k=31, scaled=10000, abund 98501934 total hashes
```

building a minimum set cover for the database:

then ran:

```
% ./2022-database-covers/make-db-cover.py gtdb-rs202.genomic.k31.scaled\=10000.zip -o gtdb-rs202.genomic.k31.cover.zip
```

which progressively discarded already seen hashes.

```
% sourmash sig fileinfo gtdb-rs202.genomic.k31.cover.zip

== This is sourmash version 4.2.4.dev27+g0a1b713f. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

** loading from 'gtdb-rs202.genomic.k31.cover.zip'
path filetype: ZipFileLinearIndex
location: /group/ctbrowngrp2/ctbrown/cover/gtdb-rs202.genomic.k31.cover.zip
is database? yes
has manifest? yes
num signatures: 154949
** examining manifest...
total hashes: 17237504
summary of sketches:
   154949 sketches with DNA, k=31, scaled=10000, abund 17237504 total hashes
```

ctb commented 2 years ago

Did some validation on the covers code.

```
# grab the k-31 signatures from podar-ref database
sourmash sig cat ../sourmash/podar-ref.zip -k 31 -o podar-ref.zip

# make a database cover
./make-db-cover.py podar-ref.zip -o podar-ref.cover.zip --scaled=1000

# build a single merged signature from the database cover
sourmash sig merge podar-ref.cover.zip -o podar-ref.cover.merge.zip

# build a single merged signature from the original database
sourmash sig merge podar-ref.zip -o podar-ref.merge.zip

# compare!
sourmash sig overlap podar-ref.merge.zip podar-ref.cover.merge.zip
```

and at the end you do indeed see that the cover contains all the same k-mers as the original database:

```
== This is sourmash version 4.2.5.dev19+g3a6028fb.d20220217. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loaded one signature each from podar-ref.merge.zip and podar-ref.cover.merge.zip
first signature:
  signature filename: podar-ref.merge.zip
  signature: 
  md5: 199f2f9d45b82cff605f231b1803c912
  k=31 molecule=DNA num=0 scaled=1000

second signature:
  signature filename: podar-ref.cover.merge.zip
  signature: 
  md5: 199f2f9d45b82cff605f231b1803c912
  k=31 molecule=DNA num=0 scaled=1000

similarity:                  1.00000
first contained in second:   1.00000
second contained in first:   1.00000

number of hashes in first:   200209
number of hashes in second:  200209

number of hashes in common:  200209
only in first:               0
only in second:              0
total (union):               200209
```

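The same check can be done through the Python API by comparing flattened hash sets directly. A minimal sketch, assuming the file names from the commands above; `all_hashes` is a hypothetical helper:

```python
import sourmash

def all_hashes(path, ksize=31):
    """Union of all hashes across every sketch in a collection."""
    hashes = set()
    for ss in sourmash.load_file_as_signatures(path, ksize=ksize):
        hashes.update(ss.minhash.hashes)
    return hashes

orig = all_hashes('podar-ref.zip')
cover = all_hashes('podar-ref.cover.zip')
assert orig == cover          # identical k-mer content
print(len(orig))              # expect 200209, per the overlap output above
```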
ctb commented 2 years ago

I built a GTDB cover at 100k, as well as a GTDB+genbank cover. I was nervous about how slow things would be with a large .zip, so I made an SBT for the gtdb+genbank cover.

Running it against SRR11669366 shows that the SBT of ~600,000 sketches is actually faster than the .zip, although it does use more memory 🎉

| database | time | memory |
| --- | --- | --- |
| gtdb cover (.zip) | 67s | 488 MB |
| gtdb+genbank SBT | 60s | 1727 MB |

ctb commented 2 years ago

The main remaining thing I need to evaluate before actually using these database covers is: how do the results compare when running gather against a cover vs running gather against the full database? I'm getting somewhat different results. My intuition tells me it's probably about the thresholds I'm providing, but I need to dig into this.

ctb commented 2 years ago

@bluegenes made the excellent point that these database covers will no longer work for strain-level identification, since hashes shared between closely related genomes are retained in only one sketch; they will probably still work great for species- and genus-level identification.

taylorreiter commented 2 years ago

I think these databases could be a really great way to increase the speed of charcoal -- I'm testing that now. However, prefetch is failing. I'm using it with sourmash 4.2.3; is there an obvious reason that it would fail? It doesn't appear to be a memory problem, and it fails silently until snakemake records the failed rule.

taylorreiter commented 2 years ago

Another note -- it would be really great if one could select specific genomes to seed from. For example, it would be wonderful to be able to tell the cover to include all of the GTDB reps first, and then add in hashes from other signatures secondarily. That way, well-loved organisms (e.g. E. coli K12, P. aeruginosa PA14) would be the most likely results returned, which I think is helpful when trying to integrate with downstream analysis resources.

ctb commented 2 years ago

> I think these databases could be a really great way to increase the speed of charcoal -- I'm testing that now. However, prefetch is failing. I'm using it with sourmash 4.2.3; is there an obvious reason that it would fail? It doesn't appear to be a memory problem, and it fails silently until snakemake records the failed rule.

yes, I think you want to update to sourmash 4.3.0 so that you get #1866 and #1870. I vaguely recall that those two fixes were connected to this functionality :)

ctb commented 2 years ago

> Another note -- it would be really great if one could select specific genomes to seed from. For example, it would be wonderful to be able to tell the cover to include all of the GTDB reps first, and then add in hashes from other signatures secondarily. That way, well-loved organisms (e.g. E. coli K12, P. aeruginosa PA14) would be the most likely results returned, which I think is helpful when trying to integrate with downstream analysis resources.

should work as is - just specify the genomes or collections with priority first. It should be ok if the databases are redundant (e.g. GTDB genomic-reps first, all GTDB next) because the redundant signatures will be removed.
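Concretely, the greedy loop sketched earlier generalizes by walking collections in priority order, so each hash lands in the first (highest-priority) sketch that contains it. A hypothetical sketch, with made-up file names; this is not the actual make-db-cover.py interface:

```python
import sourmash
from sourmash import SourmashSignature
from sourmash.sourmash_args import SaveSignaturesToLocation

def make_seeded_cover(db_paths, out_path, ksize=31):
    seen = set()
    with SaveSignaturesToLocation(out_path) as save_sigs:
        for path in db_paths:         # priority order: earlier databases win
            for ss in sourmash.load_file_as_signatures(path, ksize=ksize):
                new_hashes = set(ss.minhash.hashes) - seen
                if not new_hashes:
                    continue          # redundant with higher-priority sketches
                seen.update(new_hashes)
                mh = ss.minhash.copy_and_clear()
                mh.add_many(new_hashes)
                save_sigs.add(SourmashSignature(mh, name=ss.name))

# e.g. GTDB genomic reps first, then the full collection:
# make_seeded_cover(['gtdb-reps.zip', 'gtdb-all.zip'], 'seeded-cover.zip')
```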

ctb commented 2 years ago

note that using this for charcoal when searching GTDB against itself might be dangerous - contamination is one of the things that will present weirdly with database covers.

ctb commented 2 years ago

aaaaand a note on the note -

it could be really interesting to store just the differences between sketches in some complicated data structure... I think this is similar to #545 / AllSome or Split SBTs.
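A toy version of that idea over plain hash sets, just to make it concrete; this is entirely hypothetical, quadratic in the number of sketches, and only gestures at what AllSome/Split-SBT-style structures do properly:

```python
def delta_encode(hash_sets):
    """Store each hash set as (parent index, added, removed) against the
    earlier set it overlaps most; the first set is stored whole."""
    stored = []
    for i, hs in enumerate(hash_sets):
        if i == 0:
            stored.append((None, frozenset(hs), frozenset()))
            continue
        parent = max(range(i), key=lambda j: len(hs & hash_sets[j]))
        base = hash_sets[parent]
        stored.append((parent, frozenset(hs - base), frozenset(base - hs)))
    return stored

def delta_decode(stored):
    """Rebuild the original hash sets by replaying diffs; works because a
    parent always precedes its children in the list."""
    sets = []
    for parent, added, removed in stored:
        base = sets[parent] if parent is not None else frozenset()
        sets.append((base | added) - removed)
    return sets
```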

ctb commented 2 years ago

see also https://github.com/sourmash-bio/database-examples/pull/1 for a species-level merged GTDB db.

ctb commented 2 years ago

note, one problem with species-level merges is that they require adhering to a taxonomy, which the database min set cov does not.

curious if we could do something with agglomerative clustering on overlaps of (say) 100kb that would get us there without a taxonomy 🤔

(requiring a taxonomy changes the tenor of sourmash databases, I think, so it's a step to be taken with care)
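To make the clustering idea concrete: a speculative sketch of single-linkage agglomeration, grouping sketches whose estimated overlap (common hashes × scaled) is at least 100kb. The union-find linkage and the all-pairs loop are my assumptions here:

```python
import sourmash

THRESHOLD_BP = 100_000   # the ~100kb overlap cutoff mentioned above

def cluster_by_overlap(db_path, ksize=31):
    sigs = list(sourmash.load_file_as_signatures(db_path, ksize=ksize))
    parent = list(range(len(sigs)))          # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path halving
            i = parent[i]
        return i

    for i in range(len(sigs)):
        for j in range(i + 1, len(sigs)):    # all pairs; fine for a toy
            mh_i, mh_j = sigs[i].minhash, sigs[j].minhash
            est_overlap_bp = mh_i.count_common(mh_j) * mh_i.scaled
            if est_overlap_bp >= THRESHOLD_BP:
                parent[find(i)] = find(j)    # single-linkage merge

    clusters = {}
    for i, ss in enumerate(sigs):
        clusters.setdefault(find(i), []).append(ss)
    return list(clusters.values())
```

Each resulting cluster could then be merged into a single sketch, species-merge style, without ever consulting a taxonomy.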

ctb commented 1 year ago

updates happening to the codebase! some small cleanup; running GTDB reps first; outputting to a directory for use with fastmultigather, per run.sh.

sooner or later I will make this into a plugin, too.