sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

evaluating scaled parameter settings for viral genomes #1340

Open bluegenes opened 3 years ago

bluegenes commented 3 years ago

For the 266,805 genomes in the pigeon 1.0 database:

genome length info: min: 1,609 bp; max: 1,145,228 bp; median: 17,304 bp; mean: 23,902 bp

summary stats for number of hashes per genome sketch:

                     min    max median
 ksize scaled                                   
 21    100            10  11276    173
       200             4   5632     87
       1000            0   1119     18
       2000            0    532      9
       10000           0     95      2
 31    100            16  11458    173
       200             7   5723     87
       1000            0   1175     18
       2000            0    584      9
       10000           0    125      2
 51    100             7  11322    172
       200             3   5663     86
       1000            0   1096     18
       2000            0    547      9
       10000           0    128      2
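As a rough sanity check on the table above: a sequence of length L contains L - k + 1 k-mers, and a scaled sketch is expected to keep about 1/scaled of the distinct ones. The snippet below is a back-of-the-envelope sketch, not sourmash code; `expected_hashes` is a hypothetical helper, and it assumes all k-mers are distinct and that hash values are uniformly distributed. Plugging in the median and minimum genome lengths reported above gives values close to the observed medians (about 173 at scaled=100), though the median hash count is not strictly the hash count of the median-length genome.

```python
# Back-of-the-envelope estimate, not sourmash code.
# Assumes every k-mer is distinct and hashes are uniform, so a scaled
# sketch keeps roughly 1/scaled of the k-mers.
def expected_hashes(length_bp: int, ksize: int, scaled: int) -> float:
    n_kmers = max(length_bp - ksize + 1, 0)
    return n_kmers / scaled


MEDIAN_BP = 17_304  # median genome length reported in this issue
MIN_BP = 1_609      # shortest genome reported in this issue

for scaled in (100, 200, 1000, 2000, 10000):
    med = expected_hashes(MEDIAN_BP, 21, scaled)
    low = expected_hashes(MIN_BP, 21, scaled)
    print(f"scaled={scaled:>5}  median genome: {med:6.1f}  shortest genome: {low:5.1f}")
```

At scaled=10000 the estimate for the shortest genome falls well below one hash, which fits the zero minimums in the table.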

Slack discussion w/ @ctb:

titus: so... basically, I see that for viruses, we should just standardize on scaled=100 for all k-mer sizes.

bluegenes: i would agree! do you have an idea of the min number of hashes required for good matching?

titus: at DNA level? I trust 2, generally set threshold at 3

bluegenes: so might be able to get away with scaled=200

titus: I think factor of 2 improvement in space vs massive increase in certainty... is not worth it :)
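To put a rough number on the "certainty" trade-off in the chat above: a sketch with fewer than 3 hashes can never reach a 3-hash threshold, so one useful quantity is the chance that a genome's sketch ends up that small. The snippet below is a minimal sketch under a Poisson approximation (each k-mer kept independently with probability 1/scaled); `prob_fewer_than` is a hypothetical helper, not part of sourmash, and the model ignores repeated k-mers. It uses the shortest genome length reported in this issue (1,609 bp) at k=31.

```python
import math


def prob_fewer_than(threshold: int, length_bp: int, ksize: int, scaled: int) -> float:
    """P(sketch has fewer than `threshold` hashes), Poisson approximation.

    Assumes each k-mer is kept independently with probability 1/scaled.
    """
    lam = max(length_bp - ksize + 1, 0) / scaled
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(threshold))


# Shortest genome in the database (1,609 bp), k=31, threshold of 3 hashes.
for scaled in (100, 200, 1000):
    p = prob_fewer_than(3, 1_609, 31, scaled)
    print(f"scaled={scaled:>4}: P(< 3 hashes) ~ {p:.3g}")
```

Under these assumptions, the shortest genome almost always clears the threshold at scaled=100, occasionally misses it at scaled=200, and usually misses it at scaled=1000.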

bluegenes commented 3 years ago

In protein space (after translation of these genomes via prodigal):

summary stats for number of hashes per prodigal proteome sketch:

                     min   max median
 ksize scaled                        
 7     100             0  2999     54
       200             0  1523     27
       500             0   600     11
       1000            0   326      6
       2000            0   161      3
 8     100             0  3067     53
       200             0  1505     27
       500             0   599     11
       1000            0   298      6
       2000            0   163      3
 9     100             0  3075     53
       200             0  1535     27
       500             0   624     11
       1000            0   331      6
       2000            0   180      3
 10    100             0  3121     53
       200             0  1500     27
       500             0   599     11
       1000            0   290      6
       2000            0   150      3
 11    100             0  3003     53
       200             0  1538     27
       500             0   619     11
       1000            0   314      5
       2000            0   151      3
 12    100             0  3012     52
       200             0  1532     26
       500             0   603     11
       1000            0   303      5
       2000            0   152      3

Perhaps more interesting:

3 of the 266,805 genomes had zero prodigal-predicted proteins. Of the 266,802 with proteins, here are the numbers of sketches with zero hashes:

Number of sketches with zero hashes:

ksize  scaled
7      100           3
       200           5
       500         178
       1000       3352
       2000      22847
8      100           3
       200           5
       500         153
       1000       3426
       2000      22829
9      100           3
       200           4
       500         132
       1000       3594
       2000      23377
10     100           2
       200           5
       500         153
       1000       3556
       2000      23626
11     100           2
       200           7
       500         163
       1000       3654
       2000      23507
12     100           3
       200           6
       500         157
       1000       3759
       2000      23988
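For intuition on why empty sketches appear at all: a scaled sketch keeps only the k-mers whose hash falls below a cutoff of roughly 2^64 / scaled, so a short proteome can easily have no k-mer under the cutoff at large scaled values. The toy below illustrates this with a stand-in hash (`blake2b`); it is not sourmash's implementation (sourmash uses MurmurHash3), and `toy_sketch`, the random 600-residue "proteome", and the seed are all hypothetical.

```python
import hashlib
import random

MAX_HASH = 2**64


def toy_sketch(seq: str, ksize: int, scaled: int) -> set:
    """Toy scaled sketch: keep k-mer hashes below 2**64 // scaled.

    blake2b is a stand-in hash here; this is not sourmash's implementation.
    """
    cutoff = MAX_HASH // scaled
    kept = set()
    for i in range(len(seq) - ksize + 1):
        kmer = seq[i:i + ksize].encode()
        h = int.from_bytes(hashlib.blake2b(kmer, digest_size=8).digest(), "big")
        if h < cutoff:
            kept.add(h)
    return kept


# Hypothetical short proteome: 600 random residues.
rng = random.Random(0)
proteome = "".join(rng.choices("ACDEFGHIKLMNPQRSTVWY", k=600))

sketches = {s: toy_sketch(proteome, 7, s) for s in (100, 200, 500, 1000, 2000)}
for s, sk in sketches.items():
    print(f"scaled={s:>4}: {len(sk)} hashes")
```

Because the cutoff only shrinks as scaled grows, each larger-scaled sketch is a subset of the smaller-scaled one, so hash counts can only go down and eventually reach zero for short inputs.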

(truncated) proteome length distribution: [image not preserved]

more info: https://github.com/bluegenes/2021-virus-exploration/blob/main/notebooks/001.pigeon1.0-stats.ipynb

ctb commented 3 years ago

cool! so it looks like a scaled of 200 is a pretty good compromise, yah? half the space of 100, with virtually the same number of misses.
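To make the comparison concrete, the zero-hash counts can be expressed as a fraction of the 266,802 genomes with predicted proteins. The snippet below is plain arithmetic on the ksize=7 rows of the zero-hash table in this issue; the variable names are just for illustration.

```python
TOTAL = 266_802  # genomes with at least one prodigal-predicted protein

# Zero-hash sketch counts for ksize=7, copied from the table in this issue.
zero_hash_k7 = {100: 3, 200: 5, 500: 178, 1000: 3352, 2000: 22847}

for scaled, n_zero in zero_hash_k7.items():
    print(f"scaled={scaled:>4}: {n_zero:>6} empty sketches ({n_zero / TOTAL:.4%})")
```

Both scaled=100 and scaled=200 leave well under 0.01% of proteomes with an empty sketch, while scaled=2000 leaves more than 8% empty.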