Open bluegenes opened 3 years ago
In protein space (after translation of these genomes via prodigal):
summary stats for number of hashes per prodigal proteome sketch:
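| ksize | scaled | min | max | median |
|---|---|---|---|---|
| 7 | 100 | 0 | 2999 | 54 |
| 7 | 200 | 0 | 1523 | 27 |
| 7 | 500 | 0 | 600 | 11 |
| 7 | 1000 | 0 | 326 | 6 |
| 7 | 2000 | 0 | 161 | 3 |
| 8 | 100 | 0 | 3067 | 53 |
| 8 | 200 | 0 | 1505 | 27 |
| 8 | 500 | 0 | 599 | 11 |
| 8 | 1000 | 0 | 298 | 6 |
| 8 | 2000 | 0 | 163 | 3 |
| 9 | 100 | 0 | 3075 | 53 |
| 9 | 200 | 0 | 1535 | 27 |
| 9 | 500 | 0 | 624 | 11 |
| 9 | 1000 | 0 | 331 | 6 |
| 9 | 2000 | 0 | 180 | 3 |
| 10 | 100 | 0 | 3121 | 53 |
| 10 | 200 | 0 | 1500 | 27 |
| 10 | 500 | 0 | 599 | 11 |
| 10 | 1000 | 0 | 290 | 6 |
| 10 | 2000 | 0 | 150 | 3 |
| 11 | 100 | 0 | 3003 | 53 |
| 11 | 200 | 0 | 1538 | 27 |
| 11 | 500 | 0 | 619 | 11 |
| 11 | 1000 | 0 | 314 | 5 |
| 11 | 2000 | 0 | 151 | 3 |
| 12 | 100 | 0 | 3012 | 52 |
| 12 | 200 | 0 | 1532 | 26 |
| 12 | 500 | 0 | 603 | 11 |
| 12 | 1000 | 0 | 303 | 5 |
| 12 | 2000 | 0 | 152 | 3 |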
Perhaps more interesting:
3 genomes of 266805 had zero prodigal predicted proteins. Of the 266802 with proteins, here are the numbers with zero hashes:
Number of sketches with zero hashes:
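| ksize | scaled | sketches with zero hashes |
|---|---|---|
| 7 | 100 | 3 |
| 7 | 200 | 5 |
| 7 | 500 | 178 |
| 7 | 1000 | 3352 |
| 7 | 2000 | 22847 |
| 8 | 100 | 3 |
| 8 | 200 | 5 |
| 8 | 500 | 153 |
| 8 | 1000 | 3426 |
| 8 | 2000 | 22829 |
| 9 | 100 | 3 |
| 9 | 200 | 4 |
| 9 | 500 | 132 |
| 9 | 1000 | 3594 |
| 9 | 2000 | 23377 |
| 10 | 100 | 2 |
| 10 | 200 | 5 |
| 10 | 500 | 153 |
| 10 | 1000 | 3556 |
| 10 | 2000 | 23626 |
| 11 | 100 | 2 |
| 11 | 200 | 7 |
| 11 | 500 | 163 |
| 11 | 1000 | 3654 |
| 11 | 2000 | 23507 |
| 12 | 100 | 3 |
| 12 | 200 | 6 |
| 12 | 500 | 157 |
| 12 | 1000 | 3759 |
| 12 | 2000 | 23988 |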
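
To make the zero-hash counts above easier to compare across `scaled` values, here is a small sketch that converts them into percentages of the 266802 proteomes that had proteins. The counts are copied from the table above; the names `ZERO_HASH_COUNTS` and `zero_hash_fraction` are made up for this illustration and are not part of any notebook or library.

```python
# Zero-hash counts copied from the table above: ksize -> {scaled: count}.
TOTAL_WITH_PROTEINS = 266_802  # genomes with at least one predicted protein

ZERO_HASH_COUNTS = {
    7:  {100: 3, 200: 5, 500: 178, 1000: 3352, 2000: 22847},
    8:  {100: 3, 200: 5, 500: 153, 1000: 3426, 2000: 22829},
    9:  {100: 3, 200: 4, 500: 132, 1000: 3594, 2000: 23377},
    10: {100: 2, 200: 5, 500: 153, 1000: 3556, 2000: 23626},
    11: {100: 2, 200: 7, 500: 163, 1000: 3654, 2000: 23507},
    12: {100: 3, 200: 6, 500: 157, 1000: 3759, 2000: 23988},
}


def zero_hash_fraction(ksize: int, scaled: int) -> float:
    """Fraction of proteome sketches that ended up with no hashes at all."""
    return ZERO_HASH_COUNTS[ksize][scaled] / TOTAL_WITH_PROTEINS


if __name__ == "__main__":
    for ksize in sorted(ZERO_HASH_COUNTS):
        cells = ", ".join(
            f"scaled={scaled}: {100 * zero_hash_fraction(ksize, scaled):.4f}%"
            for scaled in sorted(ZERO_HASH_COUNTS[ksize])
        )
        print(f"k={ksize}: {cells}")
```

Read this way, `scaled=100` and `scaled=200` both lose only a handful of sketches out of 266802, while `scaled=2000` loses several percent of them.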
(truncated) proteome length distribution:
more info: https://github.com/bluegenes/2021-virus-exploration/blob/main/notebooks/001.pigeon1.0-stats.ipynb
cool! so it looks like a scaled of 200 is a pretty good compromise, yah? half the space of 100, with virtually the same number of misses.
For the 266,805 genomes in the pigeon 1.0 database:
genome length info: min: 1,609 bp max: 1,145,228 bp median: 17,304 bp mean: 23,902 bp
summary stats for number of hashes per genome sketch:
slack discussion w/ @ctb -
titus:speech_balloon: so… basically, I see that for viruses, we should just standardize on scaled=100 for all k-mer sizes.
bluegenes:feet: i would agree! do you have an idea of the min number of hashes required for good matching?
titus:speech_balloon: at DNA level? I trust 2, generally set threshold at 3
bluegenes:feet: so might be able to get away with scaled=200
titus:speech_balloon: I think factor of 2 improvement in space vs massive increase in certainty… is not worth it :slightly_smiling_face:
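
A rough back-of-the-envelope sketch of the trade-off discussed above, at the DNA level. It is only a model, not something run on the actual sketches, and it rests on several assumptions: each distinct k-mer is kept independently with probability 1/scaled (so the hash count is binomial), all k-mers in a genome are distinct, and the DNA k-mer size is 31 (a hypothetical value; the thread does not state one). The genome lengths are the min and median quoted earlier, and the threshold of 3 hashes is the one mentioned in the Slack exchange. The function names are made up for this illustration.

```python
from math import comb

KSIZE = 31      # ASSUMPTION: DNA k-mer size, not stated in the thread
THRESHOLD = 3   # minimum number of hashes mentioned in the discussion


def num_kmers(length_bp: int, ksize: int) -> int:
    """Number of k-mers in a linear sequence (assumes all are distinct)."""
    return max(length_bp - ksize + 1, 0)


def expected_hashes(n_kmers: int, scaled: int) -> float:
    """Expected sketch size when each k-mer is kept with probability 1/scaled."""
    return n_kmers / scaled


def prob_fewer_than(n_kmers: int, scaled: int, threshold: int) -> float:
    """Binomial probability that a sketch has fewer than `threshold` hashes."""
    p = 1.0 / scaled
    return sum(
        comb(n_kmers, i) * p**i * (1.0 - p) ** (n_kmers - i)
        for i in range(threshold)
    )


if __name__ == "__main__":
    # Genome lengths quoted in the thread for the pigeon 1.0 database.
    for label, length_bp in [("min", 1_609), ("median", 17_304)]:
        n = num_kmers(length_bp, KSIZE)
        for scaled in (100, 200):
            print(
                f"{label} genome ({length_bp} bp), scaled={scaled}: "
                f"expected hashes ~{expected_hashes(n, scaled):.1f}, "
                f"P(fewer than {THRESHOLD}) ~{prob_fewer_than(n, scaled, THRESHOLD):.2e}"
            )
```

Under this model the difference between `scaled=100` and `scaled=200` matters mostly for the shortest genomes, where halving the expected number of hashes noticeably raises the chance of falling below the threshold.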