ctb opened this issue 3 years ago (status: Open)
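On gtdb-rs202.genomic.k31.zip (first column: number of genomes a hash appears in; second column: number of distinct hashes with that count):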
1 113954776
2 26894308
3 10427980
4 5431686
5 3253975
6 2136872
7 1561598
8 1092454
9 864976
10 676923
11 548505
12 446904
13 389882
14 329652
15 275942
16 250355
17 218101
18 211963
19 178064
20 156755
21 138842
22 133062
23 126427
24 120191
25 98476
26 96547
27 86526
28 76934
29 71261
30 65431
31 63900
32 59438
33 56440
35 53494
34 53153
36 44333
37 42850
38 41132
39 39460
...
4219 1
4160 1
4133 1
4128 1
4106 1
4101 1
4064 1
4003 1
3957 1
3953 1
3933 1
3825 1
3702 1
3678 1
3553 1
3416 1
3374 1
3047 1
3010 1
2979 1
2965 1
2943 1
2790 1
2738 1
2473 1
So there are lots of distinct hashes that each show up in thousands (!!) of genomes. These are all at a scaled value of 1000 and a ksize of 31, so e.g.
1 113954776
2 26894308
3 10427980
means there are an estimated 114 billion 31-mers that show up in only one genome across all of GTDB, 27 billion 31-mers that show up in two genomes, 10 billion that show up in three, etc. (each hash count is multiplied by the scaled value of 1000 to get the k-mer estimate). It falls off pretty quickly (there are only ~677 million k-mers that are in 10 genomes), but still, this has big implications for pangenome use of GTDB!
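To make the arithmetic explicit, here is a minimal sketch of how the hash counts translate into k-mer estimates. It assumes the usual reading of a scaled value of 1000, namely that each retained hash stands in for roughly 1000 distinct k-mers; the function name `estimated_kmers` is made up for illustration, and the rows are copied from the output above.

```python
SCALED = 1000  # scaled value stated in this thread

# (number of genomes, number of distinct hashes) rows copied from the output above
rows = [(1, 113954776), (2, 26894308), (3, 10427980), (10, 676923)]


def estimated_kmers(n_hashes, scaled=SCALED):
    """Rough estimate of distinct k-mers represented by n_hashes sampled hashes.

    Assumption: each hash kept at this scaled value represents about
    `scaled` distinct k-mers.
    """
    return n_hashes * scaled


for n_genomes, n_hashes in rows:
    # .3g keeps three significant digits, enough for "billions vs. millions"
    print(f"in {n_genomes} genome(s): ~{estimated_kmers(n_hashes):.3g} distinct 31-mers")
```

These are estimates only; the sampling noise matters most for the rare, high-genome-count hashes at the tail of the distribution.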
Stumbled into a ridiculous pangenome example in https://github.com/dib-lab/genome-grist/issues/53#issuecomment-847022967: 250k Salmonella genomes, only one of 'em in the sample.
I wrote a script that calculates the distribution of hashes across collections of genomes. One goal is to generate some statistics on the pangenome overlap between genomes. It's a pretty straightforward script!
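This is not the actual script, but a minimal sketch of one way such a distribution could be computed. It assumes each genome has already been reduced to a set of integer hashes (loading the sketches from the zip file is left out), and the names `hash_occurrence_distribution` and `toy_genomes` are hypothetical.

```python
from collections import Counter


def hash_occurrence_distribution(genome_hashes):
    """Return a Counter mapping 'number of genomes containing a hash'
    to 'number of distinct hashes with that count'.

    genome_hashes: iterable of collections of integer hashes, one per genome.
    """
    genomes_per_hash = Counter()
    for hashes in genome_hashes:
        # set() ensures a hash is counted at most once per genome
        genomes_per_hash.update(set(hashes))
    # Histogram over the per-hash genome counts
    return Counter(genomes_per_hash.values())


# Toy data standing in for real sketches: three "genomes" with made-up hashes
toy_genomes = [
    {1, 2, 3, 4},
    {3, 4, 5},
    {4, 6},
]

distribution = hash_occurrence_distribution(toy_genomes)

# Print in the same two-column layout as the results in this thread,
# most frequent category first
for n_genomes, n_hashes in distribution.most_common():
    print(n_genomes, n_hashes)
```

Holding one counter entry per distinct hash is simple but memory-hungry; at the scale of all of GTDB a real script might need something more careful.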
Script will be provided shortly in a gist, but here are some results.
On ~48k genomes (the GTDB representatives), from gtdb-rs202.genomic-reps.k31.zip: