sourmash-bio / sourmash

Quickly search, compare, and analyze genomic and metagenomic data sets.
http://sourmash.readthedocs.io/en/latest/

pangenome investigation results #1548

Open ctb opened 3 years ago

ctb commented 3 years ago

I wrote a script that calculates the distribution of hashes across collections of genomes. One goal is to generate some statistics on how much pangenome overlap there is between genomes. It's a pretty straightforward script!
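For anyone who wants to picture the calculation before the gist lands, here is a minimal sketch of one way to do it. This is not the actual script: the function name `hash_occupancy_distribution` and the toy input are hypothetical, and it assumes each genome has already been reduced to a collection of integer hashes (e.g. from its sketch).

```python
from collections import Counter


def hash_occupancy_distribution(genome_hashes):
    """Return a Counter mapping 'number of genomes a hash occurs in'
    to 'number of distinct hashes with that occurrence count'.

    genome_hashes: iterable of per-genome hash collections (hypothetical input).
    """
    genomes_per_hash = Counter()
    for hashes in genome_hashes:
        # set() ensures a hash is counted at most once per genome
        genomes_per_hash.update(set(hashes))

    # Collapse per-hash counts into the distribution of those counts
    return Counter(genomes_per_hash.values())


# Toy data: three "genomes" with small integer hashes (made up for illustration)
toy_genomes = [{1, 2, 3}, {2, 3, 4}, {3, 5}]
dist = hash_occupancy_distribution(toy_genomes)

# Print most frequent occurrence counts first
for n_genomes, n_hashes in dist.most_common():
    print(n_genomes, n_hashes)
```

Using a `Counter` of a `Counter`'s values keeps it to two passes; on a real database the per-hash counter is the memory-hungry part.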

Script will be provided shortly in a gist, but here are some results.

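On ~48k genomes (the GTDB representatives), from gtdb-rs202.genomic-reps.k31.zip (first column: number of genomes a hash occurs in; second column: number of distinct hashes with that occurrence count) -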

1 135322375
2 5825082
3 1281786
4 493788
5 249478
6 147410
7 95203
8 64869
9 46579
10 34749
11 26406
12 21067
13 16817
14 13705
15 11444
16 9512
17 8090
18 6945
19 6022
20 5255
21 4647
22 4062
23 3605
24 3345
25 2934
26 2645
27 2363
28 2170
29 1970
30 1850
31 1682
32 1555
33 1382
34 1375
35 1167
36 1104
37 1055
38 1013
39 864
40 792
41 746
43 732
42 725
44 638
45 621
47 562
46 542
48 534
49 500
50 460
51 448
53 417
54 416
55 380
52 378
57 326
56 320
58 301
59 297
60 266
62 255
63 246
66 243
61 240
64 230
67 222
65 213
68 200
70 189
73 185
69 184
71 173
72 166
76 164
74 164
77 154
75 143
81 140
...
332 1
329 1
327 1
321 1
319 1
318 1
316 1
315 1
312 1
304 1
303 1
302 1
298 1
294 1
291 1
290 1
285 1

ctb commented 3 years ago

On gtdb-rs202.genomic.k31.zip:

1 113954776
2 26894308
3 10427980
4 5431686
5 3253975
6 2136872
7 1561598
8 1092454
9 864976
10 676923
11 548505
12 446904
13 389882
14 329652
15 275942
16 250355
17 218101
18 211963
19 178064
20 156755
21 138842
22 133062
23 126427
24 120191
25 98476
26 96547
27 86526
28 76934
29 71261
30 65431
31 63900
32 59438
33 56440
35 53494
34 53153
36 44333
37 42850
38 41132
39 39460
...
4219 1
4160 1
4133 1
4128 1
4106 1
4101 1
4064 1
4003 1
3957 1
3953 1
3933 1
3825 1
3702 1
3678 1
3553 1
3416 1
3374 1
3047 1
3010 1
2979 1
2965 1
2943 1
2790 1
2738 1
2473 1

So there are lots of distinct hashes that occur in thousands (!!) of genomes. These are all at a scaled of 1000 and a ksize of 31, so e.g.

1 113954776
2 26894308
3 10427980

means there are an estimated 114 billion 31-mers that show up in only one genome across all of GTDB, 27 billion 31-mers that show up in two genomes, 10 billion that show up in three, etc. (each hash counted here stands in for roughly 1000 k-mers, because of the scaled value of 1000). It falls off pretty quickly - there are only ~677 million k-mers that are in 10 genomes - but still, big implications for GTDB pangenome use!
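The conversion from hash counts to estimated k-mer counts above is just multiplication by the scaled value. A quick sketch (the helper name `estimated_kmers` is made up for illustration; the counts are copied from the table above):

```python
def estimated_kmers(n_hashes, scaled=1000):
    """Estimate the number of distinct k-mers represented by n_hashes
    sketch hashes, assuming ~1 in `scaled` k-mers is retained."""
    return n_hashes * scaled


# (number of genomes, number of distinct hashes) rows from the full GTDB run
rows = [(1, 113954776), (2, 26894308), (3, 10427980), (10, 676923)]

for n_genomes, n_hashes in rows:
    billions = estimated_kmers(n_hashes) / 1e9
    print(f"in {n_genomes} genome(s): ~{billions:.2f} billion 31-mers")
```

These are estimates, not exact counts, since the sketch only samples the hash space.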

ctb commented 3 years ago

Stumbled into a ridiculous pangenome example in https://github.com/dib-lab/genome-grist/issues/53#issuecomment-847022967 - 250k Salmonella genomes, only one of 'em in the sample.