labgem / PPanGGOLiN

Build a partitioned pangenome graph from microbial genomes
https://ppanggolin.readthedocs.io

Difference between Persistent, Core (exact, soft), Shell, Cloud, and Accessory (exact, soft) Gene Families #288

Open ShuyiF opened 1 day ago

ShuyiF commented 1 day ago

Hi!

Thank you for this amazing tool! I'm really impressed by how fast it completes the task :) I built a pangenome from 150 Vibrio parahaemolyticus isolates, but I'm confused about how to interpret the different categories of families.

My code: ppanggolin all --anno list.tsv --cpu 8 --output prokka_ppanggolin2 --identity 0.95

And the results (content info) I got:

Content:
    Genes: 695067
    Genomes: 150
    Families: 21456
    Edges: 27556
    Persistent:
        Family_count: 3920
        min_genomes_frequency: 0.89
        max_genomes_frequency: 1.0
        sd_genomes_frequency: 0.01
        mean_genomes_frequency: 1.0
    Shell:
        Family_count: 2392
        min_genomes_frequency: 0.05
        max_genomes_frequency: 0.95
        sd_genomes_frequency: 0.21
        mean_genomes_frequency: 0.21
    Cloud:
        Family_count: 15144
        min_genomes_frequency: 0.01
        max_genomes_frequency: 0.05
        sd_genomes_frequency: 0.01
        mean_genomes_frequency: 0.01

Does this mean the total number of gene families in this pangenome is 21456 (Families: 21456)? It looks like Families = Persistent + Shell + Cloud. But persistent is not the same as core, right? I noticed that the frequency range for persistent gene families is 0.89-1.0, whereas I would expect 0.95-1.0 for core, given that shell and cloud are 0.05-0.95 and 0.01-0.05, respectively.
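The sum can be checked directly from the figures quoted in this post. A minimal sketch in Python; the numbers are copied from the two `Content` outputs in this thread, and the variable names are only illustrative:

```python
# Family counts copied from the two "Content" outputs quoted in this thread.
runs = {
    "identity_0.95": {"families": 21456, "persistent": 3920, "shell": 2392, "cloud": 15144},
    "identity_0.90": {"families": 19512, "persistent": 3959, "shell": 2146, "cloud": 13407},
}

partition_sums = {}
for name, counts in runs.items():
    # Sum the three partition counts and compare with the reported total.
    partition_sums[name] = counts["persistent"] + counts["shell"] + counts["cloud"]
    print(name, partition_sums[name], partition_sums[name] == counts["families"])
```

In both runs the three partition counts add up to the reported `Families` value.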

So I looked into the partitions directory that PPanGGOLiN generated. I counted the lines of each txt file to get the number of genes in each category, and got the results below:

[Image: Screenshot 2024-09-26 at 4 32 54 PM]

I'm wondering what the threshold is for each category, and how to calculate the TOTAL family count in this pangenome?

I guess total families = exact core + exact accessory = soft core + soft accessory = core + shell + cloud? BUT why is shell + cloud equal to neither exact accessory nor soft accessory? How should I interpret this?

Also, is there any way to set the thresholds for core, shell, and cloud? I'm confused because when I change --identity from 0.95 to 0.90, the range for shell changes from 0.05-0.95 to 0.05-0.91. Shouldn't --identity only be used to set the "Minimal identity percent for two proteins to be in the same cluster"?

Below is the content info for --identity 0.90:

Content:
    Genes: 695067
    Genomes: 150
    Families: 19512
    Edges: 25428
    Persistent:
        Family_count: 3959
        min_genomes_frequency: 0.89
        max_genomes_frequency: 1.0
        sd_genomes_frequency: 0.01
        mean_genomes_frequency: 1.0
    Shell:
        Family_count: 2146
        min_genomes_frequency: 0.05
        max_genomes_frequency: 0.91
        sd_genomes_frequency: 0.21
        mean_genomes_frequency: 0.22
    Cloud:
        Family_count: 13407
        min_genomes_frequency: 0.01
        max_genomes_frequency: 0.05
        sd_genomes_frequency: 0.01
        mean_genomes_frequency: 0.01

Thank you in advance and I look forward to your reply! :)

axbazin commented 22 hours ago

Hello, thank you for your kind words !

There were quite a few questions; I will try to answer all of them. First of all, about the number of families and the numbers in each category:

The number of families is indeed 21456 in your pangenome. Within ppanggolin, you have access to 3 kinds of partitioning: persistent, shell and cloud; exact core and exact accessory; soft core and soft accessory.

Those 3 types of partitioning are all completely independent of each other, which is why they don't add up.
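To make the independence concrete, here is a minimal sketch in Python that assumes exact core means "present in every genome" and soft core means "present in at least a given fraction of genomes". The 0.95 cutoff, the family names, the counts and the model partitions below are all hypothetical toy values, not taken from the pangenome in this issue:

```python
def core_split(genome_counts, n_genomes, soft_core_threshold=0.95):
    """Split families into exact/soft core and accessory sets by frequency alone."""
    all_families = set(genome_counts)
    exact_core = {f for f, n in genome_counts.items() if n == n_genomes}
    soft_core = {f for f, n in genome_counts.items()
                 if n / n_genomes >= soft_core_threshold}
    return {
        "exact_core": exact_core,
        "exact_accessory": all_families - exact_core,
        "soft_core": soft_core,
        "soft_accessory": all_families - soft_core,
    }

# Hypothetical toy data: family -> number of genomes (out of 150) that carry it.
genome_counts = {"famA": 150, "famB": 147, "famC": 135, "famD": 60, "famE": 2}

# Hypothetical model output for the same families; the frequency-based split
# above never looks at it.
model_partition = {"famA": "persistent", "famB": "persistent",
                   "famC": "persistent", "famD": "shell", "famE": "cloud"}

sets = core_split(genome_counts, n_genomes=150)
shell_plus_cloud = {f for f, p in model_partition.items() if p in ("shell", "cloud")}

for name, members in sets.items():
    print(name, len(members))
print("shell + cloud", len(shell_plus_cloud))
```

In this toy case each pair (exact core + exact accessory, soft core + soft accessory, persistent + shell + cloud) covers all five families, yet shell + cloud (2 families) matches neither exact accessory (4) nor soft accessory (3).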

About the identity threshold, indeed it does what you wrote down, it sets the "Minimal identity percent for two proteins to be in the same cluster", and nothing else. The shell 'limits' of gene family frequency can and will change independently of that threshold.
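One way to read `min_genomes_frequency` and `max_genomes_frequency` is as summaries of whichever families ended up in a partition, rather than as cutoffs that were set beforehand. The sketch below assumes that reading; the two "runs", their family names and their frequencies are hypothetical and only stand in for two different clusterings of the same genomes:

```python
from collections import defaultdict

def frequency_range_by_partition(family_freq, family_partition):
    """Return {partition: (min_freq, max_freq)} observed among its families."""
    grouped = defaultdict(list)
    for family, freq in family_freq.items():
        grouped[family_partition[family]].append(freq)
    return {part: (min(freqs), max(freqs)) for part, freqs in grouped.items()}

# Hypothetical run A: one clustering of the proteins into families.
run_a_freq = {"f1": 1.00, "f2": 0.95, "f3": 0.40, "f4": 0.05, "f5": 0.01}
run_a_part = {"f1": "persistent", "f2": "shell", "f3": "shell",
              "f4": "shell", "f5": "cloud"}

# Hypothetical run B: a different clustering gives different families,
# so the most frequent shell family happens to sit at 0.91 instead of 0.95.
run_b_freq = {"g1": 1.00, "g2": 0.91, "g3": 0.40, "g4": 0.05, "g5": 0.01}
run_b_part = {"g1": "persistent", "g2": "shell", "g3": "shell",
              "g4": "shell", "g5": "cloud"}

range_a = frequency_range_by_partition(run_a_freq, run_a_part)
range_b = frequency_range_by_partition(run_b_freq, run_b_part)
print("run A shell range:", range_a["shell"])
print("run B shell range:", range_b["shell"])
```

Under this reading, a different set of families after re-clustering is enough to move the reported shell range, without any threshold having been changed.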

To further clarify the statistical model: it does not set a threshold. This is not a statistical model like mOTUpan or micropan, which work on frequency. It uses two sources of information: the pattern of presence (or absence) of gene families, through a Bernoulli Mixture Model, and the genome organisation, through a Markov Random Field.

To give a simplified numerical example: gene families slightly "less" present (for example, present in 90% of genomes) but surrounded by gene families that are present in most genomes (e.g. 100%), are more inclined to be included in the persistent genome. On the contrary, a gene family present in 90% of genomes and surrounded by gene families present in 50% of genomes will be more often found in the "shell" genome. (Numbers are only given to picture how the method works).
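The direction of this effect can be pictured with a toy calculation. This is not the Bernoulli Mixture Model or the Markov Random Field used by PPanGGOLiN; the function `toy_context_score` and its `neighbor_weight` parameter are hypothetical, and the score is just a weighted average of a family's own frequency and the mean frequency of its neighbours:

```python
def toy_context_score(own_freq, neighbor_freqs, neighbor_weight=0.5):
    """Blend a family's own frequency with the mean frequency of its neighbours.

    Purely illustrative: a hypothetical stand-in for "neighbours pull a family
    toward their own partition", not PPanGGOLiN's actual model.
    """
    neighbor_mean = sum(neighbor_freqs) / len(neighbor_freqs)
    return (1 - neighbor_weight) * own_freq + neighbor_weight * neighbor_mean

# Same family frequency (90% of genomes), two different genomic contexts.
in_conserved_context = toy_context_score(0.90, [1.0, 1.0])  # neighbours in ~100% of genomes
in_variable_context = toy_context_score(0.90, [0.5, 0.5])   # neighbours in ~50% of genomes

print("conserved context:", round(in_conserved_context, 2))
print("variable context:", round(in_variable_context, 2))
```

The same 90% family scores higher next to near-universal neighbours than next to 50% neighbours, which mirrors why it would lean toward persistent in the first context and toward shell in the second.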

I hope that this clarifies everything ! In case it is not enough, if you want to deepen your understanding of the ppanggolin statistical model, it is entirely described in this article: https://doi.org/10.1371/journal.pcbi.1009687

Adelme

ShuyiF commented 11 hours ago

Hi Adelme,

Thank you so much for your prompt reply. It helps a lot and I truly appreciate it!

I have one follow-up question:

You mentioned that the shell 'limits' of gene family frequency "can and will change independently of" the --identity threshold. Then why did the shell frequency threshold change after I changed --identity from 0.90 to 0.95? I'm confused. What affects the shell frequency threshold? Only the inputs?

Thank you!