Difference between Persistent, Core (exact, soft), Shell, Cloud, and Accessory (exact, soft) Gene Families

ShuyiF commented 1 day ago

Hi!

Thank you for this amazing tool! I'm really impressed at how fast it completes the task :) I built a pangenome with 150 Vibrio parahaemolyticus isolates, but I am confused about how to interpret the different categories of gene families.

My code: ppanggolin all --anno list.tsv --cpu 8 --output prokka_ppanggolin2 --identity 0.95

And the results (content info) I got:

Content:
    Genes: 695067
    Genomes: 150
    Families: 21456
    Edges: 27556
    Persistent:
        Family_count: 3920
        min_genomes_frequency: 0.89
        max_genomes_frequency: 1.0
        sd_genomes_frequency: 0.01
        mean_genomes_frequency: 1.0
    Shell:
        Family_count: 2392
        min_genomes_frequency: 0.05
        max_genomes_frequency: 0.95
        sd_genomes_frequency: 0.21
        mean_genomes_frequency: 0.21
    Cloud:
        Family_count: 15144
        min_genomes_frequency: 0.01
        max_genomes_frequency: 0.05
        sd_genomes_frequency: 0.01
        mean_genomes_frequency: 0.01

Does this mean the total number of gene families in this pangenome is Families: 21456? It looks like Families = Persistent + Shell + Cloud. But persistent is not the same as core, right? I noticed that the frequency range for persistent gene families is 0.89-1.0, whereas I believe for core it should be 0.95-1.0, considering that shell and cloud are 0.05-0.95 and 0.01-0.05, respectively.

So I looked into the partitions directory that PPanGGOLiN generated and counted the lines of each txt file to get the number of genes in each category. I got the results below:

I'm wondering what the threshold is for each category, and how to calculate the TOTAL family count in this pangenome.

I guess total genes = exact core + exact accessory = soft core + soft accessory = core + shell + cloud? BUT why is shell + cloud equal to neither exact accessory nor soft accessory? How should I interpret this?

Also, is there any way to set the threshold for core, shell, and cloud? I'm confused because when I change my --identity from 0.95 to 0.90, the threshold for shell changes from 0.05-0.95 to 0.05-0.91. Shouldn't --identity only be used to set the "Minimal identity percent for two proteins to be in the same cluster"?

Below is the content info for --identity 0.90:

Content:
    Genes: 695067
    Genomes: 150
    Families: 19512
    Edges: 25428
    Persistent:
        Family_count: 3959
        min_genomes_frequency: 0.89
        max_genomes_frequency: 1.0
        sd_genomes_frequency: 0.01
        mean_genomes_frequency: 1.0
    Shell:
        Family_count: 2146
        min_genomes_frequency: 0.05
        max_genomes_frequency: 0.91
        sd_genomes_frequency: 0.21
        mean_genomes_frequency: 0.22
    Cloud:
        Family_count: 13407
        min_genomes_frequency: 0.01
        max_genomes_frequency: 0.05
        sd_genomes_frequency: 0.01
        mean_genomes_frequency: 0.01

Thank you in advance and I look forward to your reply! :)

axbazin commented 22 hours ago

Hello, thank you for your kind words!

There were quite a few questions, and I will try to answer all of them. First of all, about the number of families and the number in each category:

The number of families is indeed 21456 in your pangenome. Within ppanggolin, you have access to 3 kinds of partitioning:

1. "persistent - shell - cloud" partitioning, which is the one we recommend for most uses. This partitioning uses a statistical model (described in the original ppanggolin article), making it much more robust to sources of variability such as population structure, errors, and potential artefacts coming from genome incompleteness if you work on MAGs/SAGs or Illumina-only genomes. This statistical model has 2 big advantages: it does not use a hardcoded threshold like the more usual "soft core/soft accessory" approach, and it adapts to the dataset it is given.
   - The "persistent" corresponds to gene families present in 'most' genomes. These are the gene families that define the species you are working on. It improves on the "core" or "soft-core" definition because it is much more stable when subsampled: it should remain roughly the same no matter which genomes are in your dataset.
   - The "shell" corresponds to gene families present in some genomes, potentially in subgroups or subpopulations of your dataset. If your pangenome is strongly structured, you will often find the gene families that are specific to each subpopulation in that partition.
   - The "cloud" corresponds to the rare gene families, present only in a minority of genomes.
2. "exact core - exact accessory", which can be useful for some applications such as computing phylogenetic trees using exact core gene markers. "Exact core" corresponds to gene families present in strictly 100% of your genomes, and the exact accessory is all the rest.
3. "soft core - soft accessory", which is the 'usual' approach in a lot of other pangenomic tools. It is there for comparison purposes, but we consider it rarely useful. By default it uses the 1.0-0.95 threshold that you mentioned, but this can be customized freely. Every family present in over 95% of your genomes will be in the soft core, and the rest will be in the soft accessory.

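Those 3 types of partitioning are completely independent of each other, which is why they don't add up across types. Within each type, the categories do sum to the total number of families: for example, 3920 + 2392 + 15144 = 21456 for persistent, shell, and cloud.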
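
To make the independence of the partitionings concrete, here is a minimal Python sketch. It is not PPanGGOLiN code: the function name `frequency_partitions`, the toy family counts, and the inclusive `>=` comparison at the soft-core boundary are assumptions made for illustration. It shows that the exact and soft schemes each cover every family on their own, and it checks the persistent/shell/cloud counts reported in the content info for --identity 0.95.

```python
def frequency_partitions(family_counts, n_genomes, soft_threshold=0.95):
    """Split families into exact and soft core/accessory by genome frequency.

    family_counts maps a (hypothetical) family name to the number of genomes
    that contain it.
    """
    exact = {"core": [], "accessory": []}
    soft = {"core": [], "accessory": []}
    for family, n_present in family_counts.items():
        # Exact core: present in strictly 100% of genomes.
        exact["core" if n_present == n_genomes else "accessory"].append(family)
        # Soft core: frequency at or above the threshold (the boundary
        # handling is an assumption of this sketch).
        frequency = n_present / n_genomes
        soft["core" if frequency >= soft_threshold else "accessory"].append(family)
    return exact, soft


# Toy data: 5 made-up families in 40 genomes.
toy_counts = {"famA": 40, "famB": 39, "famC": 36, "famD": 20, "famE": 1}
exact, soft = frequency_partitions(toy_counts, n_genomes=40)

# Each scheme assigns every family exactly once.
total = len(toy_counts)
assert len(exact["core"]) + len(exact["accessory"]) == total
assert len(soft["core"]) + len(soft["accessory"]) == total

# Counts reported in the content info for --identity 0.95.
persistent, shell, cloud = 3920, 2392, 15144
assert persistent + shell + cloud == 21456

print(exact["core"], soft["core"])
```

In the toy data, famB (39 of 40 genomes) lands in the soft core but not in the exact core, so the "accessory" sets of the two schemes already differ from each other, and neither has any reason to match shell + cloud.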

About the identity threshold: it does exactly what you wrote, it sets the "Minimal identity percent for two proteins to be in the same cluster", and nothing else. The shell 'limits' of gene family frequency can and will change independently of that threshold.
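
One way to read the frequency 'limits' in the content info, offered here as an interpretation rather than a description of PPanGGOLiN's internals, is that they are summary statistics computed after the families have been partitioned, not cut-offs given to the model. The sketch below uses made-up assignments and a hypothetical helper, `frequency_ranges`, to show how such ranges follow from whichever families end up in each partition.

```python
def frequency_ranges(assignments):
    """Return {partition: (min_frequency, max_frequency)}.

    assignments is a list of (partition, genome_frequency) pairs, standing in
    for the output of a partitioning run.
    """
    ranges = {}
    for partition, frequency in assignments:
        low, high = ranges.get(partition, (frequency, frequency))
        ranges[partition] = (min(low, frequency), max(high, frequency))
    return ranges


# Two made-up runs: the labels are inputs here, as if a model had produced them.
run_a = [("persistent", 1.00), ("persistent", 0.89), ("shell", 0.95),
         ("shell", 0.05), ("cloud", 0.01)]
run_b = [("persistent", 1.00), ("persistent", 0.89), ("shell", 0.91),
         ("shell", 0.05), ("cloud", 0.01)]

print(frequency_ranges(run_a)["shell"])
print(frequency_ranges(run_b)["shell"])
```

Under this reading, a different set of families (for instance after reclustering at another identity) can shift the observed shell range without any threshold having been changed.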

To further clarify the statistical model: it does not set a threshold. It is not a statistical model like mOTUpan or micropan, which work on frequency. It uses 2 kinds of information: the pattern of presence (or absence) of gene families, through a Bernoulli Mixture Model, and the genome organisation, through a Markov Random Field.

To give a simplified numerical example: a gene family that is slightly "less" present (for example, present in 90% of genomes) but surrounded by gene families that are present in most genomes (e.g. 100%) is more likely to be included in the persistent genome. Conversely, a gene family present in 90% of genomes and surrounded by gene families present in 50% of genomes will more often be found in the "shell" genome. (The numbers are only given to illustrate how the method works.)
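
The intuition in this example can be mimicked with a toy scoring rule. This is emphatically not the Bernoulli Mixture Model or the Markov Random Field used by ppanggolin: the prototype frequencies, the `beta` weight, and the function `toy_label` are all hypothetical, and the real model works on presence/absence patterns rather than on frequencies. The sketch only shows how neighbor context can tip the same 90% family either way.

```python
# Hypothetical "typical" frequencies for each partition (illustration only).
PROTOTYPES = {"persistent": 1.0, "shell": 0.5, "cloud": 0.02}


def toy_label(own_frequency, neighbor_labels, beta=0.5):
    """Pick the label with the lowest toy cost.

    The cost combines how far the family's own frequency is from a label's
    prototype with a bonus for agreeing with the labels of its neighbors in
    the genome graph. beta controls how much the neighbors matter.
    """
    def cost(label):
        data_term = abs(own_frequency - PROTOTYPES[label])
        agreement = sum(1 for n in neighbor_labels if n == label) / len(neighbor_labels)
        return data_term - beta * agreement

    return min(PROTOTYPES, key=cost)


# Same family frequency (90%), two different neighborhoods.
print(toy_label(0.9, ["persistent", "persistent"]))
print(toy_label(0.9, ["shell", "shell"]))
```

With persistent neighbors the 90% family is labeled persistent, and with shell neighbors it is labeled shell, even though its own frequency is identical in both calls.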

I hope that this clarifies everything! If it is not enough and you want to deepen your understanding of the ppanggolin statistical model, it is described in full in this article: https://doi.org/10.1371/journal.pcbi.1009687

Adelme

ShuyiF commented 11 hours ago

Hi Adelme,

Thank you so much for your prompt reply. It helps a lot and I truly appreciate it!

I have one follow-up question:

You mentioned that "The shell 'limits' of gene family frequency can and will change independently of that --identity". Then why did the shell frequency threshold change after I changed my --identity from 0.90 to 0.95? I'm confused. What affects the shell frequency threshold? Only the inputs?

Thank you!

labgem / PPanGGOLiN

Difference between Persistent, Core (exact, soft), Shell, Cloud, and Accessory (exact, soft) Gene Families #288