Hi,
The persistent genes are present in most individuals; however, they can be absent from some of them. The idea behind this is that when you work with a large number of genomes, there can be artefacts or errors (sequencing errors, annotation errors and so on) in some of them, so the number of genes present in strictly all genomes gets lower and lower as you add more genomes to your pangenome.
The persistent genome is computed using a statistical method called NEM, which tries to take this variability in genomic datasets into account. This is why some genes in the persistent genome can be absent from some genomes. If you want to know more about it, the method is described in detail in the PPanGGOLiN paper (https://journals.plos.org/ploscompbiol/article?rev=2&id=10.1371/journal.pcbi.1007732).
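To give a rough feel for why the strict core shrinks, here is a small back-of-the-envelope sketch. It assumes, purely for illustration, that a gene truly present everywhere is missed independently in each genome with a fixed probability (1% below). That number and the function name are made up for the example, and this is a simplification, not the NEM model.

```python
def strict_core_fraction(n_genomes: int, miss_rate: float) -> float:
    """Chance that a truly universal gene is detected in all n_genomes.

    Assumes each genome misses the gene independently with probability
    miss_rate (an illustrative simplification, not the NEM model).
    """
    if n_genomes < 0:
        raise ValueError("n_genomes must be non-negative")
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return (1.0 - miss_rate) ** n_genomes


if __name__ == "__main__":
    # Hypothetical 1% per-genome miss rate, for several pangenome sizes.
    for n in (10, 50, 109, 500):
        print(f"{n:>4} genomes: {strict_core_fraction(n, 0.01):.3f}")
```

Under these assumptions, only roughly a third of the truly universal genes would still be found in every one of 109 genomes, which is why a strict "present everywhere" definition becomes less useful as the dataset grows.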
If you specifically want the genes present in all genomes, you should be looking at the "exact core" partition. The gene families belonging to this partition are listed in "./partitions/exact_core.txt" when you call the following command:
ppanggolin write -p pangenome.h5 --partitions
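If you would like to check presence counts yourself, here is a minimal sketch. It assumes a tab-separated presence/absence table with one row per gene family, one column per genome and 0/1 values. The layout, the function names and the toy data are assumptions for illustration, not a description of the exact files ppanggolin writes.

```python
import csv
import io


def genomes_per_family(table_text):
    """Return (genome names, {family: set of genomes where it is present})."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    header = next(reader)
    genomes = header[1:]  # first column is assumed to hold the family name
    presence = {}
    for row in reader:
        if not row:
            continue
        family, values = row[0], row[1:]
        # Any value other than empty or "0" is treated as "present".
        presence[family] = {
            g for g, v in zip(genomes, values) if v.strip() not in ("", "0")
        }
    return genomes, presence


def exact_core(genomes, presence):
    """Families present in every genome of the table."""
    all_genomes = set(genomes)
    return sorted(f for f, present in presence.items() if present == all_genomes)


# Toy table (hypothetical data): famA is in all genomes, famB and famC are not.
TOY = "Gene\tg1\tg2\tg3\nfamA\t1\t1\t1\nfamB\t1\t0\t1\nfamC\t0\t0\t1\n"

if __name__ == "__main__":
    genomes, presence = genomes_per_family(TOY)
    print("exact core:", exact_core(genomes, presence))
    for family, present in presence.items():
        print(family, len(present), "of", len(genomes), "genomes")
```

The per-family counts are also a quick way to see in how many genomes each persistent family actually occurs.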
Adelme
Hi, thank you for the reply. I am actually using this to create a pangenome of a whole genus, with more than 100 different species. Do you think this algorithm will work for a genus-wide study? Also, I expect there to be very few persistent gene families. The persistent genome will be hard to define here because these are different species, even though they share a lot of similarities. Isn't that true?
Hi,
This algorithm can work for a genus-level study; we've done some pangenomes at the genus level in our lab before (mostly on Acinetobacter, with around 100 species). However, you might want to use a lower threshold for the clustering step. By default, ppanggolin clusters genes with a minimum threshold of 80% identity, which is perfectly fine for species pangenomes but probably too high for genus pangenomes. For a genus you might want to use something like 50% identity or lower, or use an external clustering method.
To change the threshold for the clustering step, you'll need to go through the 'step by step pangenome analysis' and use the --identity option at the cluster step (see https://github.com/labgem/PPanGGOLiN/wiki/PPanGGOLiN---step-by-step-pangenome-analysis#clustering for some documentation).
For the Acinetobacter pangenome we used 50% identity.
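If you are scripting the workflow (for example in a notebook), the clustering call could be wrapped as in the sketch below. It assumes the cluster step takes the same `-p` pangenome file as the write command above and that `--identity` is given as a fraction (0.5 for 50%). Please check `ppanggolin cluster --help` for the exact syntax in your version. The helper name is hypothetical.

```python
import subprocess


def build_cluster_command(pangenome="pangenome.h5", identity=0.5):
    """Build the argument list for the clustering step.

    Assumption: identity is passed as a fraction in (0, 1], e.g. 0.5 for 50%.
    """
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity is expected as a fraction between 0 and 1")
    return ["ppanggolin", "cluster", "-p", pangenome, "--identity", str(identity)]


if __name__ == "__main__":
    cmd = build_cluster_command()
    print(" ".join(cmd))
    # To actually run it (requires ppanggolin on the PATH), uncomment:
    # subprocess.run(cmd, check=True)
```

Printing the command before running it makes it easy to confirm which threshold was really used when the partition counts barely change.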
I can't really tell how hard it will be to find the persistent genome. A persistent genome will be defined in any case, but its content may vary. I expect genes present in most of the genomes in your genus to be in the persistent genome, as long as the clustering goes well.
Adelme
Hi, I changed the identity and the coverage (to 50%) and then regenerated the data. I see almost no change in the persistent, shell, and cloud partitions (the number of persistent gene families went up by just 2). Are there any other parameters that play a role in determining these numbers? Or maybe I am doing something wrong here. Kindly let me know. I am running this in Google Colab, and the whole thing can easily be shared in case you would like to have a look at it. Thank you.
Hi, that seems a bit surprising to me; usually it changes things quite a lot. If it is indeed possible to share the Google Colab so I can take a look, that would be awesome!
Adelme
I think that will be great. Could you please send me your email address at shailabh.rauniyar@mines.sdsmt.edu? Thank you.
Hi, I generated the data using 109 genomes. As per the theory, the persistent genes must be present in all the genomes under study. I got more than 1600 persistent gene families; however, the matrix file shows those persistent genes to be present in only a few genomes. I am unable to explain this. All the genome files used for this were annotated with Prokka. Kindly help. Thank you.