labgem / PPanGGOLiN

Build a partitioned pangenome graph from microbial genomes
https://ppanggolin.readthedocs.io

Any Tips on Making a Pangenome on Genetically Disparate Species? #67

Closed joshuakirsch closed 2 years ago

joshuakirsch commented 3 years ago

Hi folks,

Thanks for making this easy-to-follow tool. I have a collection of genomes from the same family and I would like to make a pangenome out of them. However, these genomes are quite different from one another, and the software is unable to finish the partitioning step of pangenome creation. I keep getting this error:

Exception: Statistical partitionning does not work on your data. This usually happens because you used very few (<15) genomes.

Any tips on how to overcome this? Is this even possible? One of the endpoints I'd like to get to is the presence/absence table.

axbazin commented 3 years ago

Hi,

I've had this error quite a few times in the past. It is possible to overcome it, but how depends on why it is happening.

How many genomes do you have exactly? And how many clusters do you get at the end of the clustering step? If you do not have this last piece of information, you can get it by running:

ppanggolin info -p pangenome.h5 --content

Run this on the pangenome.h5 file that should have been generated.

For very distantly related genomes (i.e. same genus or same family), I would recommend lowering the identity threshold for the clustering step, as the default thresholds are set for relatively closely related genomes. It is likely that your clusters are very sparse, in which case your presence/absence table might not be so useful.

For that, you will need to run the tool 'step by step' rather than using the 'workflow' or 'panrgp' commands. For the clustering step, you can check this page: https://github.com/labgem/PPanGGOLiN/wiki/PPanGGOLiN---step-by-step-pangenome-analysis#clustering You'll need to use a command like this on your current pangenome:

ppanggolin cluster -p pangenome.h5 --identity 0.5

That is the command to use if, for example, you wish to build clusters with an identity threshold of 50%. For genomes belonging to the same family (as in, the taxonomic rank), that might still be a bit too high, though.
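To tie the two steps together, here is a minimal sketch that re-clusters at a lower identity threshold and then re-checks the cluster count. It only uses the commands and flags quoted in this thread; the `PANGENOME`, `IDENTITY` and `DRY_RUN` variables and the `run` helper are made up for illustration and are not PPanGGOLiN options. By default it only prints the commands, so you can review them before running anything.

```shell
#!/bin/sh
# Illustrative variables (not PPanGGOLiN options); override them from the environment.
PANGENOME="${PANGENOME:-pangenome.h5}"
IDENTITY="${IDENTITY:-0.5}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only, 0 = actually run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Re-cluster at the chosen identity threshold, then report the pangenome content
# so the new number of clusters can be compared with the previous one.
recluster_and_check() {
    run ppanggolin cluster -p "$PANGENOME" --identity "$IDENTITY" &&
    run ppanggolin info -p "$PANGENOME" --content
}

recluster_and_check
```

Setting `DRY_RUN=0` runs the two commands for real. Whether an already-clustered pangenome file accepts a second clustering run as-is is not covered in this thread, so check the step-by-step page linked above if the `cluster` command refuses.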

If all you need is the presence/absence table, without any partitions, graph, genomic island predictions or whatever, you should be able to write it directly after the clustering step, using one of the following commands:

ppanggolin write -p pangenome.h5 --csv

or

ppanggolin write -p pangenome.h5 --Rtab

depending on whether you want the Rtab file or the CSV file. If you do not know, you can check their formats on this page: https://github.com/labgem/PPanGGOLiN/wiki/Outputs#gene-presence-absence
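Once the table is written, a quick way to judge how sparse the clusters are is to count the families found in only one genome. The sketch below assumes the Rtab file is tab-separated, with one header row, the family name in the first column and one 0/1 column per genome; that layout is an assumption, so check it against the outputs page above. The `summarise_rtab` function name and the toy table are made up for illustration.

```shell
#!/bin/sh
# Summarise a presence/absence table (assumed layout: tab-separated,
# header row, family name in column 1, then one 0/1 column per genome).
summarise_rtab() {
    awk -F '\t' '
        NR == 1 { n_genomes = NF - 1; next }   # header: count the genome columns
        {
            present = 0
            for (i = 2; i <= NF; i++) if ($i + 0 > 0) present++
            total++
            if (present == 1) singletons++             # family seen in one genome only
            if (present == n_genomes) shared_by_all++  # family seen in every genome
        }
        END {
            printf "genomes=%d families=%d singletons=%d in_all=%d\n",
                   n_genomes, total, singletons, shared_by_all
        }
    ' "$1"
}

# Toy table standing in for a real file: 4 genomes, 3 families.
demo=$(mktemp)
printf 'Gene\tg1\tg2\tg3\tg4\nfamA\t1\t1\t1\t1\nfamB\t1\t0\t0\t0\nfamC\t0\t1\t0\t0\n' > "$demo"
summarise_rtab "$demo"   # prints: genomes=4 families=3 singletons=2 in_all=1
rm -f "$demo"
```

If most families turn out to be singletons, the clusters are probably too sparse for the table to be useful, and a lower identity threshold is worth trying.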

Don't hesitate to tell me if you need any other information!

Adelme