gtonkinhill / panaroo

An updated pipeline for pangenome investigation
MIT License

Add a new genome to a Panaroo graph #153

Open fabgenomics opened 2 years ago

fabgenomics commented 2 years ago

Hi, I use Panaroo a lot for a research project. I generated a gene_presence_absence matrix from 1500 bacterial genomes, then used this matrix to train a supervised machine learning model. I got decent accuracy, so I wanted to dig deeper. I would now like to use the generated graph to include a new genome, extract the data for that genome and run predictions on it. The problem with the panaroo-integrate command is that all the groups are renamed, in an order that differs from the original matrix. For instance, in the original matrix group_5637 represents the gene hemB, but after I add a genome with panaroo-integrate the hemB gene is in group_695. As I only keep some of the groups for training and prediction, I would like the group names to stay identical when a genome is added: group_5637 should still represent the same hemB gene. Would it be difficult to implement this in the panaroo-integrate code? Thanks for all your work, Fabien

gtonkinhill commented 2 years ago

Hi,

This is a good point and should hopefully not be too difficult to implement. I will try and get to it as soon as I can and add it to the next release.

In the meantime you may be able to use the geneIDs to keep track of the same clusters between runs. The panaroo-integrate command should maintain the existing clusters which can be identified by the geneIDs within them.
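As a rough illustration of the geneID workaround described above, the sketch below matches each cluster from the original run to the cluster in the integrated run that shares the most gene IDs with it. It is not part of Panaroo. It assumes both gene_presence_absence.csv files start with the columns Gene, Non-unique Gene name and Annotation, that every other column is one genome, and that a cell may hold several gene IDs separated by ";". The helper names read_clusters and map_clusters and the example IDs are hypothetical.

```python
import csv
import io
from collections import Counter, defaultdict

# Assumed metadata columns; every other column is treated as one genome.
META_COLUMNS = {"Gene", "Non-unique Gene name", "Annotation"}


def read_clusters(handle):
    """Return {cluster name: set of gene IDs} from a presence/absence CSV."""
    clusters = {}
    for row in csv.DictReader(handle):
        gene_ids = set()
        for column, cell in row.items():
            if column is None or column in META_COLUMNS or not cell:
                continue
            # Assumption: several IDs in one cell are separated by ';'.
            gene_ids.update(g for g in cell.split(";") if g)
        clusters[row["Gene"]] = gene_ids
    return clusters


def map_clusters(old, new):
    """Map each old cluster to the new cluster sharing the most gene IDs."""
    owners = defaultdict(set)  # gene ID -> names of new clusters holding it
    for name, gene_ids in new.items():
        for gene_id in gene_ids:
            owners[gene_id].add(name)
    mapping = {}
    for name, gene_ids in old.items():
        votes = Counter(n for g in gene_ids for n in owners.get(g, ()))
        mapping[name] = votes.most_common(1)[0][0] if votes else None
    return mapping


# Made-up example data standing in for the two CSV files.
old_csv = """Gene,Non-unique Gene name,Annotation,genomeA,genomeB
group_5637,,hemB,A_00010,B_00020
group_12,,dnaA,A_00001,B_00001
"""
new_csv = """Gene,Non-unique Gene name,Annotation,genomeA,genomeB,genomeC
group_3,,dnaA,A_00001,B_00001,C_00001
group_695,,hemB,A_00010,B_00020,C_00044
"""

old = read_clusters(io.StringIO(old_csv))
new = read_clusters(io.StringIO(new_csv))
mapping = map_clusters(old, new)
print(mapping)
```

With real output, the two StringIO objects would be replaced by open file handles. The resulting mapping could then be used to rename the columns of the new matrix back to the names the model was trained on. Clusters that map to None, or that win only by a narrow vote, would need manual inspection.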

fabgenomics commented 2 years ago

Hi, the problem with the geneID is that two different groups can have the same geneID. I used the default parameters for cluster thresholds plus --clean-mode strict --remove-invalid-genes --merge_paralogs.
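One way to see how often this happens is to list every geneID that occurs in more than one group. The sketch below is only an illustration, not Panaroo code. It assumes the clusters have already been loaded into a dictionary from group name to a set of gene IDs; the function name shared_gene_ids and the example IDs are hypothetical.

```python
from collections import defaultdict


def shared_gene_ids(clusters):
    """Return {gene ID: sorted group names} for IDs found in several groups."""
    seen = defaultdict(set)
    for name, gene_ids in clusters.items():
        for gene_id in gene_ids:
            seen[gene_id].add(name)
    return {g: sorted(names) for g, names in seen.items() if len(names) > 1}


# Made-up example: A_00010 appears in two groups.
clusters = {
    "group_1": {"A_00010", "B_00020"},
    "group_2": {"A_00010", "C_00005"},
    "group_3": {"B_00031"},
}
ambiguous = shared_gene_ids(clusters)
print(ambiguous)
```

Groups whose IDs never appear in this report can be tracked between runs by geneID alone; the rest need another key.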