Roary grouping genes that don't meet similarity threshold (using `-i 100`, and `-s`)

LijMeh commented 6 months ago

I've been running Roary in order to use the presence/absence tables as input for training ML models (to identify sub-species groups based on shared genomic features). Since I don't really care whether genes are paralogs, I included the `-s` option; because I do care whether proteins differ by even one amino acid, I also set `-i 100`.

My understanding is that this should cause the output from Roary to capture all variants of every gene as their own "cluster" (while ignoring the synteny of the gene), allowing me to determine which strains have which unique variants. However, in practice, I'm frequently seeing genes grouped into clusters that aren't 100% identical (via blastp).

I've verified this by pulling genes out using the query_pan_genome function, by manually extracting genes based on their location in the .gff files, and by nucleotide blast followed by translating to a protein.
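For anyone who wants to reproduce this kind of check, here is a minimal sketch. It assumes you have already built two plain dictionaries from your own data: one mapping each cluster name to its gene IDs, and one mapping each gene ID to its translated protein sequence. The function name and the toy IDs and sequences are hypothetical and not part of Roary. Note that exact string equality is stricter than a blastp "100% identity" hit, which is computed over the aligned region only.

```python
from collections import defaultdict


def find_mixed_clusters(clusters, proteins):
    """Return {cluster: {sequence: [gene_ids]}} for every cluster that
    contains more than one distinct protein sequence."""
    mixed = {}
    for name, gene_ids in clusters.items():
        by_seq = defaultdict(list)
        for gene_id in gene_ids:
            # Normalise case and drop a trailing stop-codon marker, if any.
            seq = proteins[gene_id].upper().rstrip("*")
            by_seq[seq].append(gene_id)
        if len(by_seq) > 1:
            mixed[name] = dict(by_seq)
    return mixed


# Toy data, made up for illustration only.
toy_clusters = {
    "group_1": ["a_001", "b_001"],
    "group_2": ["a_002", "b_002"],
}
toy_proteins = {
    "a_001": "MKTAYIA",
    "b_001": "MKTAYIA",
    "a_002": "MSLLNQ",
    "b_002": "MSLLNH",  # differs from a_002 by one residue
}

mixed = find_mixed_clusters(toy_clusters, toy_proteins)
print(sorted(mixed))  # ['group_2']
```

Any cluster reported here holds at least two protein variants, which is what should not happen if `-i 100` behaved as a strict full-length identity cut-off.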

If anyone has any ideas as to why this might be the case, that would be extremely helpful. For the genes that aren't grouping correctly, I can't find any logic as to why they might be clustered together (based on my inputs).

As an aside, the only thing I've found that seems to correlate with this pattern (but not perfectly) is that mis-grouped proteins occasionally share the same strand (i.e. + or - in the .gff file), so one group might end up filled with proteins that are mostly encoded on the - strand, and another with proteins mostly on the + strand (but never a 100% split).

memoriasresiduais commented 5 months ago

I have the same problem: I could detect different protein variants in the same Roary group, despite using `-i 100 -s`...

LijMeh commented 5 months ago

Thanks for the comment, glad to see I'm not the only person having this issue. From my preliminary testing it seems like Roary only starts working "correctly" when you go down to a 98% threshold. I'm currently working on a tool that independently validates Roary's results (as I need to use it for a paper but obviously don't fully trust it anymore). I'd be happy to share that once it's complete.

memoriasresiduais commented 5 months ago

I also tried Panaroo and had the same problem: proteins with >98% identity were pooled in the same family, despite `-i 100`... Happy to try your tool for validation!

LijMeh commented 5 months ago

For sure, will share once it's finished (hopefully in the next couple of weeks).

In the meantime, have you created a GitHub issue on the Panaroo page? It looks like they're pretty responsive as the program is still being maintained/developed. (I'll probably run my data through it too, but might take a bit to verify I'm having the issue on my end)

memoriasresiduais commented 5 months ago

Hi LijMeh, I just tried PanACoTA and it seems to be pooling my gene families properly using a cut-off of 100% aa identity (`-i 1`) (https://aperrin.pages.pasteur.fr/pipeline_annotation/html-doc/usage.html#pangenome-subcommand). I'll do more checks to be sure, though. Feel free to reach me directly if you wish.

Good luck,
m
memoriasresiduais@gmail.com

memoriasresiduais commented 5 months ago

Just to add for the record: in my Roary analyses with `-i 100 -s`, Roary was not only pooling different protein variants in the same group, but also placing identical proteins in different groups... No idea why this happens.
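The second symptom (identical proteins landing in different groups) can be checked with a short sketch along these lines. It makes the same assumptions as any do-it-yourself validation: the cluster-to-gene and gene-to-protein dictionaries are built by you from your own output files, and the function name and toy data below are hypothetical.

```python
from collections import defaultdict


def find_split_proteins(clusters, proteins):
    """Return {sequence: [cluster names]} for every protein sequence that
    occurs in more than one cluster."""
    seq_to_clusters = defaultdict(set)
    for name, gene_ids in clusters.items():
        for gene_id in gene_ids:
            # Normalise case and drop a trailing stop-codon marker, if any.
            seq = proteins[gene_id].upper().rstrip("*")
            seq_to_clusters[seq].add(name)
    return {
        seq: sorted(names)
        for seq, names in seq_to_clusters.items()
        if len(names) > 1
    }


# Toy data, made up for illustration only.
toy_clusters = {
    "group_1": ["a_001"],
    "group_2": ["b_001", "c_001"],
}
toy_proteins = {
    "a_001": "MKTAYIA",
    "b_001": "MKTAYIA",  # identical to a_001 but in another group
    "c_001": "MSLLNQ",
}

split = find_split_proteins(toy_clusters, toy_proteins)
print(split)  # {'MKTAYIA': ['group_1', 'group_2']}
```

With `-s`, paralogs are not meant to be split, so under the expectations described in this thread any sequence reported here would be unexpected.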

sanger-pathogens / Roary

Roary grouping genes that don't meet similarity threshold (using `-i 100`, and `-s`) #616