tseemann / prokka

:zap: :aquarius: Rapid prokaryotic genome annotation

Discrepancies in output downstream analyses [inc. roary] #244

Closed TreeT2 closed 6 years ago

TreeT2 commented 7 years ago

Hi Torsten,

Not sure whether this is also related to Roary, but I'd love to hear your opinion. I apologize in advance for the long post.

I want to use roary to compare my core and accessory gene sets. But I also want to look into the genetic content based on previously described accessory genes. I have gone about this in two ways:

1) Take the *.ffn files and BLAST them against a reference database of accessory genes.
2) Extract the accessory gene names from Roary's gene_presence_absence.Rtab, extract the locus tags for each from clustered_proteins, and extract the FASTA sequences from the *.ffn files (using R > Linux grep > pyfaidx). Then BLAST as above.

In both instances I remove hits which have less than 50% coverage and 70% identity.
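A minimal sketch of the hit-filtering step described above, not the poster's actual script. It assumes the BLAST results were written with `-outfmt "6 std qlen"` (the 12 standard tabular columns plus query length), that "coverage" means query coverage, and that a hit must pass both thresholds to be kept; none of these details are stated in the thread, and `filter_hits` is a hypothetical helper name.

```python
import csv
import io

# Assumed column layout: BLAST tabular output requested with
# -outfmt "6 std qlen" (12 standard columns plus query length).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore", "qlen"]


def filter_hits(handle, min_identity=70.0, min_coverage=50.0):
    """Yield (query, subject, identity, coverage) for hits passing both cutoffs."""
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        identity = float(row["pident"])
        # Query coverage: percentage of the query spanned by the alignment.
        span = abs(int(row["qend"]) - int(row["qstart"])) + 1
        coverage = 100.0 * span / int(row["qlen"])
        if identity >= min_identity and coverage >= min_coverage:
            yield row["qseqid"], row["sseqid"], identity, round(coverage, 1)


# Illustrative, made-up hits: one near full-length, one covering 30% of the query.
example = io.StringIO(
    "iso1_00010\teae\t98.5\t900\t13\t0\t1\t900\t1\t900\t0.0\t1600\t1000\n"
    "iso1_00020\teae\t99.0\t300\t3\t0\t1\t300\t1\t300\t1e-150\t540\t1000\n"
)
kept = list(filter_hits(example))
print(kept)
```

Applying exactly the same function to the output of both methods rules out the filtering step as a source of the discrepancy, since a partial gene can pass an identity cutoff while failing a coverage cutoff.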

I get two totally different profiles in each instance. It's very frustrating, considering I have a group of isolates with an accessory gene group present using method 2 but not method 1, and I know they should not have this content. I've gone back to the input *.gff file, and the gene name is within the file in isolates that should not have this gene. Surely the two methods should produce similar results. Why is there a discrepancy?

andrewjpage commented 7 years ago

Hi, This is a Roary issue I think. What percentage identity did you use for Roary? Andrew

TreeT2 commented 7 years ago

Hi Andrew,

Thanks. I use 90% blastP ID

andrewjpage commented 7 years ago

I'm afraid I don't know what the problem is straight off. My guess is that MCL is doing something unexpected.

TreeT2 commented 7 years ago

To give more helpful info: looking over my Roary gene_presence_absence.Rtab file, there is a gene, eae, which crops up 4 times: eae, eae_1, eae_2 and eae_3. All of these are present in the isolates to varying degrees. My thinking is that some of these were detected by Prokka in complete or partial forms, making them appear different. Roary then reads these from the *.gff files and defines separate gene groups?
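One way to list every gene that has been split like this is to scan the Rtab file for group names that differ only by a numeric suffix. The sketch below assumes the Rtab layout is a tab-separated header line (`Gene`, then one column per isolate) followed by 0/1 rows, and that unnamed clusters are called `group_<number>`; `split_groups` is a hypothetical helper, not part of Roary.

```python
import re
from collections import defaultdict

SUFFIX = re.compile(r"_\d+$")          # trailing _1, _2, ... on a gene name
UNNAMED = re.compile(r"^group_\d+$")   # assumed naming of unnamed clusters


def split_groups(rtab_lines):
    """Map base gene name -> {group name: number of isolates carrying it}."""
    rows = iter(rtab_lines)
    next(rows, None)  # skip the header line
    groups = defaultdict(dict)
    for line in rows:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or UNNAMED.match(fields[0]):
            continue
        name, calls = fields[0], fields[1:]
        base = SUFFIX.sub("", name)
        groups[base][name] = sum(int(c) for c in calls)
    # Keep only base names that occur as more than one group.
    return {base: members for base, members in groups.items() if len(members) > 1}


# Illustrative, made-up presence/absence table.
example = [
    "Gene\tiso1\tiso2\tiso3",
    "eae\t1\t1\t0",
    "eae_1\t0\t0\t1",
    "group_12\t1\t1\t1",
    "fimA\t1\t1\t1",
]
print(split_groups(example))
```

With a real file, pass an open file handle, for example `split_groups(open("gene_presence_absence.Rtab"))`.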

andrewjpage commented 7 years ago

A feature of Roary is that it splits paralogs based on synteny. This can cause identical genes to be split into different groups. You can turn it off by providing '-s', which gives you one less thing to rule out.

When assigning gene names to clusters Roary just takes the most frequently occurring name for the sequences in that cluster, and all it takes is 1 partial hit to throw things off, sorry.
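The naming rule described above can be illustrated with a short sketch. This is not Roary's code (Roary is written in Perl); `name_cluster` and the fallback label are hypothetical and only show the "most frequently occurring name" idea.

```python
from collections import Counter


def name_cluster(annotated_names, fallback="unnamed_group"):
    """Return the most frequent non-empty gene name among a cluster's sequences."""
    counts = Counter(name for name in annotated_names if name)
    if not counts:
        return fallback  # no sequence in the cluster carried a gene name
    return counts.most_common(1)[0][0]


# Made-up cluster: two sequences annotated eae_1, one annotated eae.
print(name_cluster(["eae", "eae_1", "eae_1"]))
```

In a small cluster the vote is decided by very few sequences, so a single differently named (for example partial) annotation can change the label the whole cluster receives.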

A bit of good news though, in a short while Sion Bayliss will be releasing software which fixes all of these problems.

TreeT2 commented 7 years ago

Thanks Andrew.

I repeated it again with the -s option and eae_3 has gone. Fingers crossed for the new bit of software.

passdan commented 3 years ago

Hi @andrewjpage (from the future)!

To wrap this up, I assume the new software that you mention Sion Bayliss will be releasing is now released as PIRATE (https://github.com/SionBayliss/PIRATE), which I just hunted out. Thanks for the signpost; hopefully it will sort out the multiple group split issue.