sanger-pathogens / Roary

Rapid large-scale prokaryote pan genome analysis
http://sanger-pathogens.github.io/Roary
Other
321 stars 190 forks source link

core_accessory_graph weights are greater than 1 #405

Open sarahbuzz opened 6 years ago

sarahbuzz commented 6 years ago

Hello,

I'm attempting to use the DOT file created by Roary for some downstream analyses (great addition and thank you for writing it to a file!). I was looking through the weights that were calculated and I'm getting very confused so please tell me if I'm misunderstanding something!

I read through the Wiki and the Supplemental information and was under the impression that the weight of each edge is calculated by 1 divided by the number of isolates that have those two nodes (genes) next to each other in their respective genomes. I see weights greater than 1 very often and I'm confused on how any weight can be greater than 1 if your dividing 1 by an integer.

I would appreciate any guidance on how to interpret the weight calculation!

tseemann commented 6 years ago

I don't know the answer, but perhaps paralog genes might be the problem or bug? Not sure!

sarahbuzz commented 6 years ago

How would the split-paralogs flag affect that? I would expect paralogs to be in separate clusters if the split paralogs is flagged.

I also have cases where the weight is less than 1 but when I invert that to see what the number of isolates was, I get a number that isn't right. For example, one weight was 0.032258. I then did 1 divided by that number to get the number of isolates and I get 31. I checked how often those two gene pairs are next to each other in our population set and it's much higher than 31.

I went into the OrderGenes.pm code and printed the denominator of the weight calculation and they are floats, not integers. Is there any other criteria your team used to calculate the denominator of the weight equation other than frequency?

andrewjpage commented 6 years ago

Hi Sarah, Sorry that the code isnt the clearest, and looking back now after 3 years I realise I should have documented it a bit clearer. Its great to hear your looking at the graphs, because I think this is the most interesting (untapped) output of roary!

Higher numbers = longer path = 2 genes less commonly found beside each other on contigs. Lower numbers = shorter path = 2 gene more commonly found beside each other on contigs.

To reduce the impact of over represented genotypes in a sample set, a clustering is performed with cd-hit, and each cluster/genotype gets a weighting (in AccessoryClustering.pm). For example if you have 100 samples, but 90 come from a single outbreak this would skew your results. So any samples from the single outbreak are down weighted and only contribute 1/90. These weightings carry over to the graph, so that the contribution of theses over represented samples doesnt bias the graph.

If you think the method could be improved, please feel free to submit a pull request. Regards, Andrew

sarahbuzz commented 6 years ago

Hi Andrew, Thank you for the explanation! Ok I think I understand now. You defined outbreaks as those that map to the same cluster in CD-HIT? Then you are using the representative sequences to create the graph nodes and edges then using the frequency across the representatives to calculate the weight?

Thanks and I agree, this graph is a great resource! Sarah

andrewjpage commented 6 years ago

Yes thats correct.

sarahbuzz commented 6 years ago

Great, thank you for your help Andrew!

sarahbuzz commented 6 years ago

Andrew, I do have another question. If we are inverting the frequency, I'm confused how we can get weights larger than 1. I printed the denominator in the OrderGenes.pm and noticed that they are floats.

I'm confused how a frequency can be a float in this case when calculating the weight?