Open sarahbuzz opened 6 years ago
I don't know the answer, but perhaps paralog genes might be the problem or bug? Not sure!
How would the split-paralogs flag affect that? I would expect paralogs to be in separate clusters if the split paralogs is flagged.
I also have cases where the weight is less than 1 but when I invert that to see what the number of isolates was, I get a number that isn't right. For example, one weight was 0.032258. I then did 1 divided by that number to get the number of isolates and I get 31. I checked how often those two gene pairs are next to each other in our population set and it's much higher than 31.
I went into the OrderGenes.pm code and printed the denominator of the weight calculation and they are floats, not integers. Is there any other criteria your team used to calculate the denominator of the weight equation other than frequency?
Hi Sarah, Sorry that the code isnt the clearest, and looking back now after 3 years I realise I should have documented it a bit clearer. Its great to hear your looking at the graphs, because I think this is the most interesting (untapped) output of roary!
Higher numbers = longer path = 2 genes less commonly found beside each other on contigs. Lower numbers = shorter path = 2 gene more commonly found beside each other on contigs.
To reduce the impact of over represented genotypes in a sample set, a clustering is performed with cd-hit, and each cluster/genotype gets a weighting (in AccessoryClustering.pm). For example if you have 100 samples, but 90 come from a single outbreak this would skew your results. So any samples from the single outbreak are down weighted and only contribute 1/90. These weightings carry over to the graph, so that the contribution of theses over represented samples doesnt bias the graph.
If you think the method could be improved, please feel free to submit a pull request. Regards, Andrew
Hi Andrew, Thank you for the explanation! Ok I think I understand now. You defined outbreaks as those that map to the same cluster in CD-HIT? Then you are using the representative sequences to create the graph nodes and edges then using the frequency across the representatives to calculate the weight?
Thanks and I agree, this graph is a great resource! Sarah
Yes thats correct.
Great, thank you for your help Andrew!
Andrew, I do have another question. If we are inverting the frequency, I'm confused how we can get weights larger than 1. I printed the denominator in the OrderGenes.pm and noticed that they are floats.
I'm confused how a frequency can be a float in this case when calculating the weight?
Hello,
I'm attempting to use the DOT file created by Roary for some downstream analyses (great addition and thank you for writing it to a file!). I was looking through the weights that were calculated and I'm getting very confused so please tell me if I'm misunderstanding something!
I read through the Wiki and the Supplemental information and was under the impression that the weight of each edge is calculated by 1 divided by the number of isolates that have those two nodes (genes) next to each other in their respective genomes. I see weights greater than 1 very often and I'm confused on how any weight can be greater than 1 if your dividing 1 by an integer.
I would appreciate any guidance on how to interpret the weight calculation!