mattb112885 / clusterDbAnalysis

ITEP - Integrated Toolkit for Exploration of microbial Pan-genomes

Clarification of terms for mcl #64

Closed jcthrash closed 10 years ago

jcthrash commented 10 years ago

Could you recommend a reference or provide some additional information clarifying the minbit/maxbit scoring criteria, and how these impact the cluster distribution? I'm comfortable with the effect of the inflation parameter on cluster granularity, and I've used MCL before in a pipeline context, but I've not come across the scoring criteria previously. Is there a default for MCL? How does changing the scoring criteria affect the clusters at a given inflation parameter? I've been messing around with changing these variables and watching how the total number of clusters changes, but I'd love to have a better understanding of the theory rather than just guessing.

Thanks!

mattb112885 commented 10 years ago

Sure. Maxbit and minbit are bit scores normalized by the self bit score of either the query or the target protein (the self bit score is the bit score from BLASTing the protein against itself). Maxbit normalizes by the larger of the two self bit scores, which means there has to be sufficient similarity over most of the larger protein. Using it will therefore tend to exclude gene fragments that are called in some genomes and will focus more on strong hits over the whole protein. This is the one we usually use. Minbit normalizes by the smaller of the two self bit scores, which will pull in these fragments (or halves of fusion genes, etc.) if they are sufficiently similar to part of the bigger protein.
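
To make the difference concrete, here is a minimal sketch of the two normalizations as described above. The function names and the example numbers are made up for illustration; this is not the ITEP source code.

```python
def maxbit(bit_score, query_self_bit, target_self_bit):
    """Bit score divided by the LARGER of the two self bit scores."""
    return bit_score / max(query_self_bit, target_self_bit)


def minbit(bit_score, query_self_bit, target_self_bit):
    """Bit score divided by the SMALLER of the two self bit scores."""
    return bit_score / min(query_self_bit, target_self_bit)


# Hypothetical hit: a gene fragment (self bit score 200) that matches
# part of a full-length protein (self bit score 800) with bit score 190.
fragment_vs_full_maxbit = maxbit(190, 200, 800)  # 190 / 800 = 0.2375
fragment_vs_full_minbit = minbit(190, 200, 800)  # 190 / 200 = 0.95
```

With these invented numbers the fragment scores low under maxbit and would likely fall below a typical cutoff, while under minbit it scores close to 1 and would be pulled into the cluster.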

MCL doesn't have a default, but the options in the mcl program that converts BLAST results to scores all use metrics that don't depend on the protein length, which I have found tends to pull in only weakly related proteins (e.g. lots of unrelated dehydrogenases) because of regions of local similarity. You can get access to one of these using the scoring criterion normhsp, which is the bit score normalized by HSP length (HSP length is roughly the length of the similar region plus added gaps). Hope this helps.
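
As a rough illustration of why a length-independent metric behaves this way, the sketch below compares normhsp with maxbit for a short local hit and a full-length hit. The function names and all numbers are hypothetical, chosen only to show the effect; they are not taken from ITEP or from real BLAST output.

```python
def normhsp(bit_score, hsp_length):
    """Bit score divided by HSP length (independent of protein length)."""
    return bit_score / hsp_length


def maxbit(bit_score, query_self_bit, target_self_bit):
    """Bit score divided by the larger of the two self bit scores."""
    return bit_score / max(query_self_bit, target_self_bit)


# Two hypothetical hits between proteins whose self bit scores are both 800.
# Local hit: a 60-residue region of similarity with bit score 110.
# Full hit:  a 400-residue alignment with bit score 700.
local_normhsp = normhsp(110, 60)      # about 1.83
full_normhsp = normhsp(700, 400)      # 1.75
local_maxbit = maxbit(110, 800, 800)  # 0.1375
full_maxbit = maxbit(700, 800, 800)   # 0.875
```

In this made-up case normhsp rates the short local hit about as highly as the full-length hit, so both would become edges of similar weight, whereas maxbit separates them clearly.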

Matt

jcthrash commented 10 years ago

Thanks for the info! I have also noticed that MCL will pull in very short proteins that don't appear to be "true" homologs to the rest of a given cluster. I've dealt with this in the past by screening for length after the clustering, but yours is a better method I think.