[MCscan] Curious about Single Linkage Clustering

zhaotao1987 commented 5 years ago

Hi Haibao,

From your MCscan tutorial: https://github.com/tanghaibao/jcvi/wiki/MCscan-(Python-version)

That's it! This calls LAST to do the comparison, filter the LAST output to remove tandem duplications and weak hits. A single linkage clustering is performed on the LAST output to cluster anchors into synteny blocks. At the end of the run, you'll see summary statistics of the synteny blocks.

I wonder how the single linkage clustering was performed to identify synteny blocks. It seems to me a very simple and novel approach. Is it still related to DAGchainer? It would be helpful if you could point out the relevant section of your scripts.

Thank you very much!

Best, Tao

tanghaibao commented 5 years ago

@zhaotao1987

Code section is here: https://github.com/tanghaibao/jcvi/blob/master/jcvi/compara/synteny.py#L350

Basically, all anchor pairs that are within a certain distance threshold along either the x-axis or the y-axis are clustered into the same group using the Grouper data structure.
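To make the idea concrete, here is a minimal sketch of single linkage clustering over anchor points. It is not the code from synteny.py: it uses a small union-find in place of the Grouper data structure, and the names `single_linkage`, `xdist`, `ydist`, and `min_size` as well as the default values are hypothetical. It assumes each anchor is an (x, y) pair of gene ranks in the two genomes, and that two anchors are linked when their gaps on both axes are within the thresholds, which is one reading of the description above; check the linked source for the exact rule.

```python
def single_linkage(points, xdist=20, ydist=20, min_size=4):
    """Cluster (x, y) anchors by single linkage (illustrative sketch only).

    Assumption: two anchors are linked when their x gap is <= xdist and
    their y gap is <= ydist. Clusters smaller than min_size are dropped.
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Visit anchors in x order so the inner loop can stop early.
    order = sorted(range(len(points)), key=lambda i: points[i])
    for a, i in enumerate(order):
        for j in order[a + 1:]:
            if points[j][0] - points[i][0] > xdist:
                break  # every later anchor is even farther along x
            if abs(points[j][1] - points[i][1]) <= ydist:
                parent[find(i)] = find(j)  # merge the two groups

    clusters = {}
    for i in order:
        clusters.setdefault(find(i), []).append(points[i])
    return [c for c in clusters.values() if len(c) >= min_size]


# Usage: an inverted (anti-diagonal) run still forms one block, because
# only proximity is checked, not the direction of the run.
blocks = single_linkage([(1, 10), (2, 9), (3, 8), (50, 50)],
                        xdist=2, ydist=2, min_size=3)
```

Because linkage is transitive, a long block is built from many short hops: two anchors can end up in the same block even if they are far apart, as long as a chain of nearby anchors connects them.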

The single linkage clustering method has a weaker assumption than DAGchainer (which assumes collinearity). However, to use the simpler single linkage clustering, one needs to perform pre-processing more thoroughly. The current recommended pipeline (as implemented in python -m jcvi.compara.catalog ortholog) filters:

- tandem repeats (that are within 10 genes away)
- weaker hits, using a C-score cutoff

With these filters in place, we often don't need to rely on collinearity to remove noise any more, hence the simpler single linkage clustering.
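As an illustration of the second filter, here is a rough sketch of a C-score cutoff. It assumes the usual definition, where the C-score of a hit is its score divided by the larger of the best score seen for its query and the best score seen for its subject. The function name `cscore_filter`, the (query, subject, score) tuple layout, and the 0.7 cutoff are hypothetical choices for the example, not the jcvi implementation, and scores are assumed to be positive.

```python
def cscore_filter(hits, cutoff=0.7):
    """Keep hits whose C-score is at least `cutoff` (illustrative sketch).

    hits: iterable of (query, subject, score) tuples with positive scores.
    C-score(q, s) = score(q, s) / max(best score of q, best score of s).
    """
    hits = list(hits)
    best_query, best_subject = {}, {}
    for q, s, score in hits:
        # Track the best score seen for each gene on either side.
        best_query[q] = max(best_query.get(q, 0.0), score)
        best_subject[s] = max(best_subject.get(s, 0.0), score)

    kept = []
    for q, s, score in hits:
        cscore = score / max(best_query[q], best_subject[s])
        if cscore >= cutoff:
            kept.append((q, s, score))
    return kept


# Usage: the a1-b2 hit scores 60, but a1 has a better hit (100),
# so its C-score is 0.6 and it is dropped at a 0.7 cutoff.
example = cscore_filter([("a1", "b1", 100), ("a1", "b2", 60), ("a2", "b2", 80)])
```

A cutoff close to 1 keeps only hits that are the best for both genes, so it behaves like a reciprocal-best-hit filter; lower cutoffs let in progressively weaker paralogous hits.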

Haibao

zhaotao1987 commented 5 years ago

@tanghaibao Thanks very much for your explanation. Best, Tao

[MCscan] Curious about Single Linkage Clustering #125