Adds an experimental option, -L, to vg deconstruct that clusters similar allele traversals together. The value given is a threshold on the length-weighted Jaccard coefficient between the oriented nodes of two traversals. So if -L 0.75 is given, alleles whose traversals have a similarity of at least 0.75, based on their graph positions, are merged into one. Two new FORMAT fields keep track of the difference introduced by merging: TS (Jaccard distance) and TL (length difference). Clustering is done greedily, starting with the selected reference paths.
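To make the idea concrete, here is a minimal Python sketch of length-weighted Jaccard similarity plus greedy clustering. It is an illustration only, not the code in this PR: the function names, the (node_id, is_reverse) representation, the use of sets rather than multisets, and the "join the most similar existing representative" rule are all assumptions made for the example.

```python
def weighted_jaccard(trav_a, trav_b, node_len):
    """Length-weighted Jaccard similarity between two traversals.

    Each traversal is a list of oriented nodes (node_id, is_reverse).
    node_len maps node_id -> sequence length in bp (hypothetical input).
    Assumption: traversals are compared as sets, so repeated visits to
    the same oriented node count once.
    """
    a, b = set(trav_a), set(trav_b)
    union = a | b
    if not union:
        return 1.0  # two empty traversals are treated as identical
    shared_bp = sum(node_len[node_id] for node_id, _ in a & b)
    total_bp = sum(node_len[node_id] for node_id, _ in union)
    return shared_bp / total_bp


def cluster_traversals(travs, node_len, min_jaccard):
    """Greedy clustering; travs must list reference traversals first.

    Returns a list of clusters, each a list of indices into travs.
    The first index of every cluster is its representative.
    """
    clusters = []
    for i, trav in enumerate(travs):
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = weighted_jaccard(travs[cluster[0]], trav, node_len)
            if sim >= min_jaccard and (best is None or sim > best_sim):
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([i])  # nothing similar enough: new cluster
        else:
            best.append(i)        # merge into the closest representative
    return clusters


# Toy bubble: nodes 1 and 4 flank three alternative middle nodes.
node_len = {1: 10, 2: 1, 3: 1, 4: 10, 5: 50}
ref = [(1, False), (2, False), (4, False)]
snp = [(1, False), (3, False), (4, False)]  # differs from ref by 1 bp node
ins = [(1, False), (5, False), (4, False)]  # carries a 50 bp node instead

clusters = cluster_traversals([ref, snp, ins], node_len, 0.75)
print(clusters)  # [[0, 1], [2]]
```

In this toy case the SNP-like traversal shares 20 of 22 bp with the reference (similarity about 0.91) and is merged into it, while the traversal through the 50 bp node shares only 20 of 71 bp and forms its own cluster.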
Description
I don't think the clustering is especially useful on its own (though it can be used to make much simpler VCFs), but it's an important part of improved multi-level VCF support (finally coming in the next PR). It's also why I'm clustering on the graph path and not the actual DNA string: only the paths themselves can be used to anchor the child snarls.
Changelog Entry
To be copied to the draft changelog by merger:
-L added to vg deconstruct in order to cluster similar allele traversals together. The value given is a (length-weighted) threshold for the jaccard coefficient between the oriented nodes of two traversals. So if -L 0.75 is given, then alleles that have >= 0.75 similarity based on their graph positions will be merged into one. Two new FORMAT fields are added to keep track of the difference, TS (jaccard distance) and TL (length difference). Clustering is done greedily starting with selected reference paths.