Use of HaploBlocker principle for zooming

Goal: Reduction of the number of nodes in a variation graph.

Exemplary numbers using 501 maize DHs (32767 SNPs on chromosome 10).

1. Derivation of a window cluster (alternatively, just use single SNPs)
   Window cluster with 6,226 nodes.
   Single-SNP approaches are possible but probably require the use of Merge Vertical / neglect-nodes to get rid of calling errors etc. and to heavily reduce the number of markers! For my dataset it was just 7,180, but with more diversity in the dataset I would assume much higher numbers.
2. Use of cross-merge / simple-merge (Merge Horizontal / Split Vertical)
   Window cluster with 1,935 nodes.
   I don’t think there is any reason not to use these two merging techniques.
3. Use of cross-merge / simple-merge / neglect-nodes (remove rare variants)
   Window cluster with 1,444 nodes.
   Removing rare nodes is definitely debatable. I would argue that displaying a node with exactly 1 haplotype in it is not really informative; in HaploBlocker I am even using 5 as a default. The figure below shows what the cluster now looks like. Nodes 12 and 14 are not merged, as there is one additional haplotype that is contained in only one of the two. But that is relatively rare.

[Figure: cluster1]
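The first three steps can be sketched in a few lines of Python. This is only a toy illustration under simplifying assumptions, not HaploBlocker's implementation: nodes are formed by grouping haplotypes with exactly identical alleles in fixed-size windows (no tolerance for calling errors), only simple-merge is shown (cross-merge is omitted), and all function names and the `min_haps` parameter are made up for this example.

```python
from collections import defaultdict


def build_window_cluster(haplotypes, window_size):
    """Group haplotypes with identical alleles in each window into one node."""
    nodes = []
    n_snps = len(haplotypes[0])
    for window, start in enumerate(range(0, n_snps, window_size)):
        groups = defaultdict(set)
        for hap_id, hap in enumerate(haplotypes):
            groups[tuple(hap[start:start + window_size])].add(hap_id)
        for alleles in sorted(groups):
            nodes.append({"windows": [window], "haps": groups[alleles]})
    return nodes


def simple_merge(nodes):
    """Merge nodes in adjacent windows that hold exactly the same haplotypes."""
    merged = [{"windows": list(n["windows"]), "haps": set(n["haps"])} for n in nodes]
    changed = True
    while changed:
        changed = False
        for a in merged:
            partner = next(
                (b for b in merged
                 if b["windows"][0] == a["windows"][-1] + 1 and b["haps"] == a["haps"]),
                None,
            )
            if partner is not None:
                a["windows"].extend(partner["windows"])
                merged = [n for n in merged if n is not partner]
                changed = True
                break  # restart the scan on the updated node list
    return merged


def neglect_nodes(nodes, min_haps):
    """Drop nodes that contain fewer than min_haps haplotypes."""
    return [n for n in nodes if len(n["haps"]) >= min_haps]


# Toy data: 5 haplotypes, 6 SNPs, windows of 2 SNPs.
haplotypes = [
    [0, 0, 1, 1, 0, 1],
    [0, 0, 1, 1, 0, 1],
    [0, 0, 1, 1, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [1, 0, 0, 0, 1, 1],
]
cluster = build_window_cluster(haplotypes, window_size=2)
merged = simple_merge(cluster)
kept = neglect_nodes(merged, min_haps=2)
print(len(cluster), len(merged), len(kept))
```

In the toy data, haplotypes 0, 1 and 2 share a node in the first two windows, so those two nodes are merged; the two single-haplotype nodes in the first window are then removed by the rare-node filter.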

4. Extending the HaploBlocker concept to identify long-range haplotype structures
   The idea here is to identify long-range haplotype associations. E.g., in this example all haplotypes in node 3 transition into 8, 10, 12, 14, 20, but no merges are possible because some other variants are also contained in the following nodes.
   4.1 Derive a haplotype library in HaploBlocker. In case of overlap between blocks, split the blocks into smaller ones with no overlap.
   4.2 Substitute paths in the window cluster by the identified haplotype blocks: generate a new node and remove all haplotypes from the respective nodes they were in before.
   Window cluster with 815 nodes.
   Nodes that were not combined into any haplotype block typically contain only a few, locally similar haplotypes. E.g., nodes 5, 6, 7 contain 5, 6, 5 haplotypes, 5 of which are the same. I would assume that this can easily be reduced to ~500 nodes for this dataset.
   This procedure can be used multiple times in succession, with the window cluster of the prior iteration being the input for the next one.

For simplicity, I used HaploBlocker without block or SNP extension and did not allow any haplotypes to be added to a node or to be excluded in the block extension.
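Step 4.2 could look roughly like the following Python sketch. It assumes that the blocks have already been made non-overlapping (step 4.1), that every node and block is described by a window range plus a haplotype set, and that only nodes lying completely inside a block's window range are touched. The function name `substitute_blocks` and the dictionary layout are hypothetical and not taken from HaploBlocker.

```python
def substitute_blocks(nodes, blocks):
    """Replace paths through the window cluster by haplotype-block nodes.

    nodes, blocks: lists of {"windows": (first, last), "haps": set}.
    Blocks are assumed to be non-overlapping (see step 4.1).
    """
    result = [{"windows": n["windows"], "haps": set(n["haps"])} for n in nodes]
    for block in blocks:
        first, last = block["windows"]
        for node in result:
            n_first, n_last = node["windows"]
            # Only nodes fully covered by the block give up its haplotypes.
            if first <= n_first and n_last <= last:
                node["haps"] -= block["haps"]
    # Nodes left without haplotypes disappear from the cluster.
    result = [n for n in result if n["haps"]]
    # Each block becomes one new node spanning its whole window range.
    result.extend({"windows": b["windows"], "haps": set(b["haps"])} for b in blocks)
    return result


# Toy cluster: haplotypes 0-3 travel together through windows 0..3, but other
# haplotypes share some of their nodes, so simple merging is not possible.
nodes = [
    {"windows": (0, 0), "haps": {0, 1, 2, 3}},
    {"windows": (0, 0), "haps": {4, 5}},
    {"windows": (1, 1), "haps": {0, 1, 2, 3, 4}},
    {"windows": (1, 1), "haps": {5}},
    {"windows": (2, 2), "haps": {0, 1, 2, 3}},
    {"windows": (2, 2), "haps": {4, 5}},
    {"windows": (3, 3), "haps": {0, 1, 2, 3, 5}},
    {"windows": (3, 3), "haps": {4}},
]
blocks = [{"windows": (0, 3), "haps": {0, 1, 2, 3}}]

reduced = substitute_blocks(nodes, blocks)
print(len(nodes), "->", len(reduced))
```

The leftover nodes contain only a few haplotypes each, so a further round of merging on the reduced cluster is the natural next step.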

Some more thoughts:
Input for the block identification procedure is the window cluster. For each node, it is stored which haplotypes are in it and from which / to which nodes the included haplotypes transition. There is no requirement of fixed positions; I would even assume that the vg output should be relatively easily translatable into this data structure. Deletions and translocations should be directly implementable. For duplications we could try using the same node multiple times and storing it under different names, e.g. 1, 1_Dup1, 1_Dup2, etc., but that probably requires a bit more work.
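A minimal sketch of such a data structure in Python, assuming the input is one ordered list of node names per haplotype (how these paths would be extracted from vg output is not covered here). The function name and the field names `haps`, `prev` and `next` are made up for illustration; only the `_Dup` naming follows the suggestion above.

```python
from collections import Counter, defaultdict


def build_node_table(paths):
    """Store per node its haplotypes and the incoming/outgoing transitions.

    paths: dict mapping haplotype name -> ordered list of node names.
    A node visited repeatedly by one haplotype is stored as name, name_Dup1, ...
    """
    table = defaultdict(lambda: {"haps": set(), "prev": Counter(), "next": Counter()})
    for hap, path in paths.items():
        visits = Counter()
        renamed = []
        for name in path:
            k = visits[name]
            visits[name] += 1
            renamed.append(name if k == 0 else f"{name}_Dup{k}")
        for i, name in enumerate(renamed):
            table[name]["haps"].add(hap)
            if i > 0:
                table[name]["prev"][renamed[i - 1]] += 1
            if i + 1 < len(renamed):
                table[name]["next"][renamed[i + 1]] += 1
    return dict(table)


paths = {
    "h1": ["1", "2", "3"],       # reference-like path
    "h2": ["1", "3"],            # deletion: node 2 is skipped
    "h3": ["1", "2", "1", "3"],  # duplication: node 1 is visited twice
}
table = build_node_table(paths)
for name in sorted(table):
    print(name, sorted(table[name]["haps"]), dict(table[name]["next"]))
```

No positions are stored, only node membership and transitions, so a deletion simply shows up as a transition that skips a node.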
