graph-genome / graph_summarization

Browser for Graph Genomes built with VG based on Graph Summarization to provide semantic zoom. As a user zooms in on a graph genome, the topology becomes more complex. Provides visualization for variation within a species of plant or animal. Designed to scale up to thousands of specimens and provide useful visualizations.
Other
7 stars 1 forks source link

Roadmap: Graph Summarization #3

Open josiahseaman opened 5 years ago

josiahseaman commented 5 years ago

Implement a method of summarizing the variation inside of a graph.

Prototype Progress

josiahseaman commented 5 years ago

My Goal for this week is adding sequence and sequence identity penalties into my summarization prototype.

josiahseaman commented 5 years ago

I've read "Distribution of recombination crossovers and the origin of haplotype blocks" (Wang et al. 2002) and "The structure of haplotype blocks in the human genome" (Gabriel 2002). The first paper points to the 4 Gamete Test for recombination points outlined in (Kaplan 1985). The approach appeals to me because it's very binary and easily implemented. We'd eventually want to upgrade to (Gabriel 2002) method. The 4 Gamete Test is clearly overly sensitive, however that's precisely what we'd want for the first level of summarization. Everything that can be easily dismissed gets dismissed first, leaving us with a pile of candidates we can devote compute power to telling if they're false positives / erroneous.

I'm currently working on a generalization of their approach to graph genomes with an arbitrary number of paths and arbitrary number of alternatives. I think it's possible to make a 5 gamete test, a 6 gamete test and so on that may serve as zoom layers. I'll keep working on it for now.

image

josiahseaman commented 5 years ago

May also be relevant: image

From 2004 Paper: https://academic.oup.com/bioinformatics/article/21/2/263/186662

josiahseaman commented 5 years ago

Haploblocker

http://dx.doi.org/10.1534/genetics.119.302283 https://www.genetics.org/content/genetics/early/2019/05/31/genetics.119.302283.full.pdf

Haploblocker looks like they have a lot of reusable concepts and methods. Figure 2 (below) looks very similar to the graph summarization algorithm that I specced out and prototyped for graph genomes. The key unexpected similarity is that they use SNP data to build a graph with slices that represent all the variation. So from dissimilar data, they build essentially the same data structure as us. I think this would work on full sequences.

Haploblocker Fig2 - Graph Summarization

Figure 5 looks very similar to a glimpse of what I was proposing to do with sorting and aggregating Paths into Ribbons. Bernardo (Earlham) pointed out that a dendrogram of similar paths was necessarily local, not global. This figure is a great depiction of why. TubeMap gives us the option to gradually rearrange ordering as we go, so we could conceivable keep fragmentation in check. Though we need to represent that graphically if we're allowing paths to migrate from one ribbon color to another.

Haploblocker Fig5 - Sorting paths by local coherence visualization

Computing Time

"with the full dataset 895 (501 haplotypes, 80,200 SNPs) needing 55 seconds on default, 75 896 seconds with a target coverage and 13.3 minutes in the adaptive 897 mode." Our dataset is roughly 1,000 haplotypes and 150,000,000 bp 'SNPs' except that much of the sequence doesn't have any variance (99%), so we could likely cut it down by 100x. 1,500,000 * 1,000 = (1500000000 / (501 × 80200)) = 37x larger dataset. Meaning 37 minutes to 8 hours for our complete dataset. Those are still very reasonable times assuming this scales linearly. But the authors say the scaling is likely better than that: "The marker density only had a minor effect. Even a panel containing just every tenth marker, on average, needed 99.3% of the computing time of the full dataset."

They appear to agree on the applicability of the method to sequence.
"A future topic of research is the explicit inclusion of larger structural variation like duplications, insertions or deletions as is done in methods to generate a pangenome (Eggertsson et al. 2017). Since blocks in HaploBlocker are of large physical size most structural variation should still be modelled implicitly and an application to sequence data is perfectly possible."

josiahseaman commented 5 years ago

Adapting Haploblocker method to Graph Genomes

Many of the concepts transfer 1:1. I suppose "Great minds think alike." Simple Merge (SM) = Merge Horizontal Split Group (SG) = Split Vertical Neglect Node (NN) = Merge Vertical Extended-block-identification = Migrate Paths (which I had also marked optional)

The question of what order to do the operations in is answered by the paper: "we first alternately apply SM and SG until no further changes occur. Next, we apply the sequence of NN, SM, SG, SM until fixation of the window cluster."

HaploBlocker MCMB may actually serve as a decent zoom level. The most zoomed in would have zero Block-filtering. As we zoom out, we increase MCMB and save each intermediary state along the way of summarization through simplification. This is a sleek solution to an arbitrary threshold: "without prior information, we recommend to set a target on what share of the SNP dataset is represented by at least one block (“coverage”)".

Notable Deviations

Defining Slices

Slices are necessary to Translate MSA / Haplotype Methods to a Graph Genome. Slices are a set of nodes which can be reasonably compared with each other and considered “alternatives” in a branch of the graph genome. These correspond to nucleotides in a vertical slice of an MSA. In graph genomes, it’s a bit more complicated, so here is a proposed technical definition. In order for a set of Nodes to be in the same slice they need to:

  1. Be placed in structurally equivalent positions
  2. Be non-diverging i.e. in the same bubble. Large translocations are divergent.

Programmatically this can be determined by:

  1. From Node B, step upstream to Node A and take all nodes downstream of A to find Node C. This can be an irregular stepsize, meaning that later summarization steps after horizontal merges will bring a node C into the same slice with B when previously that was not shared.
  2. From Nodes B and C, iterate downstream 10 steps and every node ID into a set. If the two sets share a member node ID it means they share a downstream node and are part of the same bubble. Maximum bubble size is purposely defined within some limited view window. Higher levels of summarization means each node is more sequence and thus the bubble size in nucleotides will be larger. This will result in large bubble alternative not being compared until near the last stages of summarization when working with large structural features.

Deletions: An issue to address is the absence or deletion variants that will alter what “1 node downstream” means. If this is a large deletion, then they shouldn’t be compared, but small deletions are an alternative without a representative node.

ekg commented 5 years ago

The slices concept is a limited version of that of bubbles. See "Superbubbles, Ultrabubbles and Cacti" https://www.biorxiv.org/content/10.1101/101493v1.

The most general version of these structures are called "snarls". In vg, there is a method to decompose the graph into a set of snarls. However, this implementation has some scaling problems. The snarl tree would be the thing you'd like to use here.

On Wed, Jul 3, 2019, 08:19 Josiah Seaman notifications@github.com wrote:

Adapting Haploblocker method to Graph Genomes

Many of the concepts transfer 1:1. I suppose "Great minds think alike". Simple Merge (SM) = Merge Horizontal Split Group (SG) = Split Vertical Neglect Node (NN) = Merge Vertical Extended-block-identification = Migrate Paths (which I had also marked optional)

The question of what order to do the operations in is answered by the paper: "we first alternately apply SM and SG until no further changes occur. Next, we apply the sequence of NN, SM, SG, SM until fixation of the window cluster."

HaploBlocker MCMB may actually serve as a decent zoom level. The most zoomed in would have zero Block-filtering. As we zoom out, we increase MCMB and save each intermediary state along the way of summarization through simplification. This is a sleek solution to an arbitrary threshold: "without prior information, we recommend to set a target on what share of the SNP dataset is represented by at least one block (“coverage”)". Notable Deviations

  • Multiple paths can share the same node, so I'm not sure whether their block overlap concept is the same or not.
  • Block-identification is a large scale Horizontal Merge. It looks like a second, high level way of removing uninformative variation. It's unclear at this time if we will include this or not.
  • Block-extension: the choice of 1 in 20 windows having a mismatch is another parameter we may associate with zoom.
  • Visual Ordering: Figure 5 (above) shows a gradual decay of coherence because ordering is strictly preserved. Our advantage is we can use TubeMap path crossovers to keep the appearance of blocks looking coherent. It'd be good to know how much of this decoherence is due to set membership and how much of it is due to simple limitations of 1D sorting.

Defining Slices

Slices are necessary to Translate MSA / Haplotype Methods to a Graph Genome. Slices are a set of nodes which can be reasonably compared with each other and considered “alternatives” in a branch of the graph genome. These correspond to nucleotides in a vertical slice of an MSA. In graph genomes, it’s a bit more complicated, so here is a proposed technical definition. In order for a set of Nodes to be in the same slice they need to:

  1. Be placed in structurally equivalent positions
  2. Be non-diverging i.e. in the same bubble. Large translocations are divergent. Programmatically this can be determined by:
  3. From Node B, step upstream to Node A and take all nodes downstream of A to find Node C. This can be an irregular stepsize, meaning that later summarization steps after horizontal merges will bring a node C into the same slice with B when previously that was not shared.
  4. From Nodes B and C, iterate downstream 10 steps and every node ID into a set. If the two sets share a member node ID it means they share a downstream node and are part of the same bubble. Maximum bubble size is purposely defined within some limited view window. Higher levels of summarization means each node is more sequence and thus the bubble size in nucleotides will be larger. This will result in large bubble alternative not being compared until near the last stages of summarization when working with large structural features.

Deletions: An issue to address is the absence or deletion variants that will alter what “1 node downstream” means. If this is a large deletion, then they shouldn’t be compared, but small deletions are an alternative without a representative node.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/graph-genome/vgbrowser/issues/3?email_source=notifications&email_token=AABDQENBLY4MI5JMQLWAP7LP5SKMLA5CNFSM4HW4RTGKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZEII7Y#issuecomment-508068991, or mute the thread https://github.com/notifications/unsubscribe-auth/AABDQEIXL6ZEY7AJ2PXOHQ3P5SKMLANCNFSM4HW4RTGA .

josiahseaman commented 5 years ago

Thank you for the tip Eric! I was thinking of slices as a very limited scope of the beginning of a bubble. But if we can get a local snarl at any point in the graph, that's terrific. I'll have to look at the technical details to see how far that can be taken. I was hoping to do it iteratively while summarizing.

On Wed, Jul 3, 2019, 17:16 Erik Garrison notifications@github.com wrote:

The slices concept is a limited version of that of bubbles. See "Superbubbles, Ultrabubbles and Cacti" https://www.biorxiv.org/content/10.1101/101493v1.

The most general version of these structures are called "snarls". In vg, there is a method to decompose the graph into a set of snarls. However, this implementation has some scaling problems. The snarl tree would be the thing you'd like to use here.

On Wed, Jul 3, 2019, 08:19 Josiah Seaman notifications@github.com wrote:

Adapting Haploblocker method to Graph Genomes

Many of the concepts transfer 1:1. I suppose "Great minds think alike". Simple Merge (SM) = Merge Horizontal Split Group (SG) = Split Vertical Neglect Node (NN) = Merge Vertical Extended-block-identification = Migrate Paths (which I had also marked optional)

The question of what order to do the operations in is answered by the paper: "we first alternately apply SM and SG until no further changes occur. Next, we apply the sequence of NN, SM, SG, SM until fixation of the window cluster."

HaploBlocker MCMB may actually serve as a decent zoom level. The most zoomed in would have zero Block-filtering. As we zoom out, we increase MCMB and save each intermediary state along the way of summarization through simplification. This is a sleek solution to an arbitrary threshold: "without prior information, we recommend to set a target on what share of the SNP dataset is represented by at least one block (“coverage”)". Notable Deviations

  • Multiple paths can share the same node, so I'm not sure whether their block overlap concept is the same or not.
  • Block-identification is a large scale Horizontal Merge. It looks like a second, high level way of removing uninformative variation. It's unclear at this time if we will include this or not.
  • Block-extension: the choice of 1 in 20 windows having a mismatch is another parameter we may associate with zoom.
  • Visual Ordering: Figure 5 (above) shows a gradual decay of coherence because ordering is strictly preserved. Our advantage is we can use TubeMap path crossovers to keep the appearance of blocks looking coherent. It'd be good to know how much of this decoherence is due to set membership and how much of it is due to simple limitations of 1D sorting.

Defining Slices

Slices are necessary to Translate MSA / Haplotype Methods to a Graph Genome. Slices are a set of nodes which can be reasonably compared with each other and considered “alternatives” in a branch of the graph genome. These correspond to nucleotides in a vertical slice of an MSA. In graph genomes, it’s a bit more complicated, so here is a proposed technical definition. In order for a set of Nodes to be in the same slice they need to:

  1. Be placed in structurally equivalent positions
  2. Be non-diverging i.e. in the same bubble. Large translocations are divergent. Programmatically this can be determined by:
  3. From Node B, step upstream to Node A and take all nodes downstream of A to find Node C. This can be an irregular stepsize, meaning that later summarization steps after horizontal merges will bring a node C into the same slice with B when previously that was not shared.
  4. From Nodes B and C, iterate downstream 10 steps and every node ID into a set. If the two sets share a member node ID it means they share a downstream node and are part of the same bubble. Maximum bubble size is purposely defined within some limited view window. Higher levels of summarization means each node is more sequence and thus the bubble size in nucleotides will be larger. This will result in large bubble alternative not being compared until near the last stages of summarization when working with large structural features.

Deletions: An issue to address is the absence or deletion variants that will alter what “1 node downstream” means. If this is a large deletion, then they shouldn’t be compared, but small deletions are an alternative without a representative node.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub < https://github.com/graph-genome/vgbrowser/issues/3?email_source=notifications&email_token=AABDQENBLY4MI5JMQLWAP7LP5SKMLA5CNFSM4HW4RTGKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZEII7Y#issuecomment-508068991 , or mute the thread < https://github.com/notifications/unsubscribe-auth/AABDQEIXL6ZEY7AJ2PXOHQ3P5SKMLANCNFSM4HW4RTGA

.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/graph-genome/vgbrowser/issues/3?email_source=notifications&email_token=AARG2FDGJ7LLIILINIKOUGDP5TGEBA5CNFSM4HW4RTGKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZE6R7Y#issuecomment-508160255, or mute the thread https://github.com/notifications/unsubscribe-auth/AARG2FGYZQQEULECFSSXBHLP5TGEBANCNFSM4HW4RTGA .

ekg commented 5 years ago

I actually think that the snarl decomposition can be problematic, and I'm interested in a path-based method to determine this. Maybe that's more similar to the slices you're describing. The idea would be to find regions where you have colinear-ish alignment between different paths in the graph. Within these chains, places where they don't match are bubbles, and these can be identified by the start and and end of the bubble. Bigger, complex structures are harder to see, and to some extent get ignored by following this definition.

josiahseaman commented 5 years ago

Complex structures should be ignored by designed because those are areas where we don't want to oversimplify. Slices are for the very common very simple case. I'd like to still keep my graph a graph where it's locally complicated, especially if that is rare.

On Wed, Jul 3, 2019, 18:14 Erik Garrison notifications@github.com wrote:

I actually think that the snarl decomposition can be problematic, and I'm interested in a path-based method to determine this. Maybe that's more similar to the slices you're describing. The idea would be to find regions where you have colinear-ish alignment between different paths in the graph. Within these chains, places where they don't match are bubbles, and these can be identified by the start and and end of the bubble. Bigger, complex structures are harder to see, and to some extent get ignored by following this definition.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/graph-genome/vgbrowser/issues/3?email_source=notifications&email_token=AARG2FCGATI2OXDTI3IZ6X3P5TM6RA5CNFSM4HW4RTGKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZFDREY#issuecomment-508180627, or mute the thread https://github.com/notifications/unsubscribe-auth/AARG2FGCSB5KEM3B5CZPQK3P5TM6RANCNFSM4HW4RTGA .

ekg commented 5 years ago

@josiahseaman how is this all related to the variable length markov chains used in Beagle?

josiahseaman commented 5 years ago

@ekg I wasn't familiar with Beagle. Looking through Browning's 2006 and 2007 papers I can see there's similarities but I'm not 100% sure how to translate it. I'm talking with the author of Haploblocker today, who I'm hoping has more experience with the wider literature than myself. What would you propose we use from the lesson learned with Beagle? If there is an adjustable graph summarization out there, I'd be very happy to just use off the shelf software and close this issue. That would allow us to focus on visualization.

tpook92 commented 5 years ago

A couple of thoughts on the use of HaploBlocker for zooming etc.

Merge Vertical + Data quality

I think merge vertical is much more similar to allowing for 1 error/deviation per 20 markers than what I call neglect nodes. In neglect nodes I am removing haplotypes locally from the cluster. By this rare variants (which will not be in any haplotype block later) are not further considered - not sure you want that. An alternative for this would potentially be to introduce a node that is including all rare variants.

In regard to that there should be a way to account for the data quality. From experience I would assume 2-3% calling errors for low-coverage sequencing (0.5x or similar), <1% for high-coverage (20x+) and 0.2-0.5% on a SNP chip. Further the share of missingness is potentially different – e.g. with 0.5x you would expect 60-70% of all alleles to be called NA even though they are present).

Zooming

For high level zooming the current merging techniques are probably not enough. Looking for long range relations is probably required for that (e.g. group-wise IBD as in HaploBlocker or screening for commonly occurring paths in vg?). As for HaploBlocker this comes with some challenges:

  1. Blocks are overlapping – in the development of HaploBlocker this was never really a concern but this could for example be solved by modification of the block boundaries. E.g. in Figure 4 just let block b2 span from marker 11 to marker 25 https://www.genetics.org/content/genetics/early/2019/05/31/genetics.119.302283.full.pdf
  2. The use of fixed windows as a starting point for the algorithm is not ideal as boundaries of blocks are potentially not on those boundaries. Block extension to negate those problems fully
  3. Currently for each haplotype the same order of markers is assumed. Incorparation smaller structurally variation should be no problem. Large Translocations are not as easy – first of all identify them. In case of such long range blocks there will definitely be case of recombination happening in a block. Allowing for haplotypes to be just in a part of the blocks seems to be against the understanding of a node but still potentially be helpful. Other question here would be potential information loss and how to quantify it – probably not a short term thing. With all the changes and different data structure it most likely makes more sense to start from scratch than use the originally HaploBlocker code - R might not be the best solution for the other parts of the application anyway right ?

Computing time

The dataset I used contained genomic data from a single maize landrace and thereby little genomic diversity. For substainitly more diverse computing time should increase – as only one core was used this should still not be the biggest issue. It was enough for 1000G Humans Phase 3. But there are still lots of potential improvements (mostly in the merging procedure and in case of included NAs) to speed it up if it has to be calculated more than once.

Further

How do you define a “complex structure” – aren’t those like the most interesting to have a graph based representation of? Why not disregard fixated base-bairs – you need them for alignment but for the graph they should not be important or am I missing something fundamental? Or just to show the physical size of regions?

josiahseaman commented 5 years ago

Status Update: Python HaploBlocker is running - 13 August

Torsten and I have been working to develop a Jupyter notebook implementation of the first three steps of HaploBlocker. Input data and tests are feature equivalent. It was very useful to have a working R implementation to compare against for tests.

I then created a Django app and migrated the code and concepts into HaploBlocker/haplonetwork.py and tests.py. Split_groups (which I had conceived of as split vertical) is working but could still use more thorough tests.

The main goal of Graph Summarization is within sight. We still need to integrate HaploBlocker and Graph together as a database application. This will be a pretty major overhaul with it's own issue #23. Once that is done we should have:

Next:

We're well to the point where it would be helpful to have a front end #24 so we can visualize and browse our test data sets.