lh3 / gfatools

Tools for manipulating sequence graphs in the GFA and rGFA formats

rGFA vs vg's path model #1

Closed lh3 closed 5 years ago

lh3 commented 5 years ago

I will start with a simple case: the compact de Bruijn graph (cDBG). We can construct a cDBG for GRCh38 and keep the chromosome paths (more exactly, walks). This is a lossless representation of GRCh38, but would we prefer it over the linear genome? No. A cDBG discards the intuition of linearity: a unitig may correspond to multiple regions in GRCh38, which unnecessarily complicates analyses. The cDBG is not a good reference model.
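
The loss of positional uniqueness is easy to demonstrate: in a de Bruijn graph every distinct k-mer becomes a single node, so a k-mer that occurs at several genomic positions collapses onto one node. A toy sketch (the sequence and k are chosen only for illustration):

```python
# Toy illustration: a k-mer occurring at several positions in the
# "genome" collapses to a single de Bruijn node, losing linearity.
from collections import defaultdict

def kmer_positions(seq, k):
    """Map each distinct k-mer (one graph node) to all its start positions."""
    pos = defaultdict(list)
    for i in range(len(seq) - k + 1):
        pos[seq[i:i + k]].append(i)
    return pos

genome = "ACGTTTGGACGTTTCC"   # "ACGTTT" occurs twice
nodes = kmer_positions(genome, 6)
print(nodes["ACGTTT"])        # one node, two positions: [0, 8]
```

A single alignment of a read to this node cannot say which of the two genomic regions the read came from; that is the ambiguity the linear reference avoids.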

The path model can encode such a cDBG. It semantically allows graphs that are overcomplicated for a reference. This over-flexibility is a problem. Another issue with the path model is that it doesn't have a concept of sample. Having two similar paths from two samples indicates orthology; having two paths from one sample indicates paralogy/segdup. Without the concept of sample, we can't distinguish the two cases.

Starting with an all-pair alignment, seqwish will build a graph akin to an A-Bruijn graph, which suffers from the same issues as a cDBG. From a practical point of view, a much bigger problem is that it assumes all sequences should be kept in the graph. However, assemblies and alignments have many errors, and these errors will contaminate the reference. On the other hand, a reference may be incomplete, but the content it does include must be accurate. I have worked closely with the GRC and firmly share their view that accuracy is critically important for a reference.

rGFA imposes a strong constraint on a reference model: a linear reference sequence doesn't collapse onto a segment. This constraint keeps the intuition of linearity and simplifies graph operations. It is exactly what we want.
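
Concretely, rGFA attaches a stable coordinate to every segment through required tags on the S-line: SN (stable sequence name), SO (stable offset) and SR (rank, 0 for the reference). A minimal parser sketch, with a made-up segment for illustration:

```python
def parse_rgfa_segment(line):
    """Parse an rGFA S-line into (name, sequence, stable_name, offset, rank)."""
    fields = line.rstrip("\n").split("\t")
    assert fields[0] == "S"
    name, seq = fields[1], fields[2]
    tags = dict((t[:2], t.split(":", 2)[2]) for t in fields[3:])
    # SN: stable sequence name, SO: stable offset, SR: rank (0 = reference)
    return name, seq, tags["SN"], int(tags["SO"]), int(tags["SR"])

line = "S\ts1\tACGT\tSN:Z:chr1\tSO:i:1000\tSR:i:0"
print(parse_rgfa_segment(line))  # ('s1', 'ACGT', 'chr1', 1000, 0)
```

Because every base of every segment carries exactly one (SN, SO) pair, a position in the graph always projects to a single stable position, which is what preserves the intuition of linearity.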

rGFA can be extended to a constrained path model with the addition of Walk lines, which explicitly exposes the concept of sample. However, such an extended model will be more complicated, and we don't have algorithms to build such a graph. Building an extended rGFA to base accuracy again assumes that contigs are perfectly accurate, but we won't get there for a few years.

ekg commented 5 years ago

> I will start with a simple case: the compact de Bruijn graph (cDBG). We can construct a cDBG for GRCh38 and keep the chromosome paths (more exactly, walks). This is a lossless representation of GRCh38, but would we prefer it over the linear genome? No. A cDBG discards the intuition of linearity: a unitig may correspond to multiple regions in GRCh38, which unnecessarily complicates analyses. The cDBG is not a good reference model.

I totally agree with you that this is not an optimal reference graph model, and we have no reason to build graphs like this if we have long range structural information available.

That said, there are very likely cases where this kind of model is extremely powerful. For one, we can determine all multimapping positions for free out of a single alignment to the graph. If my reads cannot distinguish between different copies of a repeat, then perhaps it is not wrong to merge them in the reference structure. There are many practical applications where this might be the best we can do. I provide several examples in my thesis (see 3.4 "Neoclassical bacterial pangenomics"). I don't think we should exclude this by building systems of tools that prohibit this representation.

> The path model can encode such a cDBG. It semantically allows graphs that are overcomplicated for a reference. This over-flexibility is a problem.

The hierarchical coordinate system and the path system can work in parallel. They are not exclusive, and in many ways complement each other. What is exclusive is building the graph in a particular kind of manner. I do not believe this is necessary to obtain the properties you are seeking in terms of its reliability and accuracy as a reference data model.

The path model is not a way to provide a reference coordinate system in which each piece of the graph has a single reference position. You have provided a clean way to do this based on progressive construction. Several other methods to do so exist, and it would also be good to consider them (e.g. the snarl tree from vg and cactus). They will all be able to produce a similar kind of coordinate space decomposition.

There is something that you cannot do in rGFA now, which is to embed multiple references. We have multiple versions of the reference today, and I see no reason why this will cease to be. In fact, many people may continue to use a linear reference for decades to come. We might also want to consider improving linear reference systems based on the information contained in pangenome graphs. Relating multiple reference sequences to the reference graph is already important and useful.

> Another issue with the path model is that it doesn't have a concept of sample. Having two similar paths from two samples indicates orthology; having two paths from one sample indicates paralogy/segdup. Without the concept of sample, we can't distinguish the two cases.

The path model can have a concept of sample. We would just need to add an annotation to the P line in the GFA. I'm not sure I follow the problem. The path lines are conceptually the same as alignment records in SAM. It's not been a problem there.
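
The sample annotation suggested here could look like the following sketch, assuming a hypothetical `SM:Z:` tag on P-lines (no such tag is standardized in GFAv1, and the sample and path names are made up). Distinguishing the two cases from the thread then becomes a grouping operation:

```python
from collections import defaultdict

def group_paths_by_sample(gfa_lines):
    """Group P-line path names by a hypothetical SM:Z:<sample> tag."""
    by_sample = defaultdict(list)
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "P":
            continue
        sample = next((t.split(":", 2)[2] for t in fields[3:]
                       if t.startswith("SM:Z:")), None)
        by_sample[sample].append(fields[1])
    return by_sample

gfa = [
    "P\tchr1_a\ts1+,s2+,s3+\t*\tSM:Z:NA12878",
    "P\tchr1_b\ts1+,s2+,s3+\t*\tSM:Z:NA12878",  # two paths, one sample: paralogy/segdup
    "P\tchr1\ts1+,s2+,s3+\t*\tSM:Z:GRCh38",     # similar path, other sample: orthology
]
print(group_paths_by_sample(gfa))
```

This only shows that the information fits in the format; whether one tag per P-line is the right design is exactly the point under discussion.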

I should note that I've been considering another way to bridge the gap between path containing graphs and rGFA as it now stands. We should be able to include extended cigars on the paths/walks to describe the exact relationship between the graph topology and the paths that we want to relate to it. These don't have to be embedded in the structure of the graph, although perhaps it is useful in some cases to do so.

> Starting with an all-pair alignment, seqwish will build a graph akin to an A-Bruijn graph, which suffers from the same issues as a cDBG.

This interpretation is not correct. Starting with any set of alignments between a set of sequences, seqwish will build a graph in which aligned bases (those that are asserted as identical by alignment) are represented once for each set of input sequence bases that have been aligned together. This means that the structure of the input alignment set exactly determines the graph. If these alignments are long, chromosome length, and do not induce collapse, then the resulting graph will have chromosome level components without cycles.

The only free parameter in seqwish limits the number of times that a sequence is included in a single transitive closure in the alignment graph. In other words, it can be set to prevent collapse of any sequences onto themselves. This could be changed so that a given reference sequence never folds on itself. I expect we're going to be testing this out. The alternative is to force alignments to be linear or filter the alignment set itself.
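
The transitive-closure idea described above can be sketched with a union-find over per-base coordinates: every pair of bases asserted identical by an alignment is merged, and each resulting class becomes one graph base. This is a toy reimplementation of the idea, not seqwish itself:

```python
def build_closure(seq_lens, matches):
    """Union-find transitive closure over (seq_id, pos) base coordinates.

    `matches` are alignment-asserted identities: ((s1, p1), (s2, p2)).
    Returns the number of distinct graph bases after collapsing.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for a, b in matches:
        union(a, b)
    all_bases = [(s, p) for s, n in seq_lens.items() for p in range(n)]
    return len({find(x) for x in all_bases})

# Two 4 bp sequences aligned base-for-base: 8 input bases collapse to 4.
lens = {"A": 4, "B": 4}
pairs = [(("A", i), ("B", i)) for i in range(4)]
print(build_closure(lens, pairs))  # 4
```

The point made in the thread follows directly: the input alignment set exactly determines which bases merge, so filtering the alignments (e.g. never allowing a match of a sequence against itself) controls the collapse.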

I apologize that there is not yet a publication to clarify the behavior of seqwish, but I am working towards one. My thesis and the seqwish README provide equivalent documentation of the algorithm itself.

> From a practical point of view, a much bigger problem is that it assumes all sequences should be kept in the graph. However, assemblies and alignments have many errors, and these errors will contaminate the reference. On the other hand, a reference may be incomplete, but the content it does include must be accurate. I have worked closely with the GRC and firmly share their view that accuracy is critically important for a reference.

I agree with you and the GRC. The reference must be accurate. That's one of the reasons that we might want to construct it from many genomes and remember where they map in it. Then, we can remove pieces of the reference graph that are apparently rare or erroneous among the set of genomes from which we built it. The alternative is to align the input genomes to the constructed graph and to do the same thing. They are two sides of the same coin. I believe the mixing of information in the graph constructed from a mutual alignment has advantages over progressive methods, but in practice there may be no meaningful difference in results.

With seqwish, we can manually curate the set of alignments used to build the graph, only accepting those we fully trust based on human-driven curation. I believe this may also be of critical importance to building an accurate reference graph.

> rGFA imposes a strong constraint on a reference model: a linear reference sequence doesn't collapse onto a segment. This constraint keeps the intuition of linearity and simplifies graph operations. It is exactly what we want.

I think this may be good for a coordinate system, but isn't really necessary to impose on all related or derivative methods. That's the concern I have: that we'll do something clean for getting consistent reference coordinates, but this will propagate into all uses of the model downstream in ways that perpetuate reference bias problems we now have.

Human genomes contain many CNVs with large copy-number ranges. It's fine to ensure that the reference doesn't collapse, but consider that it may be harder to genotype these sites if the reference isn't collapsed to represent a minimal allele set (akin to the A-Bruijn model you mention). This is just one point to note that things aren't always so clear. We may make many kinds of graphs to solve these different problems, depending on what we're interested in.

What is important is having a common coordinate space for the big pieces of the genome. I have understood this as what you are doing with rGFA and minigraph. I think we should support both this base case and these extensions of the graph, which I believe will require us to put paths in the graph.

> rGFA can be extended to a constrained path model with the addition of Walk lines, which explicitly exposes the concept of sample. However, such an extended model will be more complicated, and we don't have algorithms to build such a graph. Building an extended rGFA to base accuracy again assumes that contigs are perfectly accurate, but we won't get there for a few years.

Yes, it can be extended. It's just more line types in the file. I would support that. It makes our approaches formally compatible, even if you may choose not to work with the paths/walks.

I think we roughly agree with each other, and perhaps we are still at a stage where we are discussing things that will become more clear as we work. It's good to make things concrete.

I should be able to provide you an example of a graph built with paths (walks?--- what's the difference again?) in the next few days. That said, I'd like to be working on the same data you're using. What genomes did you put into the graphs that you built for h37 and h38? I've used the ONT genomes assembled in @benedictpaten's lab. They are not scaffolded, which is a problem; hopefully that won't be representative of the core data we will use to build reference graphs for humans.

I'd also like to run with similar alignment parameters to those you've used. In any case, the graphs I built are based on minimap2, so we should be able to use a similar parameter set for minigraph and seqwish graphs. I see you require a minimum chaining length of 50kb. Perhaps I should set the same. Are you also building one chromosome at a time, or is the fact that you get chromosome-level components related to the way that you include new bubbles?

lh3 commented 5 years ago

I will comment on this first:

> I think we roughly agree with each other, and perhaps we are still at a stage where we are discussing things that will become more clear as we work.

The key point is: as long as you collapse sequences from a single sample with a seqwish-like algorithm, rGFA won't work. We are on two diverging paths and will drift further apart.

On other points:

> If my reads cannot distinguish between different copies of a repeat, then perhaps it is not wrong to merge them in the reference structure.

This is wrong. We should not confuse data analyses with reference models.

> There is something that you cannot do in rGFA now, which is to embed multiple references.

It is already written in our proposal that rGFA will have W-lines. It can embed multiple references.

> If these alignments are long, chromosome length, and do not induce collapse, then the resulting graph will have chromosome level components without cycles.

Current methods produce collapsed alignments. If we had chromosome-long alignments, the A-Bruijn graph would be much simpler, and seqwish behaves like A-Bruijn. Furthermore, once we get chromosome-long alignments, rGFA will work, as each base can be uniquely identified in one assembly.

> Then, we can remove pieces of the reference graph that are apparently rare or erroneous among the set of genomes from which we built it.

There are far more false or uncertain alignment blocks than true ones with current methods. It is much cleaner to choose the definitely good ones, as the HGRC is proposing. In addition, removing segments from a seqwish graph will break path contiguity; that doesn't work with your model.

> With seqwish, we can manually curate the set of alignments used to build the graph, only accepting those we fully trust based on human-driven curation. I believe this may also be of critical importance to building an accurate reference graph.

Then the rGFA model can work.

> This is just one point to note that things aren't always so clear.

If things aren't clear, leave them out. We add them back when technologies and algorithms are there to make them clear.

> perpetuate reference bias problems we now have.

Suppose one day we can accurately get chromosome-long assemblies; we could then construct a graph from the multi-sequence alignment that fits rGFA. That graph is unbiased. rGFA does not necessarily lead to reference bias.

> We may make many kinds of graphs to solve these different problems, depending on what we're interested in.

We have one problem to solve: the reference model, and we want to find the best solution to this specific problem. I don't care about "these different problems" (e.g. assembly graphs or those solved by the split-tree). In general, GFA is a one-size-fits-all model. A reference model has specific requirements; an over-flexible solution will complicate the intended use cases. GFA1 and SAM became popular because they solve their specific problems well. Many other proposals fail because they try to be too generic.

ekg commented 5 years ago

Before going into the following, which are notes and somewhat disorganized, I just want to make a clear point about my perspective on this. I think that it's a good idea to cache reference positions in the data format that's being used. This is a key piece of what people have been missing in terms of understanding how coordinate systems can be used in pangenome graphs. I also strongly support the GAF model. This is a good step. I'm here to provide a counterpoint to what you've proposed, not to work against you.

I think that we disagree on several points related to how flexible the format should be. I feel that there is also basic confusion about what the path model in vg and GFAv1 does or doesn't provide. I'm committed to working together on this to sort things out and get a system that we can build around. I am taking the proposal in rGFA as a starting point, but I don't mean to tear it down and I'm trying to suggest extensions or minor changes that I think are important.

Responses below.

> If my reads cannot distinguish between different copies of a repeat, then perhaps it is not wrong to merge them in the reference structure.
>
> This is wrong. We should not confuse data analyses with reference models.

If you are talking about coordinate systems, then I can see why that is a separate issue to many analyses. I am arguing this as well, in saying that the coordinate system and graph structure can be independent.

But the reference structure is fundamentally a model of what you expect to see. It's a prior. And if we are inferring a new genome, then this prior is going to be a major part of our analysis.

For instance, in RNA-seq analysis we use a splicing graph as the reference. This is done in hisat2, one of the best-performing RNA-seq aligners, and also in vg rna, not to mention many other implementations going back more than ten years.

I was merely pointing out that allowing cycles in the graph can simplify CNV analysis. It resolves problems of multimapping and reference/sample copy count mismatch that arise when mapping against a linear model. It also identifies the actual variant in the structure of the graph, structuring our prior to match what we expect to see more strongly than if the graph is fully linear.

> It is already written in our proposal that rGFA will have W-lines. It can embed multiple references.

For the sake of compatibility with existing implementations, I would suggest supporting P lines for this purpose. Many implementations (5-10?) already do this. Thus far, we have none that use W lines.

W lines are just P lines transformed into the rGFA coordinate space. We can convert back and forth between them so long as the paths are fully embedded in the graph. If we allow W lines to have extended cigars, then we can be lossless with both. I would propose to allow this. But I note that we have not really used this principle with P lines, and it allows for too much flexibility in the representation of the model, which is a fundamental problem with overlap graph representations in GFA v1 and v2. (It requires non-trivial algorithms to reduce an overlap graph to a blunt-ended one. seqwish is one such method, but it has not been applied to this. We have used @benedictpaten's pinch graph model to do so previously.)

W and P lines are also similar to collections of GAF records. The difference is that in W or P lines we have chosen a single alignment, while GAF records can describe multiple mapping possibilities. Converting between these different data types is presumably in the scope of gfatools or related methods.
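
The correspondence can be illustrated for the path column alone: GAF writes a path as orientation-prefixed segment names (`>` for forward, `<` for reverse), while a P-line uses comma-separated names with `+`/`-` suffixes. A minimal sketch of one direction of the conversion (segment names are made up):

```python
def p_path_to_gaf_path(p_segments):
    """Convert a GFAv1 P-line segment list ('s1+,s2-,s3+')
    to a GAF path string ('>s1<s2>s3')."""
    out = []
    for step in p_segments.split(","):
        name, orient = step[:-1], step[-1]
        out.append((">" if orient == "+" else "<") + name)
    return "".join(out)

print(p_path_to_gaf_path("s1+,s2-,s3+"))  # >s1<s2>s3
```

The reverse direction is equally mechanical, which is why the conversion plausibly belongs in gfatools or a related toolkit, as suggested above.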

> We have one problem to solve: the reference model, and we want to find the best solution to this specific problem. I don't care about "these different problems" (e.g. assembly graphs or those solved by the split-tree). In general, GFA is a one-size-fits-all model. A reference model has specific requirements; an over-flexible solution will complicate the intended use cases. GFA1 and SAM became popular because they solve their specific problems well. Many other proposals fail because they try to be too generic.

The difference with these failed proposals is that I'm not proposing anything. It's already being done this way. When I say that we have a solution to the reference coordinate problem in pangenome graphs, I mean that we already are using it. These ideas are already out in the world, and people are building on them.

I think you've heard me on this point, but disagree. I'd like to understand why what we already have in vg/GFAv1 does not constitute a reference pangenome model.

Keep in mind that we are aligning reads to graphs that have very different structures and contents and still getting stable reference-based alignments out (in SAM or BAM). It is a small step to make GAF, which I think is a strong way to generalize a SAM-like model to a pangenomic context. We don't need the other features in rGFA. Why should we use them?

If we emit GAF, we can do even better than hierarchical rGFA models at preserving coordinate systems, while using only a subset of the existing GFAv1 standard to represent our graphs.

Assume that our input genomic sequences are immutable--- they won't change if we build the graph differently. So if we use GAF to encode our alignments relative to segments of actual genomes, the alignments will be valid no matter how we build or modify the graph. We can always derive such a projection as long as each part of our graph is covered by some path, and any GAF alignment will be valid in any graph that contains the same genomes that it draws upon.
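
To make the stability argument concrete, here is a sketch assuming a hypothetical GAF-style path column written in stable coordinates (`name:start-end`, 0-based, end-exclusive) rather than graph segment IDs. Such a record can be resolved against the source genomes with no graph at all, so it stays valid however the graph is rebuilt (the genomes and coordinates below are made up):

```python
import re

def resolve_stable_path(path, genomes):
    """Resolve a GAF-style path over stable coordinates, e.g.
    '>chr1:4-8<chr2:0-4', into the concatenated reference sequence.
    '<' means the interval is traversed as its reverse complement."""
    comp = str.maketrans("ACGT", "TGCA")
    seq = []
    for orient, name, s, e in re.findall(r"([><])(\w+):(\d+)-(\d+)", path):
        piece = genomes[name][int(s):int(e)]
        if orient == "<":
            piece = piece.translate(comp)[::-1]
        seq.append(piece)
    return "".join(seq)

genomes = {"chr1": "AAAACGTG", "chr2": "TTTTACGT"}
print(resolve_stable_path(">chr1:4-8<chr2:0-4", genomes))  # CGTGAAAA
```

Because the resolution touches only the immutable input sequences, two differently constructed graphs that embed the same genomes agree on what every such record means.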

In contrast, your current model will not be stable across different constructions of the graph, because the identity of segments is based on their context in the graph and the order of progressive alignment, not on their location in the source genomes. As such, GAF alignments to the rGFA model/construction you have put forward are not compatible with different graph models, unless those are extensions of the base graph. I don't see what benefit this inflexibility has. It excludes even the simplest pangenomic representation, a bag of sequences, which is the current standard for publishing and sharing pangenomes. This aspect would lock us into a series of graphs and extensions that could not be changed even if we were to find significant problems with them in the future. (I concede that it would be possible to augment rGFA to achieve the same effect, by adding information about the origin of each segment in a source set of genomes. I would support this.)

Perhaps you are concerned that a multiplicity of embedded genomes will confuse things. I agree that this could be a problem: we could then write GAF against any one of the embedded genomes, which could be very complex to parse. One solution would be to make some paths take precedence over others. Another, which is effectively in our proposal and which I have argued is a way for these models to work together in my blog post, is to use progressive rGFA construction to build a set of sequences that cover the large sequences in the graph. These, if retained as embedded paths in the graph, will provide a hierarchy of coordinates that we can use to the same effect, irrespective of the particularities of the graph construction.

Relative to the vg/GFAv1 model, you have moved the location of information about reference position from paths to sequence name and offset information (and added rank, which helps describe your coordinate hierarchy). This positional and path name caching is something that we're already doing internally in pangenomic indexes, so it's always made sense to me that it should be serialized as well. This might make it easier for users to access this information and make subsets of graphs. I think this is positive, but I am skeptical that it is good to limit ourselves to one coordinate space.

Other than this caching of reference positions on nodes, the major semantic difference between rGFA and GFAv1 is that rGFA disallows overlaps. This is a good thing, as it reduces the flexibility issue and forces information to be included in one place (the graph topology) rather than in two (topology + alignments). VGs represented in GFAv1 never have overlaps.

lh3 commented 5 years ago

We have discussed some of these over emails and will continue to explore the best model. I am closing this issue for now. I will soon add another issue about the Walk/W-lines. Thanks for your insight.