lh3 / miniasm

Ultrafast de novo assembly for long noisy reads (though having no consensus step)
MIT License
299 stars 68 forks source link

improving assembly graph contiguity ("connected component N50") #33

Closed ekg closed 6 years ago

ekg commented 6 years ago

In many assemblies we are finding that assembly graph contiguity is not much higher than that of contigs extracted from it. This is true not just for miniasm but also other assemblers.

Why might this be, and is there any way to improve contiguity of the graph?

A big hit to the N50 of the output would be acceptable in exchange for an improved "connected component N50".

lh3 commented 6 years ago

Why might this be, and is there any way to improve contiguity of the graph?

For short reads, contigs mostly stop around repeats. Without increasing read lengths, it is not possible to get longer contigs.

ekg commented 6 years ago

You mean that the repeats are filtered out by the mapping scheme? I don't want longer contigs, but a more contiguous overlap graph. I understand this could be expensive.

ekg commented 6 years ago

In an abstract string graph, we have one copy of each repeat longer than our read length in the graph. The graph is fully connected. We can't make contigs that span the repeats (because of the bifurcation), but the alternatives are represented in the graph without loss of information w.r.t. the reads.

In practice I understand that this does not work so simply, because repeats are not identical and reads have errors, so you don't get one segment for each repeat sequence, but rather a messy structure that is hard to work with. I'm curious if it is possible to build a graph with these tools that includes that mess, because I want to understand under what conditions it is still potentially a useful object.

lh3 commented 6 years ago

For short reads, local topologies around ALTs, LINEs and alphoids are beyond fix. I believe most assemblers simply throw sequences away in this case. This is evident from contig-to-genome alignment.

My opinion is to forget about the issue. You can't fix it with short reads and it is not worth investing time on a fix given the availability of long reads.

ekg commented 6 years ago

@lh3 we're seeing this issue with long reads and miniasm.

lh3 commented 6 years ago

The issue is exactly the same: assemblers tend to consistently break contigs in some regions. You can't expect miniasm or other assemblers to fix the issue for you. The only solution is to get even longer reads.

ekg commented 6 years ago

Is the breakage done for the sake of overall contig N50? Or is this typically just a way to decrease the computational complexity of subsequent steps after overlap generation?

cschin commented 6 years ago

@lh3 @ekg there are some case the ambiguities could be "local". Namely, a local tandem (or non-tandem) repeats causes ambiguous paths. Most assembler breaks contigs regardless the nature of the ambiguity. However, I did some work to identify such local ambiguities (local hairballs) and try to connect the contigs in-and-out of such small local hair balls. (you can see some examples, e.g. the AMY gene regions in this slidedeck https://speakerdeck.com/jchin/what-after-getting-awesome-contigs-challenges-and-opportunities.). In the human assembly example I showed, we actually have better contiguity than the initial contigs. I was very enthusiastic to pursuit some work along this line (for more complicated genomes than human).

ekg commented 6 years ago

@cschin thanks for the slides, this is in line with what I'm getting at.

However, it is not just local ambiguity that I am interested in. I want to use the full-information "hairball" string graph, but it does not seem that we can generate this with any of the available tools. We are not seeing this in assemblies from the VGP. It would seem that high-copy regions are filtered out by all the long read assemblers we have tried, with the result being that the initial assembly graph is a shattered mix of relatively short contigs with some local ambiguities retained. @mcshane may be able to share some figures he has made with Bandage to demonstrate this. CC @richarddurbin.

The hairball may seem irrelevant if we are interested in generating long contigs. But I don't care so much about getting long contigs because they can be lossy and do not provide any benefit other than the linearization of the system. Graph based genomics methods do not require the linearization of the reference system, and so maybe we can do without it for many uses. I want a full-information object that can be used downstream for various applications, such as 1) assembly polishing (equivalent to variant calling or genotyping), 2) as a reference system for resequencing, 3) or assembly graph merging and pangenome generation. Ideally, assembly pipelines should operate on these full information graphical objects as long as possible, rather than utilizing a lossy projection into contigs for downstream operation.

Is there any assembler or parameterization of an assembler that would allow the generation of something approximating the hairball string graph using long reads? At very least I'd like to explore this line but it has been difficult to obtain relevant input data structures to work with.

lh3 commented 6 years ago

From @richarddurbin:

I fully support the goals articulated here, to get a full copy of the assembly graph that we can work with, not one already cut at all the difficult places. The Falcon graphs seem to be already cut. I am not so sure about miniasm.

For large genomes, it is practically difficult to retain all information. At the overlapping step, we typically throw away k-mers with excessively occurrences. This leads to losses of overlaps in long satellites. Miniasm currently doesn't throw away more overlaps, but that is just because it is not intended for large genomes. For large genomes, I bet it has to introduce some thresholds to drop data. Assemblers discard data primarily for the sake of performance/memory, not for N50. Satellite arrays only consist of ~3% of human genome, but resolving them takes far more computing resources than the rest of genome.

There are also other problems in addition to performance. When there are multiple overlaps between two reads – think about satellite arrays – many if not most assemblers choose one overlap. This is a source of information losses. Furthermore, the final assembly requires error correction, often both before and after the assembly. In satellite arrays, you don't know how to do the correction given the 13% error rate. Without error correction, you will pile the uncorrected reads together without actually assembling them.

From a pure theoretical point of view, I think all these issues may be resolvable. For example, you can use EM to distribute reads and do error correction. However, there are more important problems in genome assembly (e.g. diploid genome), which are hard but probably are still much easier to resolve than issues with satellites.

cschin commented 6 years ago

@ekg FALCON does keep a copy of string graph in a very generic text file format and the graph-to-sequence was carried out in very latest stage. In fact, in the diploid assembly part, the initial string graph is re-used to ensure consistency between the different stages of the assembly. You can find some rough document about the format at https://github.com/PacificBiosciences/FALCON/wiki/Manual#the-2-asm-falcon-directory. I wrote some simple script to convert it the GFAv1 once. I also know there is some updated code from PacBio for the conversion. (The code for my hack on visualization can be found https://github.com/pb-jchin/ASTER, unfortunately, I was not able to make it a serious project for various reasons.) However, I was hoping PacBio's software engineers will help to make it directly to GFAv2.

I think @lh3 already said it well. We all agree there is significant loss of information from the assembly graph to config. We also need to filter out some overlaps to make the assembly graph more managable. I am all for making the assembly graph first class citizen. However, there is a gap between algorithm focused bioinformatists and biological focus bioinformatists. Without some GUI level tools, it is hard to make people who don't process raw (text, preliminary) data to care about the loss of the information. In the mean time, it will be a significant investment to build good GUI tools for graph exploration.

mcshane commented 6 years ago

These are a couple of the graphs we are getting from our fish Falcon assemblies at the moment (work-in-progress).

fspaaur1 fmasarm1 fanates1

lh3 commented 6 years ago

BTW, I also have a falcon->gfa1 converter here, along with converters for spades, supernova, abyss, sga, velvet and soapdenovo.

mcshane commented 6 years ago

Thanks @lh3. I did use that for a while then switched to the Falcon gfa conversion script at some point. I forget why. Maybe to get the sequence into the S lines.

ekg commented 6 years ago

We are wondering if the --min_idt option in fc_ovlp_to_graph might be something we want to change from the default in order to get a more contiguous graph. It's set at 96% by default, which seems pretty high. Sorry this is now about falcon, so maybe this is the wrong forum. In effect, I'd like to know if there is a similar parameter for miniasm.

olekto commented 6 years ago

@ekg Then you need to change the ovlp_HPCdaligner_optionoptions also I think. For the options I have seen from different config files, it seems to only calculate overlaps that that are at least 96 % (-e.96) identical for -l length. If I understand the options correctly, which could be doubtful, I guess you could adjust -e.96 to -e.90, and then use different --min_idt to explore the effect. I think, and hope, that options to fc_ovlp_to_graph can be used to filter the results of ovlp_HPCdaligner_option, that is, use short -l and low -e and then use the options --min_idt and --min_len of fc_ovlp_to_graph to filter the overlaps after they have been calculated.

And sorry about posting Falcon stuff on this forum.

rvaser commented 6 years ago

@ekg, you can print the assembly graph in my layout module https://github.com/rvaser/rala at any time (i.e. right after construction or between any two simplification steps). Although the graph is in CSV format (I am feeding it into Cytoscape), I can implement other formats and make the module a library for easier testing. A sample graph right after creation (and containment removal) from a PB yeast dataset is bellow. untitled