mikolmogorov / Flye

De novo assembler for single molecule sequencing reads using repeat graphs

Read extension jumping to 100% completion #16

Closed jonhultqvist closed 6 years ago

jonhultqvist commented 6 years ago

Hi,

Many thanks for developing this great assembler.

I recently updated to version 2.0 of ABruijn and am testing it on relatively complex nanopore metagenomics samples. The eukaryotic portion of the assembly, which I'm most interested in, comprises 5-30% of the reads in a given dataset. As a consequence, the assembly tends to contain several bacterial genomes at high coverage. The two datasets that I'm working with are 2.7 Gbp and 16 Gbp. With the new version of ABruijn I noticed that the read extension progress jumps from 0%, 10% or 30% (example shown below) directly to 100% completion, depending on the dataset. The assembly appears to progress normally after this. The final assemblies I'm getting appear to have lower contiguity than those from earlier (pre-2.0) versions of ABruijn with the same dataset or a dataset of half the size.

Is the jump to completion in read extension expected behaviour with the new version, or could it indicate a problem with the assembly (lower assembly contiguity)? Sorry if these are vague questions, but I have a feeling that the assembler is running into issues, perhaps as a result of conflicts between the given genome size and the coverage estimated by the assembler.

The estimated genome size is around 30 Mbp, with coverage of around 20x in the dataset for the example shown below.
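
As a rough sanity check on these figures, the sketch below works out what share of each dataset's bases the target genome would account for at about 30 Mbp and 20x. The function name is illustrative only, and because the post does not say which of the two datasets the example log comes from, both sizes are shown.

```python
def target_fraction(genome_size_bp, coverage, dataset_bp):
    """Fraction of the dataset's bases needed to cover the genome at this depth."""
    return genome_size_bp * coverage / dataset_bp


GENOME_SIZE_BP = 30e6  # ~30 Mbp, as stated in the post
COVERAGE = 20          # ~20x, as stated in the post

for dataset_bp in (2.7e9, 16e9):  # the two dataset sizes mentioned in the post
    frac = target_fraction(GENOME_SIZE_BP, COVERAGE, dataset_bp)
    print(f"{dataset_bp / 1e9:.1f} Gbp dataset: target is about {frac * 100:.2f}% of bases")
```

Under these assumptions the target genome would make up roughly 22% of the 2.7 Gbp dataset and just under 4% of the 16 Gbp one, so most of the input is non-target sequence in either case. Note that this is a fraction of bases, whereas the 5-30% figure above refers to reads.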

The launch code: abruijn MinION_albacore1.0.3.chop.fasta /scratch2/jon/MinION/MinION_Abruijn_2.0/ 20 --platform nano --threads 10 --min-overlap 3000 --iterations 3


[2017-08-21 12:43:16] INFO: Running ABruijn
[2017-08-21 12:43:17] INFO: Assembling reads
[2017-08-21 12:43:17] INFO: Reading FASTA
[2017-08-21 12:45:28] INFO: Generating solid k-mer index
[2017-08-21 12:45:30] INFO: Counting kmers (1/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2017-08-21 12:55:07] INFO: Counting kmers (2/2):
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2017-08-21 13:06:59] INFO: Filling index table
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2017-08-21 13:23:30] INFO: Extending reads
0% 10% 20% 30% 100%
[2017-08-21 15:56:40] INFO: Assembled 325 draft contigs
[2017-08-21 15:56:40] INFO: Generating contig sequences
[2017-08-21 16:05:24] INFO: Running BLASR
[2017-08-21 16:37:38] INFO: Computing rough consensus
[2017-08-21 18:12:30] INFO: Performing repeat analysis
[2017-08-21 18:12:32] INFO: Reading FASTA
[2017-08-21 18:14:25] INFO: Building repeat graph
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2017-08-21 18:17:48] INFO: Simplifying the graph
[2017-08-21 18:17:49] INFO: Aligning reads to the graph
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2017-08-21 19:02:42] INFO: Resolving repeats
[2017-08-21 19:02:59] INFO: Generating contigs
[2017-08-21 19:02:59] INFO: Generated 461 contigs
[2017-08-21 19:03:05] INFO: Running BLASR
[2017-08-21 19:30:03] INFO: Polishing genome (1/3)
[2017-08-21 19:32:03] INFO: Separating alignment into bubbles
[2017-08-21 21:19:35] INFO: Correcting bubbles
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2017-08-22 07:16:53] INFO: Running BLASR
[2017-08-22 07:44:43] INFO: Polishing genome (2/3)
[2017-08-22 07:46:42] INFO: Separating alignment into bubbles
[2017-08-22 09:43:09] INFO: Correcting bubbles
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2017-08-22 15:47:37] INFO: Running BLASR
[2017-08-22 16:14:38] INFO: Polishing genome (3/3)
[2017-08-22 16:16:32] INFO: Separating alignment into bubbles
[2017-08-22 18:15:51] INFO: Correcting bubbles
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2017-08-22 21:48:37] INFO: Done! Your assembly is in file: /scratch2/jon/MinION/Abruijn_2.0/polished_3.fasta
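
A minimal sketch for spotting this kind of jump automatically in a log like the one above. It assumes the log format shown here (an `INFO:` line naming the stage, followed by a progress line printed in 10% steps); the function name `find_progress_jumps` is hypothetical and not part of ABruijn.

```python
import re

# A progress line consists only of percentages, e.g. "0% 10% 20% 30% 100%".
PROGRESS_LINE = re.compile(r"^\s*(\d+%\s*)+$")


def find_progress_jumps(log_text, step=10):
    """Return (stage, previous_percent, next_percent) for every skipped step."""
    jumps = []
    stage = None
    for line in log_text.splitlines():
        if "INFO:" in line:
            # Remember the most recent stage name for reporting.
            stage = line.split("INFO:", 1)[1].strip()
        elif PROGRESS_LINE.match(line):
            percents = [int(p) for p in re.findall(r"(\d+)%", line)]
            for prev, nxt in zip(percents, percents[1:]):
                if nxt - prev > step:
                    jumps.append((stage, prev, nxt))
    return jumps


sample = """\
[2017-08-21 13:06:59] INFO: Filling index table
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2017-08-21 13:23:30] INFO: Extending reads
0% 10% 20% 30% 100%
[2017-08-21 15:56:40] INFO: Assembled 325 draft contigs
"""
print(find_progress_jumps(sample))  # [('Extending reads', 30, 100)]
```

Run against the full log, only the "Extending reads" stage would be reported, since every other progress line advances in regular 10% steps.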

mikolmogorov commented 6 years ago

Hi,

The new version uses a slightly different read assembly algorithm, which is aimed at faster assembly. It could potentially be confused by the presence of a large amount of "non-target" sequence. I think in your case the algorithm decides that the assembly is complete too early.

We are now working on a more reliable version of the algorithm, which should be available soon. In the meantime, you can try the latest version from the master branch; it contains some updates that might fix your problem. What portion of the expected genome size are you currently getting? And could you also post (or send to me) the full abruijn.log?
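
One way to answer the genome-size question is to sum the contig lengths in the final FASTA and divide by the expected size. The sketch below assumes a plain (uncompressed) FASTA file; the helper names are hypothetical and not part of ABruijn.

```python
def fasta_total_length(path):
    """Sum the lengths of all sequences in a FASTA file."""
    total = 0
    with open(path) as handle:
        for line in handle:
            if not line.startswith(">"):  # skip header lines
                total += len(line.strip())
    return total


def assembled_fraction(path, expected_genome_size_bp):
    """Total assembled bases as a fraction of the expected genome size."""
    return fasta_total_length(path) / expected_genome_size_bp
```

For example, `assembled_fraction("polished_3.fasta", 30e6)` would give the assembled fraction for a 30 Mbp target. In a metagenomic sample the total also includes any bacterial contigs, so the result likely overstates the eukaryotic portion unless those contigs are filtered out first.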

mikolmogorov commented 6 years ago

Hi,

We've just pushed a new version to master, which in theory should assemble as much sequence as the 1.0 version. Let me know if you have any issues with it.