marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/
654 stars 179 forks source link

Empty contigs.fasta #1282

Closed francoiskroll closed 5 years ago

francoiskroll commented 5 years ago

I am very confused! The command I was using yesterday does not seem to work today.

I am computing the consensus assembly of MinION reads of ~1kb at 800x coverage.

My command is:

canu \
  -p con -d canucon \
  -nanopore-raw reads.fastq \
  genomeSize=1.5k \
  minReadLength=500 minOverlapLength=250

I would swear this was running like a charm, and today it outputs an empty contigs.fasta file.

Few clues:

Can you help?

skoren commented 5 years ago

Most likely the assembly is being filtered because your reads contain the entire assembly, how many reads are in the unassembled.fasta results? If you post the header lines that should have all the info. You can adjust the filtering criteria, see the FAQ: https://canu.readthedocs.io/en/latest/faq.html#my-asm-contigs-fasta-is-empty-why if that's the case.

francoiskroll commented 5 years ago

1 read in the unassembled.fasta. And it is 3545 bp, which is much longer than what I am expecting (around 1000 bp).

And yes, I actually expect most of my reads to contain the entire assembly.

If I remove the filtering (as in the FAQ) with contigFilter="2 0 1.0 0.5 0", the contigs.fasta contains one contig of 3545 bp, which is not correct.

Now, if I add corOutCoverage=100 to the command (in addition to contigFilter="2 0 1.0 0.5 0"), contigs.fasta now contains two contigs, the first one is 3544 bp, the second is 1054 bp. The second looks right.

I am not entirely clear what is happening here. How does the filtering change the way canu handles small contigs? And why does outputting more reads make a difference here? Coverage is high (>800X) and very even, and it seems from the FAQ that these are the main reasons why one would tweak these parameters.

(Final command giving two contigs)

canu \
  -p con8 -d canucon8 \
  -nanopore-raw pool.fastq \
  genomeSize=1.5k \
  minReadLength=500 minOverlapLength=250 \
  contigFilter="2 0 1.0 0.5 0" \
  corOutCoverage=100
skoren commented 5 years ago

You can read on the FAQ what the filter changes, it will keep tigs with < the default coverage across it.

What is your actual read input distribution (or better the full Canu run output)? The reason you are getting the 1kb sequence is because you increased the output coverage to 100x. By default Canu will keep the longest as long as they represent the sequence of all the data. That implies your 3.5kb read/contig has similarity to whatever your target it as well. Have you looked at what it is?

francoiskroll commented 5 years ago

Not sure where it is from, it does not give anything in BLAST. From looking at the bed file, I do have a few longer reads at other places, probably just some non-specific PCR products or contamination. 3 kb is longer than the genome size I set though, I thought canu would look at the genome size and not output this? The full output is attached.

Would it make sense to only run canu on only the chromosome/window I am interested in?

Thanks for your time!

github_canu.txt

skoren commented 5 years ago

Genome size doesn't matter for anything except setting some default resource parameters and reporting statistics. Reads/assembly won't be filtered based on genome size. Your input has lots of data >1kb (about 70% reads and the majority of all bases) so it's not surprising you're getting assembly of that data. I expect the 3kb contig is based on these longer reads you input into the assembler.

If you know for sure none of the data >1kb is what you're interested in you can throw out any reads >1kb before running Canu.

francoiskroll commented 5 years ago

Apologies, I finally understood what is happening with the longer (3.5kb) contig!

My reads are minION data. I actually have hits in BLAST for the 3.5kb contig, all phage genomes. It is the calibration strand provided as positive control by Oxford Nanopore! I missed it previously as I was only looking at the alignment to the human genome and the BLAST results against the human genome too. It is actually an impressively good assembly of it (99.10% identity to phage lambda)!

I threw out all reads > 2kb before alignment and got only the ~ 1 kb contig I am interested in.

Thanks!