Closed francoiskroll closed 5 years ago
Most likely the assembly is being filtered because your reads contain the entire assembly, how many reads are in the unassembled.fasta results? If you post the header lines that should have all the info. You can adjust the filtering criteria, see the FAQ: https://canu.readthedocs.io/en/latest/faq.html#my-asm-contigs-fasta-is-empty-why if that's the case.
There is 1 read in the unassembled.fasta, and it is 3545 bp, which is much longer than what I am expecting (around 1000 bp).
And yes, I actually expect most of my reads to contain the entire assembly.
If I remove the filtering (as in the FAQ) with contigFilter="2 0 1.0 0.5 0", the contigs.fasta contains one contig of 3545 bp, which is not correct.
Now, if I add corOutCoverage=100 to the command (in addition to contigFilter="2 0 1.0 0.5 0"), contigs.fasta contains two contigs: the first is 3544 bp, the second is 1054 bp. The second looks right.
I am not entirely clear on what is happening here. How does the filtering change the way Canu handles small contigs? And why does outputting more reads make a difference? Coverage is high (>800x) and very even, and from the FAQ those seem to be the main reasons one would tweak these parameters.
(Final command giving two contigs)
canu \
-p con8 -d canucon8 \
-nanopore-raw pool.fastq \
genomeSize=1.5k \
minReadLength=500 minOverlapLength=250 \
contigFilter="2 0 1.0 0.5 0" \
corOutCoverage=100
You can read in the FAQ what the filter changes: by default it marks tigs as unassembled when they fall below the required coverage across the tig.
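For reference, a hedged reading of the five contigFilter fields (field order taken from the Canu FAQ; treat it as an assumption if your Canu version differs):

```shell
# contigFilter="minReads minLength singleReadSpan lowCovFraction lowCovDepth"
# Canu's documented default is "2 0 1.0 0.5 5"; the value used in this thread,
#   contigFilter="2 0 1.0 0.5 0"
# sets lowCovDepth to 0, which effectively disables the low-coverage filter
# while keeping the minimum of 2 reads per contig.
```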
What is your actual input read-length distribution (or, better, the full Canu run output)? The reason you are getting the 1 kb sequence is that you increased the output coverage to 100x. By default, Canu will keep only the longest reads, as long as they represent the sequence of all the data. That implies your 3.5 kb read/contig has similarity to whatever your target is as well. Have you looked at what it is?
Not sure where it is from; it does not give anything in BLAST. From looking at the BED file, I do have a few longer reads at other places, probably just some non-specific PCR products or contamination. 3 kb is longer than the genome size I set, though; I thought Canu would look at the genome size and not output this? The full output is attached.
Would it make sense to run Canu on only the chromosome/window I am interested in?
Thanks for your time!
Genome size doesn't matter for anything except setting some default resource parameters and reporting statistics; reads and the assembly won't be filtered based on genome size. Your input has lots of data >1 kb (about 70% of reads and the majority of all bases), so it's not surprising you're getting an assembly of that data. I expect the 3 kb contig is built from these longer reads you input into the assembler.
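As a quick sanity check on the read-length distribution, an awk one-liner like this (assuming a standard 4-line-per-record FASTQ named pool.fastq, as in the command above) reports the fraction of reads over 1 kb:

```shell
# Count reads and the fraction longer than 1000 bp in a 4-line FASTQ.
# NR%4==2 selects the sequence line of each record.
awk 'NR%4==2 { n++; if (length($0) > 1000) long++ }
     END { printf "%d reads, %.1f%% > 1 kb\n", n, 100*long/n }' pool.fastq
```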
If you know for sure that none of the data >1 kb is what you're interested in, you can throw out any reads >1 kb before running Canu.
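A minimal sketch of that pre-filter, assuming a standard 4-line-per-record FASTQ (the file name pool.fastq and the 1000 bp cutoff are taken from this thread; adjust as needed). Tools like seqkit can do the same job, but plain awk avoids an extra dependency:

```shell
# Keep only FASTQ records whose sequence is <= 1000 bp.
# Buffers each record's header/sequence/plus lines, then prints the record
# once the quality line arrives and the length check passes.
awk 'NR%4==1 { h=$0 }
     NR%4==2 { s=$0 }
     NR%4==3 { p=$0 }
     NR%4==0 { if (length(s) <= 1000) print h "\n" s "\n" p "\n" $0 }' \
    pool.fastq > pool.short.fastq
```

The filtered pool.short.fastq can then be passed to canu in place of pool.fastq.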
Apologies, I finally understood what is happening with the longer (3.5 kb) contig!
My reads are MinION data. I actually do have hits in BLAST for the 3.5 kb contig: all phage genomes. It is the calibration strand provided as a positive control by Oxford Nanopore! I missed it previously because I was only looking at the alignment to the human genome, and at the BLAST results against the human genome too. It is actually an impressively good assembly of it (99.10% identity to phage lambda)!
I threw out all reads >2 kb beforehand and got only the ~1 kb contig I am interested in.
Thanks!
I am very confused! The command I was using yesterday does not seem to work today.
I am computing the consensus assembly of MinION reads of ~1kb at 800x coverage.
My command is:
I would swear this was running like a charm, and today it outputs an empty contigs.fasta file. A few clues: unitigs.fasta is empty too, and unassembled.fasta contains 5 reads which look roughly the right size. Can you help?