Closed abayega closed 6 years ago
It's not explicitly what it was designed for but could work. You probably would only need the correction step and then use the resulting data for analysis (I assume you don't want to try to assemble the cDNA sequences together). You can lower both read length and overlap length and you'd want to correct all the data (corOutCoverage=all in Canu 1.6).
I will try it and report back. Thanks
It'll be interesting to see what effect Canu trimming would have on the adapter sequences of reads.They'd be likely to have a lot of consistency for reads that spanned the full length of a transcript, so I would assume both adapter ends would be preserved in the trimmed reads (despite appearing in almost every sequence).
It works surprisingly well... and fast
Which Canu version and settings are you using? I am using Canu1.6, on a system with 256Gb RAM, and 48 processors but its now taken 64 hours and counting on 1.5 million reads from RNA-Seq (median length 1kb)
Are you just doing the correction step, or a full assembly?
Correction first, then trimming. Not sure Canu can do this type of assembly. Also, we trim adapter before Canu correction
I get a very slow cormhap
runtime with the following on a 1.539 Gb dataset (1.585M reads), as in 3 days for one of the 86 processes, run on a 12-thread computer with 64GB memory:
~/install/canu/canu-1.6/Linux-amd64/bin/canu -correct -nanopore-raw fastq_runid_*.fastq -p Olivier_GL261_cDNA_2017-Aug-04 -d Olivier_GL261_cDNA_2017-Aug-04 corOutCoverage=all minReadLength=100 minOverlapLength=50 genomeSize=500M
The precompute step took 13182 seconds.
running with overlapper=minimap
allowed canu to progress much quicker. I've also adjusted the minimum and overlap lengths:
~/install/canu/canu-1.6/Linux-amd64/bin/canu -correct -nanopore-corrected fastq_runid_*.fastq -p Olivier_GL261_cDNA_2017-Aug-04 -d Olivier_GL261_cDNA_2017-Aug-04 corOutCoverage=all minReadLength=500 minOverlapLength=250 genomeSize=500M overlapper=minimap
precompute: 118 seconds cormhap: 4287 seconds
... except it fails:
----------------------------------------
--
-- MiniMap overlap jobs failed, retry.
-- job correction/1-overlapper/results/000001.ovb FAILED.
-- job correction/1-overlapper/results/000005.ovb FAILED.
-- job correction/1-overlapper/results/000006.ovb FAILED.
-- job correction/1-overlapper/results/000008.ovb FAILED.
-- job correction/1-overlapper/results/000009.ovb FAILED.
-- job correction/1-overlapper/results/000010.ovb FAILED.
--
The errors are not consistent. This one looks like it might be a minimap2 bug (which I guess @lh3 should know about):
$ cat Olivier_GL261_cDNA_2017-Aug-04/correction/1-overlapper/mmap.000001.out
Running job 1 based on command line options.
Fetch blocks/000002.fasta
Fetch blocks/000003.fasta
[M::main::2.322*1.00] loaded/built the index for 192000 target sequence(s)
[M::mm_mapopt_update::2.850*1.00] mid_occ = 161
[M::mm_idx_stat] kmer size: 17; skip: 11; is_HPC: 0; #seq: 192000
[M::mm_idx_stat::3.199*1.00] distinct minimizers: 22988070 (87.75% are singletons); average occurrences: 1.586; average spacing: 6.076
[M::worker_pipeline::415.190*10.83] mapped 192000 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/canu/canu-1.6/Linux-amd64/bin/minimap2 -x ava-ont -c -k17 -w11 -t 12 ./blocks/000001.mmi ./blocks/000001.fasta
[M::main] Real time: 415.221 sec; CPU: 4496.800 sec
[M::main::2.148*1.00] loaded/built the index for 192000 target sequence(s)
[M::mm_mapopt_update::2.636*1.00] mid_occ = 161
[M::mm_idx_stat] kmer size: 17; skip: 11; is_HPC: 0; #seq: 192000
[M::mm_idx_stat::2.927*1.00] distinct minimizers: 22988070 (87.75% are singletons); average occurrences: 1.586; average spacing: 6.076
[M::worker_pipeline::257.427*10.99] mapped 192000 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/canu/canu-1.6/Linux-amd64/bin/minimap2 -x ava-ont -c -k17 -w11 -t 12 ./blocks/000001.mmi queries/000001/000002.fasta
[M::main] Real time: 257.477 sec; CPU: 2829.252 sec
[M::main::1.905*1.00] loaded/built the index for 192000 target sequence(s)
[M::mm_mapopt_update::2.392*1.00] mid_occ = 161
[M::mm_idx_stat] kmer size: 17; skip: 11; is_HPC: 0; #seq: 192000
[M::mm_idx_stat::2.729*1.00] distinct minimizers: 22988070 (87.75% are singletons); average occurrences: 1.586; average spacing: 6.076
[M::worker_pipeline::190.526*10.95] mapped 192000 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/canu/canu-1.6/Linux-amd64/bin/minimap2 -x ava-ont -c -k17 -w11 -t 12 ./blocks/000001.mmi queries/000001/000003.fasta
[M::main] Real time: 190.568 sec; CPU: 2086.056 sec
INVALID OVERLAP 86395 (len 2017) 1 (len 871) hangs 1142 708 - 2096069 1767 flip 1
gringer@elegans:/mnt/gg_nanopore/gringer/reads_Olivier_GL261_cDNA_2017-Aug-04/called_albacore_2.0.2_all/workspace/pass$ /home/gringer/install/canu/canu-1.6/Linux-amd64/bin/minimap2 -x ava-ont -c -k17 -w11 -t 12 ./blocks/000001.mmi queries/000001/000003.fasta
[ERROR] failed to open file './blocks/000001.mmi'
This one looks like it might have a canu problem as well:
$ cat Olivier_GL261_cDNA_2017-Aug-04/correction/1-overlapper/mmap.000010.out
Running job 10 based on command line options.
[M::main::3.025*1.00] loaded/built the index for 178470 target sequence(s)
[M::mm_mapopt_update::3.835*1.00] mid_occ = 108
[M::mm_idx_stat] kmer size: 17; skip: 11; is_HPC: 0; #seq: 178470
[M::mm_idx_stat::4.307*1.00] distinct minimizers: 25871645 (90.64% are singletons); average occurrences: 1.382; average spacing: 6.074
[M::worker_pipeline::213.853*10.81] mapped 178470 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/canu/canu-1.6/Linux-amd64/bin/minimap2 -x ava-ont -c -k17 -w11 -t 12 ./blocks/000006.mmi ./blocks/000006.fasta
[M::main] Real time: 213.890 sec; CPU: 2312.716 sec
ls: cannot access 'queries/000010/*.fasta': No such file or directory
INVALID OVERLAP 1131471 (len 24723) 960001 (len 671) hangs 9531 14869 - 2082813 14689 flip 1
This might be a bug related to me only partially deleting directories. I'm not seeing these errors when I do a re-run of minimap using the FASTA files from the canu directories. I'll have another go at running this through from a fresh directory.
Nope, restarting didn't help for the overlap error. Looks like I need to do a bit of code sleuthing to work out what's going on here.
Edit: I see that the 'INVALID OVERLAP' message is from canu:
~/install/canu/canu/src$ grep -r 'INVALID OVERLAP' *
mhap/mhapConvert.C: fprintf(stderr, "INVALID OVERLAP %8u (len %6d) %8u (len %6d) hangs %6lu %6lu - %6lu %6lu flip %lu\n",
minimap/mmapConvert.C: fprintf(stderr, "INVALID OVERLAP %8u (len %6d) %8u (len %6d) hangs %6lu %6lu - %6lu %6lu flip %lu\n",
Thanks, really helpful
An issue for this error has been created for minimap2, because it seems like starts and ends are being reported that are greater than the read length:
First, it's not surprising MHAP is slow with a min overlap of 250, min read 500. It is not really designed for short reads, we were assuming minimum overlaps >1k when optimizing it. Given the sketch sizes it uses, you're indexing the entire read which means you don't get much benefit from the minhash.
The conversion can fail if the conversion gets confused and uses the wrong read IDs when parsing the bounds and so it will look like the coordinates are out of bounds. Looking at the minimap2 issue it looks like it was a minimap2 bug but in future you can run with saveOverlaps=true
which should keep the raw minimap2 output and all intermediates so it would be easier to figure out which program is failing.
Heng Li has just mentioned something in that issue that may make a difference when using minimap2 with Canu for determining overlaps:
...usually I wouldn't recommend to perform alignment (option -c) in the overlapping mode. First, this is much slower. Second, it introduces false overlaps. On pacbio C. elegans reads, doing alignment leads to a few more misassemblies.
Heng is talking about overlapping raw sequences. Canu should run minimap2 on the raw reads without the -c (the correction step). For trimming/unitigging, I'll argue you want an identity estimate so not overlaps are treated the same (since we're trying to resolve repeats below the read error rate) and the only way to get that from minimap2 is to compute the alignment.
@skoren CANU should work with the RNA seq data. But how to implement the two feature in RNA assembly such as cufflinks?
-F/--min-isoform-fraction suppress transcripts below this abundance level [ default: 0.10 ]
-j/--pre-mrna-fraction suppress intra-intronic transcripts below this level [ default: 0.15 ]
Should we correct out the reads by canu, then use cufflinks to assembly it? I'm not sure whether cufflinks support long reads.
Canu definitely won't work to assemble the RNA data. I would only advise using it for correction/trimming. If the genes are full length from the sequencing (which they should be) you shouldn't need to do assembly. You can just treat the corrected output as assembled transcripts. You can probably either map the resulting corrected data to a reference or do denovo clustering of the isoforms to identify genes/coverages but this isn't my expertise so I'm sure others can suggest better downstream analysis.
Closing, inactive. Seems correction/trimming works for RNA data but minimap2 is better for this application within Canu than the default overlapper.
@skoren:
the only way to get that from minimap2 is to compute the alignment
Minimap2 HEAD now estimates sequence divergence (i.e. 1-identity) without alignment, using an approach similar to mashmap. It seems to work, though I am not sure if it works well on your data. I'd love to hear your feedback in case you are interested. The sequence divergence is written to a new dv:f
tag in the PAF/SAM output.
Hi,
I have full length cDNA nanopore sequencing and would like to use canu to find consesus from these reads and do trimming. Is this something within canu abilities? If so, can I lower read length from 1000 to 400bp?