marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/
660 stars 179 forks source link

rna seq analysis #641

Closed abayega closed 6 years ago

abayega commented 7 years ago

Hi,

I have full length cDNA nanopore sequencing and would like to use canu to find consesus from these reads and do trimming. Is this something within canu abilities? If so, can I lower read length from 1000 to 400bp?

skoren commented 7 years ago

It's not explicitly what it was designed for but could work. You probably would only need the correction step and then use the resulting data for analysis (I assume you don't want to try to assemble the cDNA sequences together). You can lower both read length and overlap length and you'd want to correct all the data (corOutCoverage=all in Canu 1.6).

abayega commented 7 years ago

I will try it and report back. Thanks

gringer commented 7 years ago

It'll be interesting to see what effect Canu trimming would have on the adapter sequences of reads.They'd be likely to have a lot of consistency for reads that spanned the full length of a transcript, so I would assume both adapter ends would be preserved in the trimmed reads (despite appearing in almost every sequence).

noncodo commented 7 years ago

It works surprisingly well... and fast

abayega commented 7 years ago

Which Canu version and settings are you using? I am using Canu1.6, on a system with 256Gb RAM, and 48 processors but its now taken 64 hours and counting on 1.5 million reads from RNA-Seq (median length 1kb)

gringer commented 7 years ago

Are you just doing the correction step, or a full assembly?

abayega commented 7 years ago

Correction first, then trimming. Not sure Canu can do this type of assembly. Also, we trim adapter before Canu correction

gringer commented 7 years ago

I get a very slow cormhap runtime with the following on a 1.539 Gb dataset (1.585M reads), as in 3 days for one of the 86 processes, run on a 12-thread computer with 64GB memory:

~/install/canu/canu-1.6/Linux-amd64/bin/canu -correct -nanopore-raw fastq_runid_*.fastq -p Olivier_GL261_cDNA_2017-Aug-04 -d Olivier_GL261_cDNA_2017-Aug-04 corOutCoverage=all minReadLength=100 minOverlapLength=50 genomeSize=500M

The precompute step took 13182 seconds.

gringer commented 7 years ago

running with overlapper=minimap allowed canu to progress much quicker. I've also adjusted the minimum and overlap lengths:

~/install/canu/canu-1.6/Linux-amd64/bin/canu -correct -nanopore-corrected fastq_runid_*.fastq -p Olivier_GL261_cDNA_2017-Aug-04 -d Olivier_GL261_cDNA_2017-Aug-04 corOutCoverage=all minReadLength=500 minOverlapLength=250 genomeSize=500M overlapper=minimap

precompute: 118 seconds cormhap: 4287 seconds

... except it fails:

----------------------------------------
--
-- MiniMap overlap jobs failed, retry.
--   job correction/1-overlapper/results/000001.ovb FAILED.
--   job correction/1-overlapper/results/000005.ovb FAILED.
--   job correction/1-overlapper/results/000006.ovb FAILED.
--   job correction/1-overlapper/results/000008.ovb FAILED.
--   job correction/1-overlapper/results/000009.ovb FAILED.
--   job correction/1-overlapper/results/000010.ovb FAILED.
--

The errors are not consistent. This one looks like it might be a minimap2 bug (which I guess @lh3 should know about):

$ cat Olivier_GL261_cDNA_2017-Aug-04/correction/1-overlapper/mmap.000001.out 
Running job 1 based on command line options.
Fetch blocks/000002.fasta
Fetch blocks/000003.fasta
[M::main::2.322*1.00] loaded/built the index for 192000 target sequence(s)
[M::mm_mapopt_update::2.850*1.00] mid_occ = 161
[M::mm_idx_stat] kmer size: 17; skip: 11; is_HPC: 0; #seq: 192000
[M::mm_idx_stat::3.199*1.00] distinct minimizers: 22988070 (87.75% are singletons); average occurrences: 1.586; average spacing: 6.076
[M::worker_pipeline::415.190*10.83] mapped 192000 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/canu/canu-1.6/Linux-amd64/bin/minimap2 -x ava-ont -c -k17 -w11 -t 12 ./blocks/000001.mmi ./blocks/000001.fasta
[M::main] Real time: 415.221 sec; CPU: 4496.800 sec
[M::main::2.148*1.00] loaded/built the index for 192000 target sequence(s)
[M::mm_mapopt_update::2.636*1.00] mid_occ = 161
[M::mm_idx_stat] kmer size: 17; skip: 11; is_HPC: 0; #seq: 192000
[M::mm_idx_stat::2.927*1.00] distinct minimizers: 22988070 (87.75% are singletons); average occurrences: 1.586; average spacing: 6.076
[M::worker_pipeline::257.427*10.99] mapped 192000 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/canu/canu-1.6/Linux-amd64/bin/minimap2 -x ava-ont -c -k17 -w11 -t 12 ./blocks/000001.mmi queries/000001/000002.fasta
[M::main] Real time: 257.477 sec; CPU: 2829.252 sec
[M::main::1.905*1.00] loaded/built the index for 192000 target sequence(s)
[M::mm_mapopt_update::2.392*1.00] mid_occ = 161
[M::mm_idx_stat] kmer size: 17; skip: 11; is_HPC: 0; #seq: 192000
[M::mm_idx_stat::2.729*1.00] distinct minimizers: 22988070 (87.75% are singletons); average occurrences: 1.586; average spacing: 6.076
[M::worker_pipeline::190.526*10.95] mapped 192000 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/canu/canu-1.6/Linux-amd64/bin/minimap2 -x ava-ont -c -k17 -w11 -t 12 ./blocks/000001.mmi queries/000001/000003.fasta
[M::main] Real time: 190.568 sec; CPU: 2086.056 sec
INVALID OVERLAP    86395 (len   2017)        1 (len    871) hangs   1142    708 - 2096069   1767 flip 1
gringer@elegans:/mnt/gg_nanopore/gringer/reads_Olivier_GL261_cDNA_2017-Aug-04/called_albacore_2.0.2_all/workspace/pass$ /home/gringer/install/canu/canu-1.6/Linux-amd64/bin/minimap2 -x ava-ont -c -k17 -w11 -t 12 ./blocks/000001.mmi queries/000001/000003.fasta
[ERROR] failed to open file './blocks/000001.mmi'

This one looks like it might have a canu problem as well:

$ cat Olivier_GL261_cDNA_2017-Aug-04/correction/1-overlapper/mmap.000010.out 
Running job 10 based on command line options.
[M::main::3.025*1.00] loaded/built the index for 178470 target sequence(s)
[M::mm_mapopt_update::3.835*1.00] mid_occ = 108
[M::mm_idx_stat] kmer size: 17; skip: 11; is_HPC: 0; #seq: 178470
[M::mm_idx_stat::4.307*1.00] distinct minimizers: 25871645 (90.64% are singletons); average occurrences: 1.382; average spacing: 6.074
[M::worker_pipeline::213.853*10.81] mapped 178470 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/canu/canu-1.6/Linux-amd64/bin/minimap2 -x ava-ont -c -k17 -w11 -t 12 ./blocks/000006.mmi ./blocks/000006.fasta
[M::main] Real time: 213.890 sec; CPU: 2312.716 sec
ls: cannot access 'queries/000010/*.fasta': No such file or directory
INVALID OVERLAP  1131471 (len  24723)   960001 (len    671) hangs   9531  14869 - 2082813  14689 flip 1

This might be a bug related to me only partially deleting directories. I'm not seeing these errors when I do a re-run of minimap using the FASTA files from the canu directories. I'll have another go at running this through from a fresh directory.

gringer commented 7 years ago

Nope, restarting didn't help for the overlap error. Looks like I need to do a bit of code sleuthing to work out what's going on here.

Edit: I see that the 'INVALID OVERLAP' message is from canu:

~/install/canu/canu/src$ grep -r 'INVALID OVERLAP' *
mhap/mhapConvert.C:        fprintf(stderr, "INVALID OVERLAP %8u (len %6d) %8u (len %6d) hangs %6lu %6lu - %6lu %6lu flip %lu\n",
minimap/mmapConvert.C:        fprintf(stderr, "INVALID OVERLAP %8u (len %6d) %8u (len %6d) hangs %6lu %6lu - %6lu %6lu flip %lu\n",
abayega commented 7 years ago

Thanks, really helpful

gringer commented 7 years ago

An issue for this error has been created for minimap2, because it seems like starts and ends are being reported that are greater than the read length:

https://github.com/lh3/minimap2/issues/30

skoren commented 7 years ago

First, it's not surprising MHAP is slow with a min overlap of 250, min read 500. It is not really designed for short reads, we were assuming minimum overlaps >1k when optimizing it. Given the sketch sizes it uses, you're indexing the entire read which means you don't get much benefit from the minhash.

The conversion can fail if the conversion gets confused and uses the wrong read IDs when parsing the bounds and so it will look like the coordinates are out of bounds. Looking at the minimap2 issue it looks like it was a minimap2 bug but in future you can run with saveOverlaps=true which should keep the raw minimap2 output and all intermediates so it would be easier to figure out which program is failing.

gringer commented 7 years ago

Heng Li has just mentioned something in that issue that may make a difference when using minimap2 with Canu for determining overlaps:

...usually I wouldn't recommend to perform alignment (option -c) in the overlapping mode. First, this is much slower. Second, it introduces false overlaps. On pacbio C. elegans reads, doing alignment leads to a few more misassemblies.

skoren commented 7 years ago

Heng is talking about overlapping raw sequences. Canu should run minimap2 on the raw reads without the -c (the correction step). For trimming/unitigging, I'll argue you want an identity estimate so not overlaps are treated the same (since we're trying to resolve repeats below the read error rate) and the only way to get that from minimap2 is to compute the alignment.

wangyugui commented 7 years ago

@skoren CANU should work with the RNA seq data. But how to implement the two feature in RNA assembly such as cufflinks?

  -F/--min-isoform-fraction    suppress transcripts below this abundance level       [ default:   0.10 ]
  -j/--pre-mrna-fraction       suppress intra-intronic transcripts below this level  [ default:   0.15 ]

Should we correct out the reads by canu, then use cufflinks to assembly it? I'm not sure whether cufflinks support long reads.

skoren commented 7 years ago

Canu definitely won't work to assemble the RNA data. I would only advise using it for correction/trimming. If the genes are full length from the sequencing (which they should be) you shouldn't need to do assembly. You can just treat the corrected output as assembled transcripts. You can probably either map the resulting corrected data to a reference or do denovo clustering of the isoforms to identify genes/coverages but this isn't my expertise so I'm sure others can suggest better downstream analysis.

skoren commented 6 years ago

Closing, inactive. Seems correction/trimming works for RNA data but minimap2 is better for this application within Canu than the default overlapper.

lh3 commented 6 years ago

@skoren:

the only way to get that from minimap2 is to compute the alignment

Minimap2 HEAD now estimates sequence divergence (i.e. 1-identity) without alignment, using an approach similar to mashmap. It seems to work, though I am not sure if it works well on your data. I'd love to hear your feedback in case you are interested. The sequence divergence is written to a new dv:f tag in the PAF/SAM output.