marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

unitiging fails on *simulated* CCS data #1633

Closed egoltsman closed 4 years ago

egoltsman commented 4 years ago

Hi Sergey,

I have a simulated data set that I'm using to benchmark assembler accuracy, and Canu fails to build any unitigs from it. The reads are sampled from the C. elegans reference, with read lengths fixed at 15 kb and a Poisson depth distribution peaking at 30x. We applied our crude approximation of the PacBio error model, at a rate of 1-2% indels per read. Something about this error profile trips up Canu. I suspect the error rate because an equivalent error-free simulation assembles just fine. The 5-consensus directory contains over 57000 consensus*.out files and, with a few exceptions, they all say "Processed 0 tigs and 1 singleton". Something clearly went wrong upstream, and the unitigger.err file suggests that most overlaps were rejected. I'm attaching that file below.

Can you please let me know if there are some cutoffs one can adjust to allow this data to assemble?

This is the 2.0 version I checked out of the repo two weeks ago: Canu branch hicanu_rc +325 changes (r9818 86bb2e221546c76437887d3a0ff5ab9546f85317)

Canu was executed with this command:

canu pacbio-hifi /global/cscratch1/sd/eugeneg/SEQ_DATA_FOR_HIMPER/C.elegans/C_elegans.CCS-sim.15K.30x.q40.fq -d $(pwd)/canu.v2.0_asm.CCS-sim.15K.30x -p canu.v2.0_asm -genomeSize=100m -maxThreads=32 -maxMemory=400g -stageDirectory=$(pwd) -useGrid=False

Thank you!
Gene

unitigger.zip

egoltsman commented 4 years ago

p.s. I have tried the release version (1.9) and saw similar issues there.

skoren commented 4 years ago

In short, your simulated data doesn't really reflect real HiFi data. HiFi reads are typically cut off at Q20, which is 99% accuracy, so 1% error is the upper bound. The median read quality is usually above Q30, or 99.9% accuracy. The errors are also not random indels but are concentrated in homopolymers and simple-sequence repeats. Since Canu is optimized to deal with the errors observed in real data, it can't sufficiently correct your simulated data. You may be able to assemble this data with -pacbio-corrected, as your simulation is more similar to those reads.
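
For anyone checking the Q-value arithmetic above, here is a minimal sketch of the standard Phred conversion (Q = -10 * log10(p), where p is the per-base error probability). The function names are made up for illustration and are not part of Canu.

```python
import math


def phred_to_error(q: float) -> float:
    """Per-base error probability for a Phred quality score Q."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred quality score for a per-base error probability p."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Q20 is the usual HiFi cutoff mentioned above; Q30 is the typical median.
    for q in (20, 30):
        p = phred_to_error(q)
        print(f"Q{q}: error rate {p:.2%}, accuracy {1 - p:.2%}")
```

By this conversion, a simulated read with 1-2% errors sits at or below the Q20 cutoff rather than near the typical median.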

egoltsman commented 4 years ago

Thank you Sergey! I re-did the simulation with 0.01% error, and that worked fine.
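
To put the two simulations side by side, here is a rough sketch of the expected number of errors in one 15 kb read at each error rate. It assumes the quoted percentages are uniform per-base rates, which is my reading of the thread and not something stated in it; the helper name is hypothetical.

```python
import math

READ_LENGTH = 15_000  # fixed read length from the simulation described above


def expected_errors(read_length: int, error_rate: float) -> float:
    """Expected error count in one read, assuming a uniform per-base error rate."""
    return read_length * error_rate


if __name__ == "__main__":
    # 1% and 2% bracket the original simulation; 0.01% is the re-run that worked.
    for rate in (0.01, 0.02, 0.0001):
        errors = expected_errors(READ_LENGTH, rate)
        q = -10 * math.log10(rate)  # equivalent Phred score
        print(f"{rate:.2%} error (about Q{q:.0f}): ~{errors:.1f} errors per read")
```

Under that assumption the original reads carry on the order of 150 to 300 errors each, while the 0.01% re-run carries one or two.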