marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/
659 stars 179 forks source link

Canu trio binning with HiFi data #1615

Closed egoltsman closed 4 years ago

egoltsman commented 4 years ago

Hi Sergey, This is in regards to Canu v. 1.9 and the trio binning workflow. I could confirm with my local install that it works as described in the documentation when the F1 reads are provided with the -pacbio-raw option. From reading issue #1559 it seems like it's possible to use HiFi (CCS) reads instead, (although the user did this in a stepwise fashion). Is this a supported workflow, or are we jumping the gun here? When I execute the same set of commands given in the documentation but with -pacbio-hifi or -pacbio-corrected instead of -pacbio-raw, then no haplotyping is performed.

So the best thing I could try is run the haplotyping step alone, pretending that the long reads are of CLR type (in case that even matters at this stage), and then running two separate assemblies on the binned reads as -pacbio-hifi. That seemed to work, but the reads didn't get separated, with 98% of the reads piling up in one haplotype. Combing through the outputs, the kmer distributions look good to me for both parents, with distinct heterozygous and homozygous peaks in each. But in the file splitHaplotype.000001.out there's a line that suggests that for one of the haplotypes the kmer selection picked the depth cutoffs incorrectly:

--  Haplotype './0-kmers/haplotype-M05.meryl':
--   use kmers with frequency at least 1021.

Naturally, very few kmers got selected for this bin, and so the binning largely failed. Going by the histogram, the frequency peaks for this parental data set were at 14x and 28x for the hetero- and homozygous kmers, which seems ok. For some reason, the program skipped past them and picked the low threshold way out in the outliers.

Has this been reported before, and is there a way to force/override the depth thresholds at this step?

Thank you in advance!

skoren commented 4 years ago

You can bin the HiFi data in a stepwise fashion, this is the only currently supported way to bin HiFi data. Typically, we don't bin HiFi data at all as we can separate haplotypes and preserve large phase blocks w/o trio information using it. You could also try this strategy (just assemble the data) followed by purge_dups.

If you do want to bin, there is a way to overwrite the threshold manually. Edit the splitHaplotype.sh file to remove the statistics file and change it to an int, e.g. -H ./0-kmers/haplotype-O157.meryl ./0-kmers/reads-O157.statistics ./haplotype-O157.fasta.gz would become -H ./0-kmers/haplotype-O157.meryl 5 ./haplotype-O157.fasta.gz. Remove haplotype.log and any fasta.gz files and re-run splitHaplotype.sh by hand.

Lastly, can you share the statistics files to see why such a high threshold was selected? You should be able to attach them to the issue.

egoltsman commented 4 years ago

Hi Sergey, Thank you for a prompt reply. I will try what you suggested. I looked at the code in splitHaplotype.c, , and the reason for it failing to find a reasonable cutoff in my case is that my first genomic peak is not quite 2 times the magnitude of the last chosen through, i.e. this condition is never satisfied:  L390:   if (2 minAve aveSize < thisSum) L391:        break;

I guess this is a reasonable expectation for well covered Illumina data sets. Mine is only 15x average, so the through is at 5x, roughly, which means there is likely a large mix of error-kmers and true low-depth kmers at the through.  Thanks again, Eugene  

egoltsman commented 4 years ago

I couldn't find splitHaplotype.sh in my install, probably because I've downloaded the binary. I just tried installing from source but had trouble with the library linking etc.

g++ --version g++ (SUSE Linux) 7.4.1 20190905 [gcc-7-branch revision 275407]

. . . g++ -o /global/u1/e/eugeneg/utils/canu-master/Linux-amd64/obj/bin/dumpBlob/stores/dumpBlob.o -c -MD -g3 -O4 -funroll-loops -fexpensive-optimizations -finline-functions -fomit-frame-pointer -Wall -Wextra -Wformat -Wno-write-strings -Wno-char-subscripts -Wno-sign-compare -Wno-unused-function -Wno-unused-parameter -Wno-unused-variable -Wno-deprecated-declarations -std=c++11 -pthread -fopenmp -fPIC -DLIBBACKTRACE -I/global/u1/e/eugeneg/utils/canu-master/src -Iutility stores/dumpBlob.C g++ -o /global/u1/e/eugeneg/utils/canu-master/Linux-amd64/bin/dumpBlob -pthread -fopenmp -lm -L/global/u1/e/eugeneg/utils/canu-master/Linux-amd64/lib /global/u1/e/eugeneg/utils/canu-master/Linux-amd64/obj/bin/dumpBlob/stores/dumpBlob.o -lcanu /usr/lib64/gcc/x86_64-suse-linux/7/../../../../x86_64-suse-linux/bin/ld: /global/u1/e/eugeneg/utils/canu-master/Linux-amd64/lib/libcanu.a(mmap.o): in function backtrace_vector_grow': mmap.c:(.text+0x271): undefined reference to__intel_avx_rep_memcpy'

skoren commented 4 years ago

There's no splitHaplotype.sh in the canu distribution, it's auto generated and will be in your run directory under the haplotype folder.

egoltsman commented 4 years ago

Ah, of course. Running now.
Thank you!