marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

trio-binning with hicanu #1868

Closed · zhenzhenyang-psu closed this issue 3 years ago

zhenzhenyang-psu commented 3 years ago

hello Sergey, I have 30x HiFi reads that I would like to assemble with Trio-Canu to get haplotype-specific assemblies. I don't have trio information; however, by other approaches I have partitioned 2/3 of the input HiFi reads into two haplotypes.

canu -haplotype -p trio -d trio_canu_out genomeSize=3g useGrid=remote -haplotype1 haplotype1_reads.fq -haplotype2 haplotype2_reads.fq -pacbio-hifi all.hifi.comb.fq

Do you know whether Canu will still run if I pass HiFi reads (instead of Illumina reads from the parents) as the haplotype data?

In my dataset, the remaining 1/3 of the HiFi reads are not assigned to either haplotype. I wonder how Canu deals with them. Does it split them across the two haplotypes?

Would you mind sharing, for human data, what percentage of partitioned HiFi reads is needed to yield good assembly statistics?

If I ran trio-canu on HG002, would it produce the HiFi reads sorted by haplotype? Thanks very much! Looking forward to your answers.

In hifiasm paper, they mentioned the following:

For trio-binning assembly, we ran HiCanu in two steps as recommended. We partitioned the HiFi reads by parental short reads with the following command:

canu -haplotype -p asm -d <outDir> genomeSize=<GSize> useGrid=false \
  maxThreads=<nThreads> -haplotypePat <pat-reads.fq> -haplotypeMat <mat-reads.fq> \
  -pacbio-raw <HiFi-reads.fasta>

Note that '-pacbio-raw' was used to partition HiFi reads, following the HiCanu documentation. We then performed HiCanu assemblies on the partitioned reads.

From this, it looks like HiCanu will separate the input HiFi reads into two haplotypes and then assemble each separately in a haploid assembly mode?

Any more comments on this would be greatly appreciated! Zhenzhen

skoren commented 3 years ago

The trio binning is based on k-mers, so any reliable data (HiFi or Illumina) for the parents would be OK. This is what the hifiasm paper did; the reason they provided -pacbio-raw for the HiFi data was the requirement that the data to be split is uncorrected. The two partitions are assembled as HiFi data later on.

It doesn't sound like you have HiFi data for the parents in your case; you're instead partitioning the HiFi data yourself? In that case, it doesn't make sense to use trio binning: you've already binned the data. Just assemble the partitioned reads separately; you'd probably need to randomly distribute the unbinned data too. In the trio case, Canu would put the unbinned reads into both haplotypes for assembly as well. Human data usually has about 30% unassigned; it varies by individual and by read length somewhat, but is similar to your experience.
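For concreteness, a minimal sketch of that binning step with parental HiFi reads as the k-mer source, following the hifiasm recipe quoted above. All file names and parameter values here are placeholders, not your actual data:

# Sketch only: bin child HiFi reads using parental HiFi reads as the k-mer
# source. Per the hifiasm recipe, the child reads are passed as uncorrected
# input via -pacbio-raw so that the binning step accepts them.
canu -haplotype -p asm -d trio_binning \
  genomeSize=3g useGrid=false maxThreads=16 \
  -haplotypePat pat.hifi.fq \
  -haplotypeMat mat.hifi.fq \
  -pacbio-raw child.hifi.fq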

I'd expect that HiFi data is good enough that you can get good haplotype separation without any partitioning of the data (phase block length will depend on the heterozygosity of your sample). You're going to have a hard time beating the default HiCanu HiFi haplotype separation with a custom scheme. So I'd just run the full dataset with default HiCanu parameters and run purge_dups afterwards to get a primary and alt set of contigs. You can always compare this assembly to your bins, or use the binned reads to validate/estimate the phase block lengths.
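A sketch of that workflow, assuming minimap2 and purge_dups are installed and using placeholder file names throughout:

# Sketch: default HiCanu run on the full read set, then purge_dups to split
# primary and alternate contigs. File names are placeholders.
canu -p asm -d hicanu_out genomeSize=3g -pacbio-hifi all.hifi.comb.fq

# Map reads back to the assembly and compute coverage cutoffs.
minimap2 -x map-hifi hicanu_out/asm.contigs.fasta all.hifi.comb.fq | gzip -c > aln.paf.gz
pbcstat aln.paf.gz            # writes PB.base.cov and PB.stat
calcuts PB.stat > cutoffs

# Self-align the split assembly and purge duplicated haplotigs.
split_fa hicanu_out/asm.contigs.fasta > asm.split.fa
minimap2 -x asm5 -DP asm.split.fa asm.split.fa | gzip -c > self.paf.gz
purge_dups -2 -T cutoffs -c PB.base.cov self.paf.gz > dups.bed
get_seqs -e dups.bed hicanu_out/asm.contigs.fasta   # writes purged.fa (primary) and hap.fa (alt)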

zhenzhenyang-psu commented 3 years ago

Hi Sergey, thanks very much for your reply, very insightful and helpful indeed. You are right that I have already partitioned the data into bins, so I should not do trio binning. The reason I am running the trio binning mode is that I would like to use certain modules of HiCanu to assemble my two haplotypes separately.

"Just assemble the partitioned reads separately, you'd probably need to randomly distribute the unbinned data too. " in this case, I can throw the remaining 1/3 of the reads to both haplotypes and assemble the two haplotypes separately. For HIFI reads in my case, should I try early versions of canu? or should I run hicanu? As for hicanu, I just want to constraint hicanu in a way so that it assembles contigs with fewer phase block switches than the default mode. Any suggestions are greatly appreciated! thanks much, zhenzhen

skoren commented 3 years ago

Once you have partitions, there is no difference between triocanu and running two separate assemblies of the data. That's what triocanu does; the main addition when you run triocanu is that it will bin the data for you.

Run HiCanu as two separate runs, one for each of your bins. To keep coverage consistent, randomly split the unbinned reads between the two partitions. You may still need to run purge_dups on the bins though: unless your bins are perfect and have no erroneously assigned reads, the mis-binned reads will generate multiple haplotypes in the assemblies.
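For example, a sketch of those two runs. File names are placeholders, with unbinned_half1.fq/unbinned_half2.fq being the random halves of the unassigned reads:

# Sketch: one HiCanu run per bin, each padded with half of the unbinned reads.
canu -p hap1 -d hap1_asm genomeSize=3g -pacbio-hifi haplotype1_reads.fq unbinned_half1.fq
canu -p hap2 -d hap2_asm genomeSize=3g -pacbio-hifi haplotype2_reads.fq unbinned_half2.fq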

zhenzhenyang-psu commented 3 years ago

Thanks. Just one more question: when you say "To keep coverage consistent, randomly split the unbinned reads into the two partitions", do you mean "put the unbinned reads into both haplotypes for assembly"?

skoren commented 3 years ago

I mean randomly split the reads in half, otherwise you'd expect to have double the coverage of those regions in each bin.
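One way to do that split, as a sketch: assuming a standard uncompressed 4-line-per-record FASTQ, send alternating records to the two halves. This is alternating rather than truly random, but since reads in a FASTQ are not ordered by genome position it gives an even split of coverage:

# Sketch: alternate 4-line FASTQ records between the two halves.
# Assumes unbinned.fq is a standard, uncompressed FASTQ; file names are placeholders.
awk '((NR-1)%8)<4 {print > "unbinned_half1.fq"; next} {print > "unbinned_half2.fq"}' unbinned.fq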

zhenzhenyang-psu commented 3 years ago

I see. But for trio-HiCanu, it would actually use the unbinned reads in both haplotypes for assembly, right? I guess the expectation for random splitting is that reads from the same region are split evenly across the two bins; however, there is a risk that reads from region 1 end up entirely in bin 1 and reads from region 2 entirely in bin 2. That could cause region 2 to fail to assemble, leaving a gap in bin 1, and vice versa. Could this be a problem?

skoren commented 3 years ago

If the binning worked correctly, the unbinned reads should be completely homozygous. In that case it should be fine to split them: each region would have 2x the expected coverage, and you'd be extremely unlikely to put all 2x solely into bin1 or bin2. We've run binned assemblies with split reads this way without issues before. The more likely scenario is that the binning is not fully correct, so you end up putting incompatible haplotypes into bin1 or bin2, leading to lower continuity than you'd get had you let HiCanu use all the data at once (but longer phase blocks).