vibansal / HapCUT2

software tools for haplotype assembly from sequence data
BSD 2-Clause "Simplified" License

Request for suggestion on using HapCUT2 for phasing my genome of interest #33

Closed amit4mchiba closed 4 years ago

amit4mchiba commented 6 years ago

Hi,

I am writing to seek your opinion on using HapCUT2 to phase the genome of a plant that I am interested in. The genome is about 400 Mb, with 11 chromosomes, and is diploid. I have 120x genome coverage from PacBio, 96x from Illumina, 150x from Bionano genome mapping (with a length cutoff of 150Bps), and 80x from a Hi-C library (ongoing). I have completed genome assemblies using Canu (complete genome in 243 contigs, N50 of 9.87 Mb) and FALCON-Unzip (complete genome in 193 contigs, N50 of 6.9 Mb). I would really like to phase my genome using HapCUT2 and wanted your suggestion on this.

I was wondering whether I should first complete the genome assembly: assemble with PacBio, polish with HiSeq, do a hybrid assembly with Bionano, and then scaffold with Hi-C, and only then use the final assembly for phasing. Or should I first get the PacBio assembly, polish it with HiSeq, map the HiSeq reads to that assembly and phase with HapCUT2, and then use the resulting haplotigs for the hybrid assembly with the Bionano data, followed by quality testing and scaffolding with the Hi-C data? I have read your method and am trying to understand it further. I am curious what you would suggest for this hybrid assembly approach, and I would really appreciate your advice.

dcopetti commented 6 years ago

Hello, I have a similar case: a diploid plant assembly (assembly size = 2n genome size, most BUSCO genes are in duplicate) that I would like to phase. I can produce Hi-C data, but so far I have not found a tool that will scaffold in a phase-aware manner. I was able to anchor 80% of the sequence to a close relative, but I could not go beyond assigning scaffolds to a region. Dovetail does not guarantee the phasing yet; I wonder if HapCUT2 can. Thanks for any input, Dario

pjedge commented 6 years ago

Hi, sorry for the late reply. I should preface this by saying that I really can't make definitive statements about best practices here, since I've never done anything along these lines (de novo genome assembly followed by HapCUT2 on the assembly). However, I can speculate about what I think may work, as a starting point.

In general, you should stay as close as possible to HapCUT2's intended use case: given reads mapped to a (highly accurate) reference genome, and a set of (highly accurate) diploid genotype calls for variants on that same reference, HapCUT2 will return phased variants. Beyond this specific use case, we can't confidently make statements about how HapCUT2 will perform. So, since a reference sequence is not available, I would recommend first using other tools and pipelines to obtain the most accurate haploid genome assembly that you possibly can, and then using that haploid assembly in place of the reference genome for haplotype assembly with HapCUT2 as a final step.
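To make "diploid genotyped variants" and "phased variants" concrete, here is a small sketch using VCF genotype notation, where `0/1` is an unphased heterozygous call and `0|1` or `1|0` is a phased one. The function name `classify_genotype` is made up for illustration and is not part of HapCUT2.

```python
def classify_genotype(gt: str) -> str:
    """Classify a diploid VCF GT string such as '0/1' or '1|0'."""
    if "|" in gt:
        alleles, phased = gt.split("|"), True   # '|' marks a phased genotype
    elif "/" in gt:
        alleles, phased = gt.split("/"), False  # '/' marks an unphased genotype
    else:
        return "missing or not diploid"
    if len(alleles) != 2 or "." in alleles:
        return "missing or not diploid"
    if alleles[0] == alleles[1]:
        return "homozygous"                     # nothing to phase at this site
    return "heterozygous, phased" if phased else "heterozygous, unphased"


for gt in ["0/1", "1/1", "0|1", "1|0", "./."]:
    print(gt, "->", classify_genotype(gt))
```

Only the heterozygous sites carry phase information, which is why the accuracy of the genotype calls matters so much for the input VCF.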

So in @amit4mchiba's case, I think that means doing the first suggested pipeline. Also, once you have obtained your "reference genome" (haploid assembly) and you are ready to use HapCUT2, you should remap the HiSeq reads to the "reference genome" and call variants against it. These will serve as the input VCF variants for HapCUT2. Then, use extractHairs on both the PacBio reads and the HiC reads (using that set of variants), combine those two sets of fragments, and use them as the input fragments for HapCUT2. The haplotype assembly should be highly accurate if the variants are called with HiSeq and the haplotype assembly is performed with HiC+PacBio fragments combined. Just note that this all depends on the haploid assembly also being highly accurate (and complete!).

As for @dcopetti, I can't say as much. Just note that HapCUT2 does NOT perform any sort of assembly scaffolding. If you decide to use HapCUT2 to phase an assembly, you should use it on a haploid assembly that is known to be highly accurate and complete, i.e. the haploid assembly should have negligible errors in its structure, with only the heterozygous variants and their phase unresolved. If this assumption holds, you should be able to follow a standard haplotyping pipeline, just using the haploid assembly (or half of the chromosomes of a diploid assembly) in place of the reference. So: remap the reads to the haploid assembly, call variants on the reads, extract the reads as fragments with extractHairs, and then input the VCF and fragments to HapCUT2 to get haplotypes.
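The fragment-extraction and phasing steps described above could be sketched roughly as follows. This only builds command strings and runs nothing. The executable names and flags (`--pacbio 1`, `--HiC 1`, `--ref`, `--bam`, `--VCF`, `--out`, `--fragments`, `--output`) are written from memory of the HapCUT2 README and should be treated as assumptions to check against the `--help` output of your installed version. Combining fragment files by plain concatenation is also an assumption, and all file names and the function name are hypothetical. Read mapping and variant calling are assumed to have been done already with tools of your choice.

```python
import shlex


def build_phasing_commands(ref, vcf, pacbio_bam, hic_bam, prefix="phasing"):
    """Return shell command strings for fragment extraction and phasing (dry run)."""
    pacbio_frags = f"{prefix}.pacbio.frags"
    hic_frags = f"{prefix}.hic.frags"
    combined = f"{prefix}.combined.frags"
    commands = [
        # PacBio fragments; flags are assumed, verify with `extractHAIRS --help`.
        shlex.join(["extractHAIRS", "--pacbio", "1", "--ref", ref,
                    "--bam", pacbio_bam, "--VCF", vcf, "--out", pacbio_frags]),
        # Hi-C fragments, extracted against the same variant set.
        shlex.join(["extractHAIRS", "--HiC", "1",
                    "--bam", hic_bam, "--VCF", vcf, "--out", hic_frags]),
        # Assumption: the two fragment files can be combined by concatenation.
        shlex.join(["cat", pacbio_frags, hic_frags]) + " > " + shlex.quote(combined),
        # Phase the combined fragments; any Hi-C-specific HAPCUT2 option for
        # mixed data is deliberately left out here and should be checked.
        shlex.join(["HAPCUT2", "--fragments", combined,
                    "--VCF", vcf, "--output", f"{prefix}.haplotypes"]),
    ]
    return commands


for cmd in build_phasing_commands("haploid_assembly.fa", "hiseq_variants.vcf",
                                  "pacbio.sorted.bam", "hic.sorted.bam"):
    print(cmd)
```

Printing the commands first makes it easy to review each step against the tool's documentation before running anything on real data.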

yilunhuangyue commented 6 years ago

Hi @pjedge, I have followed the pipeline you suggested for @amit4mchiba. You said "use extractHairs on both the PacBio reads and the HiC reads (using that set of variants)". I just want to make sure: are the variants you mention here the ones called from the HiSeq reads or from the PacBio reads? Thanks a lot.

orangeSi commented 6 years ago

Must the reference or assembly used with HapCUT2 be a haploid or a diploid assembly? I ask because assembly software usually outputs a diploid assembly.

For phasing a diploid assembly, this paper uses HaploMerger: https://www.nature.com/articles/s41477-018-0172-3

everestial commented 6 years ago

Hi there, I stumbled upon this post while working with HapCUT2. I had a haplotype phasing problem with my own samples, since there are no reference panels and not a lot of sequenced genotypes.

I have come up with a method for phasing heterogeneous and F1 hybrid genomes using a Markov chain. You can check it out as phaseIT, but the starting point of the analysis would be phaseRB or HapCUT2 itself.

After that, haplotype blocks can be phased using phaseExtender or phaseStitcher, depending on your problem.

These tools are mainly geared toward phasing emerging model systems and genomes that are heterogeneous (and produce large haplotype blocks). The model is based on a first-order Markov chain and phase state classification using the maxSum and/or maxProduct algorithm.

Let me know if you have any questions.

Happy Phasing !
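As a generic illustration of the first-order Markov chain idea mentioned in the comment above, the toy sketch below scores candidate state paths both as a product of probabilities and as a sum of log-probabilities, and shows that both pick the same path. It is not the implementation used by phaseExtender or phaseStitcher; the state labels, probabilities, and function name are invented for the example, and those tools may define maxSum and maxProduct differently.

```python
import math


def chain_scores(path, start, trans):
    """Score a state path under a first-order Markov chain.

    Returns (product of probabilities, sum of log-probabilities).
    """
    prob = start[path[0]]
    logp = math.log(start[path[0]])
    for prev, cur in zip(path, path[1:]):
        prob *= trans[prev][cur]           # each step depends only on the previous state
        logp += math.log(trans[prev][cur])
    return prob, logp


# Hypothetical two-state model: "A" and "B" stand for the two phase states.
start = {"A": 0.5, "B": 0.5}
trans = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}}

candidates = [("A", "A", "A"), ("A", "B", "A")]
best_by_product = max(candidates, key=lambda p: chain_scores(p, start, trans)[0])
best_by_sum = max(candidates, key=lambda p: chain_scores(p, start, trans)[1])
print(best_by_product, best_by_sum)
```

Because the logarithm is monotonic, ranking paths by the log-sum gives the same winner as ranking by the product, while avoiding numerical underflow on long chains.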

pjedge commented 6 years ago

Hi everyone, there is a new method for combining Hi-C and PacBio data into phased diploid assemblies. I would recommend that anyone trying to do this look into this method first, before trying HapCUT2: FALCON-phase paper