vibansal / HapCUT2

software tools for haplotype assembly from sequence data
BSD 2-Clause "Simplified" License
Obtaining Phased Haplotypes #111

Open psur9757 opened 3 years ago

psur9757 commented 3 years ago

Input: PacBio long reads, Hi-C, and Illumina short reads
Assembly: Canu v2.1.1, followed by purgeHaplotigs
Variants: FreeBayes

I processed the Hi-C and PacBio files as recommended in the HiC_longread recipe. My question is: what should I do next to get a phased haplotype FASTA file? How do I know which blocks belong together?

Thank you.

vibansal commented 3 years ago

You should be able to use bcftools consensus (http://samtools.github.io/bcftools/bcftools.html#consensus) to generate FASTA files for each haplotype. The output VCF file has an identifier for each phased variant specifying which block it belongs to.
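A minimal sketch of what that could look like, written as a small Python helper that only builds the command lines and does not run them. The file names are placeholders, and it assumes the phased VCF has already been bgzip-compressed and indexed, which `bcftools consensus` requires.

```python
import shlex

def consensus_commands(ref_fasta, phased_vcf_gz, out_prefix):
    """Build one `bcftools consensus` command per haplotype (nothing is executed)."""
    commands = []
    for hap in (1, 2):
        commands.append([
            "bcftools", "consensus",
            "-H", str(hap),                      # take the 1st or 2nd allele of each genotype
            "-f", ref_fasta,                     # reference FASTA the reads were mapped to
            "-o", f"{out_prefix}.hap{hap}.fa",   # one output FASTA per haplotype
            phased_vcf_gz,                       # bgzip-compressed, indexed VCF (placeholder name)
        ])
    return commands

# Placeholder file names, for illustration only.
for cmd in consensus_commands("assembly.fa", "hapcut2.phased.vcf.gz", "sample"):
    print(shlex.join(cmd))
```

Note that "haplotype 1" and "haplotype 2" are only consistent within a single phased block, so the two FASTA files are not chromosome-length haplotypes unless each contig is covered by one block.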

psur9757 commented 3 years ago

@vibansal I think I explained it wrong. Since each contig is processed in parallel, how does HapCUT2 know which blocks within a contig belong together?

Let's say the draft assembly has 4 contigs representing two copies of a chromosome. Since HapCUT2 analyses each contig in parallel, how does it know which blocks (of a contig) belong together? How does it provide a recipe to create the two copies correctly, especially in terms of the ordering of blocks in the chromosome? I understand the phasing bit, I think.

Sorry for the confusing question.

vibansal commented 3 years ago

HapCUT2 is designed to reconstruct haplotypes for a diploid genome using reads mapped to a haploid consensus. For each group of variants that can be linked together by the reads, it outputs two haplotype sequences at heterozygous variant sites. I don't completely understand your objective, but I don't think that HapCUT2 is designed to do that.
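To illustrate what "group of variants" means in practice, here is a rough sketch that collects phased variants into blocks. It assumes the block identifier is stored in the standard VCF `PS` (phase set) FORMAT field and that the VCF has a single sample; the example records are made up.

```python
from collections import defaultdict

def blocks_from_vcf_lines(lines):
    """Group phased variants of the first sample by (contig, PS)."""
    blocks = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, fmt, sample = fields[0], int(fields[1]), fields[8], fields[9]
        values = dict(zip(fmt.split(":"), sample.split(":")))
        gt = values.get("GT", "")
        ps = values.get("PS", ".")
        if "|" in gt and ps != ".":  # keep only phased genotypes with a block id
            blocks[(chrom, ps)].append((pos, gt))
    return dict(blocks)

# Made-up records: two blocks on contig1, one on contig2, one unphased site.
example = [
    "contig1\t100\t.\tA\tG\t.\tPASS\t.\tGT:PS\t0|1:100",
    "contig1\t250\t.\tC\tT\t.\tPASS\t.\tGT:PS\t1|0:100",
    "contig1\t9000\t.\tG\tA\t.\tPASS\t.\tGT:PS\t0|1:9000",
    "contig2\t40\t.\tT\tC\t.\tPASS\t.\tGT:PS\t1|0:40",
    "contig2\t77\t.\tA\tC\t.\tPASS\t.\tGT\t0/1",
]
for key, variants in blocks_from_vcf_lines(example).items():
    print(key, variants)
```

Each block is phased independently, so the left side of `0|1` in one block has no defined relationship to the left side in another block, whether on the same contig or a different one.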

arcadianlyric commented 3 years ago

> You should be able to use bcftools consensus (http://samtools.github.io/bcftools/bcftools.html#consensus) to generate FASTA files for each haplotype. The output VCF file has an identifier for each phased variant specifying which block it belongs to.

Hi, I tried to generate a consensus sequence from the VCF and noticed that some SNPs with genotype 1/2 in phased blocks are converted to allele 0 or 1 in the FASTA. It seems allele 2 is not included in the .vcf? Thank you.
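For reference, this is the behaviour I would expect under the VCF convention that allele index 0 is REF and indices 1, 2, ... refer to the comma-separated ALT alleles in order. It is only a sketch of the expected mapping (the function name is hypothetical), not HapCUT2 code.

```python
def haplotype_alleles(ref, alt, gt):
    """Return (hap1_allele, hap2_allele) for a phased diploid GT such as '1|2'."""
    alleles = [ref] + alt.split(",")   # index 0 = REF, 1.. = ALT alleles in order
    first, second = gt.split("|")
    return alleles[int(first)], alleles[int(second)]

# A site with REF=A and ALT=C,T phased as 1|2 should give C on one haplotype and T on the other.
print(haplotype_alleles("A", "C,T", "1|2"))  # ('C', 'T')
```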

vibansal commented 3 years ago

Thank you for reporting this; we will fix it soon.