vibansal / HapCUT2

software tools for haplotype assembly from sequence data
BSD 2-Clause "Simplified" License
205 stars 36 forks source link

Running HAPCUT2 after Canu #52

Closed psur9757 closed 4 years ago

psur9757 commented 6 years ago

I have created a collapsed assembly using Canu (https://github.com/marbl/canu) and now would like to run HAPCUT2 for phasing. Is there a recipe or code for this. Thank you.

From Canu Smash haplotypes together and then do phasing using another approach (like HapCUT2 or whatshap or others).

Best Regards, Priyanka

pjedge commented 6 years ago

Hi Priyanka,

we don't officially support the use case of phasing assemblies (only because I have not seen published works demonstrating that it is accurate). But, it seems like it should work in theory, and other users seem to have done it. What sort of data do you have available?

psur9757 commented 6 years ago

Hi Peter,

We have PacBio reads with about 160x coverage. I have created a de novo assembly using Canu. Since the authors of Canu recommended it, I thought this must be done. If you could provide some guidance, it would be most helpful. Thank you.

Best, Priyanka

Priyanka Surana | Postdoctoral Research Associate Plant Breeding Institute | The University of Sydney Room A2-4, 107 Cobbitty Rd, Cobbitty, NSW 2570 +61 2 9351 8825 (o) | +61 481 785 312 (m)

From: Peter Edge notifications@github.com Reply-To: vibansal/HapCUT2 reply@reply.github.com Date: Wednesday, 12 September 2018 at 4:45 am To: vibansal/HapCUT2 HapCUT2@noreply.github.com Cc: Priyanka Surana priyanka.surana@sydney.edu.au, Author author@noreply.github.com Subject: Re: [vibansal/HapCUT2] Running HAPCUT2 after Canu (#52)

Hi Priyanka,

we don't officially support the use case of phasing assemblies (only because I have not seen published works demonstrating that it is accurate). But, it seems like it should work in theory, and other users seem to have done it. What sort of data do you have available?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://protect-au.mimecast.com/s/4hTuCoVzGQi4P51Xi1eOzM?domain=github.com, or mute the threadhttps://protect-au.mimecast.com/s/WhYCCp8AJQtDOqJzcDIRwU?domain=github.com.

pjedge commented 6 years ago

Hi Priyanka, The biggest challenge in this case will be determining the heterozygous sites and accurately genotyping them. Normally, with the currently available techniques, this requires short reads.

However, the general technique would look like this:

  1. convert your assembly into fasta format (it should be haploid / squashed ploidy)
  2. align the PacBio reads to the fasta-format assembly (BLASR/minimap2).
  3. obtain an accurate set of heterozygous variants in VCF format (if you had short reads, you would align the short reads to the assembly and then call variants with, for example, GATK)
  4. proceed as normal with extractHairs + HapCUT (use --pacbio 1 option when running extractHairs)

Step 3 will be a challenge for you. Regardless of coverage, the genotypes at heterozygous sites will be extremely inaccurate using PacBio reads becuase there aren't very good methods for this.

psur9757 commented 6 years ago

Hi Peter,

I can get Illumina reads for both DNA and RNA as well. The haplophasing is important to us so let me know what is the ideal method and I will get the required data.

Best, Priyanka

pjedge commented 6 years ago

Hi Priyanka, If you have WGS (genomic) short reads at reasonable coverage (30x minimum) I would expect the steps I described to work. Be sure to filter the variants (step 3) using best practices so that the VCF being used with extractHairs/HapCUT2 only contains highly accurate variants that you're confident are real. Note that in regions where the assembly is inaccurate, or the reads are mapped inaccurately, the haplotypes will also be bad. And a final note: you'll undoubtedly find heterozygous short indels along with SNVs, using the short reads. Be aware that if you try to phase these (with the --indels 1 option in extractHAIRS, along with the --pacbio 1 option), it can be done but the accuracy of the very small indels (1-2 bp) may not be very good. These small indels are hard to allelotype on long reads, which then hurts phasing accuracy. However, with such high coverage it may be alright.

psur9757 commented 6 years ago

Hi Peter,

My assembly was created using PacBio sequences, so in step 2, I am re-aligning these same PacBio sequences to my assembly using minimap2. I just want to make sure, I am not misunderstanding you.

Best, Priyanka

pjedge commented 6 years ago

Correct. The idea is that you're using the haploid assembly as the "reference genome".

wjidea commented 6 years ago
  1. convert your assembly into fasta format (it should be haploid / squashed ploidy)
  2. align the PacBio reads to the fasta-format assembly (BLASR/minimap2).
  3. obtain an accurate set of heterozygous variants in VCF format (if you had short reads, you would align the short reads to the assembly and then call variants with, for example, GATK)
  4. proceed as normal with extractHairs + HapCUT (use --pacbio 1 option when running extractHairs)

In terms of step 3, does it make sense to polish the assembly with pilon first, then call variants using GATK or Freebayes?

Jie

pjedge commented 5 years ago

Probably. You should take any steps to make sure the assembly is as accurate of a haploid assembly as possible. I can't say for certain since I've never done this. If anyone has done this with good success, please share here.