nanoporetech / medaka

Sequence correction provided by ONT Research
https://nanoporetech.com
Other
404 stars 74 forks source link

How to obtain haplotypes from single target locus long read sequencing #221

Closed saraheger closed 3 years ago

saraheger commented 3 years ago

Hi,

We would like to ask advise on what are the appropriate steps to phase and obtain haplotypes from a single target locus using ONT long read sequencing.

To provide context, we have genetic samples from a set of subjects for a targeted functional validation effort. Specifically, we want to understand how haplotypes across a 40kb genetic region are associated with a particular phenotype. We really want accurate phase information across the entire region; this is why we used ONT long read sequencing with a CRISPR-CAS9 brick-layer tiling approach (which provides overlapping reads that vary in size between 7 to 12kb). Besides off-target reads, there is no sequencing of other genomic areas in our experiment, we only care about the target locus.

So far, we’ve been able to create a draft assembly through pomoxis mini-assemble, run medaka_consensus, and run medaka_variant. However, we aren’t clear on whether we need to combine these different steps or if we can just use only medaka_variant. For medaka_variant, we are also not exactly sure on the settings/output. I list below a series of questions:

1) Could you advise us on a series of steps that would be appropriate for obtaining accurate haplotypes across an entire 40kb region of interest (cf. paragraph above)? This is our most pressing question as it likely would provide a lot of resolve to other questions listed below.

2) Is it correct that we should focus on medaka_variant to obtain phased haplotypes, given that the source code seems to incorporate whatshap phasing?

3) Should we polish our reads with a draft assembly using medaka consensus before running medaka_variant? In one attempt, we created a draft assembly with pomoxis and aligned reads from the raw .fastq to the draft assembly, but the output .bam file could not be used in medaka_variant. Instead the .bam file needed first to be converted back in to .fastq and aligned to the hg38 reference. Is this the expected pipeline, or should we consider a different approach?

4) In another attempt, we directly aligned raw .fastq data to the hg38 reference. To be clear, the hg38 reference is a .fasta file that contains just the genomic sequence of our 40kb target region. Is this the expected pipeline, or should we consider a different approach?

5) The output .vcf file from medaka_variant appears to indicate variant positions with regard to their position in the target area range (e.g. position 1000 in the locus comprising 40kb). Is this expected behavior, and if not, how can we resolve this?

6) Further, the output .vcf file appears to contain only variants that actually displayed variation in the sample (WT were not present). Is this expected behavior, and if not, how can we resolve this?

7) Lastly, how can we assess the quality of phasing for the haplotypes produced through medaka_variant? Your input to these questions would be highly appreciated. We are not experts in phasing or genome assembly, so we thank you in anticipation for your patience with our questions.

Thank you.

cjw85 commented 3 years ago

As you suggest I will answer what I thing is you main question, what would we recommend for obtaining haplotypes for your 40kb region?

You should use medaka_variant for this, providing it with the full human reference (not just your region of interest). It has an optional step to phase the final variants that are output. With the phased VCF you can use bcftools to apply the variants to the reference sequence to obtain a haplotype sequences.

Now to tidy up some remains points:

  1. This has happened because you provided only the region of interest. medaka has no way of knowing how this region relates to the full genome.
  2. Yes, medaka outputs records to the VCF file where a difference to the reference was detected.
  3. In the absence of a truth set, you will want to look at some of the phasing tools like whatshap and hapcut2 which can enumerate details of how they have arrived at the answer. More simply you may wish to directly analyse alignments of your reads to the two haplotypes. Given that your reads are 12kb and your target region only 40kb I would expect phasing to be a simple task.
saraheger commented 3 years ago

Hi, thank you for your response.

We have run medaka_variant on the full human reference and obtained the 2 haplotype fasta files using bcftools. Our follow-up questions are:

1.We are still not exactly clear how to integrate read polishing into the variant calling pipeline. We see from our reads that they still display a lot of insertions/deletions after running medaka_variant; also sometimes genotype calls appear slightly ambiguous despite reasonable coverage (x40-50). We would think both observation may indicate that reads need to be polished? We’re not sure how to implement that with the medaka tools

  1. Can we do reference-free variant calling with medaka and will it give us higher genotype quality/phased calls? a. We have tried running medaka_consensus, aligning the reads to the consensus.fasta and then running medaka_variant with the consensus.fasta as the reference and the input as the aligned reads. However, we are unsure if this is reasonable and of how to align the output with hg38.

  2. Could you elaborate more on how to “analyse alignments of your reads to the two haplotypes”? Since medaka uses whatshap, we would like to know how to use whatshap to do this.

cjw85 commented 3 years ago

Read polishing before variant calling is not something we have ever attempted, the necessary read-to-read alignment is more error prone than read-to-reference alignment.

I'm not sure what you mean by reference-free variant calling. There are tools in medaka to create a VCF by alignment of a single assembly sequence to a reference sequence, which are used in the medaka_haploid_variant workflow. These will perhaps yield slightly different results. @mwykes can detail their use.

I cannot comment in any great detail how to analyse the alignments of your reads to your reconstructed haplotypes; only that the alignments should be congruent with the haplotypes and any differences in the two haplotypes should be reflected in the reads.