kishwarshafin / pepper

PEPPER-Margin-DeepVariant

Phasing into a single phaseset across a targeted region of interest #172

Closed: smf96 closed this 1 year ago

smf96 commented 2 years ago

Hello,

I am interested in a 200 kb region derived from a segmental duplication that includes two paralogous blocks with 98% identity. I have been targeting the region for nanopore sequencing with both Cas9-mediated enrichment and adaptive sampling. I would love some help with parameterisation: sometimes PEPPER-Margin-DeepVariant is able to phase across the whole region, but sometimes it is split into many, many phasesets.

For example, here is the output from a single adaptive sampling run, where it phases into a 198 kb phaseset:

[screenshot: adaptive sampling run phased into a single 198 kb phaseset]

However, in the output from a run with Cas9-targeted reads, I end up with 58 phasesets, with large portions not covered at all:

[screenshot: Cas9-targeted run split into 58 phasesets, with coverage gaps]

This is after I changed some of the options slightly, as below; with the defaults it splits into 68 phasesets.

singularity exec --bind /usr/lib/locale/,/mainfs/cansci/ \
/mainfs/cansci/pepper_deepvariant_r0.8.sif \
run_pepper_margin_deepvariant call_variant \
-b "${BAM}" \
-f "${REF}" \
-o "${OUTPUT_DIR2}" \
-p "${OUTPUT_PREFIX2}" \
-t ${THREADS} \
-r chr1:161440000-161700000 \
--phased_output \
--pepper_min_mapq 10 \
--pepper_min_snp_baseq 5 \
--pepper_min_indel_baseq 5 \
--pepper_snp_frequency 0.2 \
--pepper_insert_frequency 0.2 \
--pepper_delete_frequency 0.2 \
--pepper_min_coverage_threshold 10 \
--pepper_candidate_support_threshold 3 \
--pepper_snp_candidate_frequency_threshold 0.2 \
--pepper_indel_candidate_frequency_threshold 0.2 \
--ont_r9_guppy5_sup

Is there an explanation for this? Obviously the distribution of coverage differs between the Cas9 and adaptive sampling runs, but the Cas9 runs have much more depth. These are just two examples where the library prep types are separate; for most of my samples I have combined data from Cas9 and adaptive sampling runs. When I use nanopolish + whatshap on the Cas9 reads, they can be phased into a single 198 kb haploblock.
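For reference, the whatshap step of that comparison is essentially the following (a sketch; the file names are placeholders and the input VCF is the one produced by nanopolish):

# phase nanopolish calls with whatshap, then summarise the resulting blocks
whatshap phase \
  -o cas9.phased.vcf \
  --reference "${REF}" \
  nanopolish_calls.vcf \
  "${BAM}"
whatshap stats --block-list cas9.blocks.tsv cas9.phased.vcf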

An additional question I had is about how variants are assigned as heterozygous or homozygous. In the second screenshot you can see that many of the homozygous variants (light blue?) sit above positions which, from the colours in the raw read coverage track, look more like heterozygous sites.

Any help with optimising the parameters would be immensely appreciated.

Cheers!

Sarah

smf96 commented 2 years ago

Hello again,

Not sure if this helps, but I ran PEPPER-Margin-DeepVariant on the same Cas9-targeted data after removing some of the shorter reads, and it changed the number of phasesets that were generated...

reads <5 kb removed: 70 phasesets
reads <10 kb removed: 15 phasesets
reads <20 kb removed: 12 phasesets
reads <30 kb removed: 77 phasesets
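For anyone reproducing this kind of length filter, one way to do it (a sketch, assuming samtools >= 1.12 for the -e filter expressions; the 10 kb cutoff is illustrative):

# keep only reads of at least 10 kb, then index for PEPPER-Margin-DeepVariant
samtools view -b -e 'length(seq) >= 10000' -o filtered.bam "${BAM}"
samtools index filtered.bam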

Any idea how I can reduce this even further?

Thank you!

kishwarshafin commented 2 years ago

Hi @smf96 ,

You can possibly use the --margin_phase_model option described here: https://github.com/kishwarshafin/pepper/blob/r0.8/docs/usage/usage_and_parameters.md and supply a custom model by modifying https://github.com/UCSC-nanopore-cgl/margin/blob/master/params/phase/allParams.haplotag.ont-r94g507.hp.json, which is used during phasing.

I think it's much easier if you take the BAM and VCF and run margin on them independently with parameter changes; that will be much faster for iterating on the phasing to see if things are improving. You can run your current command with --dry, extract the final margin command that phases the VCF at the end, and run only that part with changed parameters.
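As a sketch of that workflow (take the exact margin invocation from the --dry output rather than from here; the positional-argument order below follows margin's usage string, but the flags and file names are assumptions):

# 1) print the pipeline's commands without executing them
#    (... stands for the flags from your existing command)
run_pepper_margin_deepvariant call_variant ... --dry

# 2) re-run only the phasing step with a custom params file (sketch)
margin phase \
  reads.bam ref.fasta calls.vcf.gz \
  custom_phase_params.json \
  -t ${THREADS} -o rephased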

smf96 commented 2 years ago

Hello Kishwar,

Thanks very much for your response,

I'm more than happy to do either and see if phasing improves; I'm just a bit stuck on where to begin with it all. Do you have any insight as to why the phasing with margin is so unsuccessful for the Cas9 data?

As in, should I be playing with the parameters that exclude reads/variants from the high-quality dataset (quality of SNVs/indels), or with the parameters that determine whether a phaseset is questionable or should be extended? What are the defaults for the parameters:

phase.phasesetMinBinomialReadSplitLikelihood
phase.phasesetMaxDiscordantRatio
phase.phasesetMinSpanningReads

and do you think changing them could help my situation?
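If it helps to be concrete, the kind of override I'm imagining is something like this (a sketch that assumes the params file nests these keys under a top-level "phase" object, as the dotted names suggest; the values are placeholders):

# copy margin's phase params with looser phaseset thresholds (sketch)
jq '.phase.phasesetMinSpanningReads = 2
    | .phase.phasesetMaxDiscordantRatio = 0.9' \
  allParams.haplotag.ont-r94g507.hp.json > custom_phase_params.json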

Many thanks!

kishwarshafin commented 2 years ago

@smf96 ,

One reason could be the quality of the predicted variants. Is it possible for you to plot the QUAL and GQ values that you get in the VCF files? Maybe a lot of variants are low quality and margin is throwing them away. The mapping quality and the reads' base qualities could also cause margin to filter reads. You can drop all of those parameters to their lowest settings to see if that helps.
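A quick way to pull those values out for plotting (a sketch; the input file name is a placeholder):

# extract per-site QUAL and per-sample GQ into a TSV for plotting
bcftools query -f '%CHROM\t%POS\t%QUAL\t[%GQ]\n' calls.vcf.gz > qual_gq.tsv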

smf96 commented 2 years ago

Good morning Kishwar,

Isn't the quality of the predicted variants what I was trying to improve with the PEPPER parameters in my first post? Increasing their stringency only reduced the number of phasesets from 68 to 58, which is still a huge number compared to the single haploblock I can achieve with the default parameters in nanopolish + whatshap.

I can plot the QUAL and GQ values, but exactly which VCF files are you talking about? One of PEPPER_VARIANT_FULL.vcf.gz, PEPPER_VARIANT_OUTPUT_PEPPER.vcf.gz, or PEPPER_VARIANT_OUTPUT_VARIANT_CALLING.vcf.gz?

Below are screenshots from the HTML reports for the runs on reads >10 kb and >20 kb:

[screenshot: report for reads >10 kb]

[screenshot: report for reads >20 kb]

In both there are lots of variants with quality scores near zero... but are those from the final X.phased.vcf.gz, so that they also include the refCalls (homozygous for the reference)?
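One way I could check would be to exclude non-PASS records before plotting (a sketch, assuming refCalls carry a non-PASS FILTER value such as RefCall):

# restrict the QUAL distribution to PASS records, dropping refCalls
bcftools view -f PASS X.phased.vcf.gz | bcftools query -f '%QUAL\n' > pass_qual.txt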

What are the lowest settings for mapping quality and base quality that you would recommend dropping to?

kishwarshafin commented 1 year ago

Hi @smf96 ,

Sorry I missed your reply in this one. Given that it's been a long time, I am closing this issue. Please feel free to reopen if you still need help.