odelaneau / shapeit5

Segmented HAPlotype Estimation and Imputation Tool
https://odelaneau.github.io/shapeit5/
MIT License
56 stars 9 forks source link

Read-aware Phasing #57

Closed hangsuUNC closed 8 months ago

hangsuUNC commented 11 months ago

Hi,

Thanks for this wonderful tool! I saw there is a Read-aware phasing mode in Shapeit with a read-based phasing script and import the read phasing results to the statistical phasing analysis: https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#readaware .

I'm wondering if there is a way to integrate the output of other the existing read-based phasing methods, such as whatshap, to the Shapeit5 analysis? If yes, could you please pointing to the right direction?

Thanks,

Hang

hangsuUNC commented 11 months ago

Following this questions, I take multi-sample whatshap read-phased vcf as 1) inputs for Shapeit5 2) scaffolds using --scaffold parameter. The final results of these two choices are similar but not identical, but do have a number of differences from not using read-phased information. Could anyone provide some guidelines for these types of analysis? Thanks!

JosephLalli commented 11 months ago

In past answers, @odelaneau has suggested using Shapeit4 during common variant phasing to incorporate Whatshap read phasing. It is slower, but should produce comparable results to Shapeit5 common.

Note that Shapeit5_rare is only designed to phase rare variants in a large panel onto previously phased common variants. If I understand correctly, all samples in the scaffold must be present in the panel to be phased, and all variants in a scaffold must be phased.

Until these restrictions are loosened, my best idea for using read-aware phasing and Shapeit5's rare phasing algorithm is: 1) Perform read-aware phasing using Shapeit4 a) Optionally, first perform read-aware phasing with a reference panel using Shapeit4 b) Then, once again phase whatshap read-phased variant with Shapeit4, using the product of 1a as a scaffold. 2) Use bcftools view to drop all variants below the rare variant MAF cutoff frequency. 3) Perform rare phasing using Shapeit5, using step 1's product as the scaffold. a) Optionally, first merge your callset and scaffold with a phased reference panel to allow Shapeit5 to incorporate information regarding tracts of identity by descent in the ref panel. 4) Write a script to merge step 3's product with step 1's product. One sensible approach would be to take the GTs with a phase set from the Shapeit4 output, and the GTs without a phase set from the Shapeit5 output, but there are likely other ways to do it. a) A script like this is on my personal to-do-list, but I keep putting it off hoping Oliver will make this exercise moot in a future update to shapeit5's rare variant algorithm by allowing for unknown or partially phased variants in the scaffold. And if there's already a way to do that which I don't know about, please let me know @odelaneau, @srubinacci, or @RJHFMSTR!

hangsuUNC commented 11 months ago

Hi Joseph,

Thanks for your reply! This is super helpful.

A few questions from your reply:

  1. When you say "perform read-aware phasing with a reference panel using Shapeit4" in 1a, what's the input and output?
  2. What's the difference between Shapeit5 phase_common and Shapeit4? Is the read-phased mode compatible for Shapeit5 phase_common?

Just make sure I understand this clearly: Step 1: a). Use unphased vcf as input for Shapeit5 phase-common, create a scaffold bcf. b). perform whatshap read phasing using --tag=PS, take this phased.vcf as input and the output from 1a as scaffold for Shapeit4. shapeit4 --input unphased.bcf --map chr20.b37.gmap.gz --region 20 --use-PS 0.0001 --output phased.bcf

Can I use shapeit5 phase_common for this step?

Step2: Filtering variants Step 3: Shapeit5 phase_rare using 1's output as scaffold Step4: Merge results

Thanks a lot for your inputs!

Best,

Hang

JosephLalli commented 10 months ago

Before I begin, a few caveats: I have not written shapeit5, and I'm not involved in the project. I have just been using it a lot in my work (small/medium scale human genome phasing, with an eye towards sex-specific differences in eQTLs).

A few questions from your reply:

When you say "perform read-aware phasing with a reference panel using Shapeit4" in 1a, what's the input and output?

Input: 1) The output of whatshap - a vcf file with your variant calls, partially phased using your sample's reads as provided in a bam/cram file. 2) A vcf of known phased genomes (I just use the phased 1000 Genomes variant calls to keep it simple). 3) The reference genome used to produce 1/2/3. Note that best practice is to use bcftool norm to split multiallelic variants in both VCF files into biallelic variants. I'd point you to Shapeit4's instructions page for more information on how to use Shapeit4, and the shapeit4 paper for more information on how Shapeit4 defines and uses scaffolds vs reference panels vs read groups. What's the difference between Shapeit5 phase_common and Shapeit4? Is the read-phased mode compatible for Shapeit5 phase_common? My understanding is that SHAPEIT5 is a stripped-down, highly optimized version of Shapeit4. The underlying algorithm is the same, but some features of Shapeit4 were removed to allow Shapeit5 to run on hundreds of thousands of samples in a reasonable amount of time. Shapeit4 and shapeit5 should produce extremely similar results when run with the same settings, but shapeit4 is able to incorporate additional sources of information regarding your variants. Most importantly for the purposes of this reply, Shapeit5 now discards all information related to whatshap-created phase sets. This is an intentional change, and Dr. Delaneau has previously said there are no plans to add this feature into shapeit5.

Just make sure I understand this clearly: Step 1: a). Use unphased vcf as input for Shapeit5 phase-common, create a scaffold bcf. b). perform whatshap read phasing using --tag=PS, take this phased.vcf as input and the output from 1a as scaffold for Shapeit4. shapeit4 --input unphased.bcf --map chr20.b37.gmap.gz --region 20 --use-PS 0.0001 --output phased.bcf

Can I use shapeit5 phase_common for this step? Sorry, I wasn't very clear before. Flip steps 1a) and 1b). 1a) First, use whatshap to read-phase your unphased variant calls. 1b) Optional 1b, phasing with a reference panel. Variants in your dataset that are not present in the reference panel will not be reported, so the output will only contain a subset of all variants called. Next, use Shapeit4 with the following settings: --input my_unphased_variants.bcf --map my.map.gz --region my_region --use-PS 0.0001 --sequencing --reference /my/phased/panel/of/variants.bcf --output panel_and_read_phased_variants.bcf 2) Use shapeit4 to phase common and rare variants that are not present in the reference panel onto the dataset generated in 1b. shapeit4 --input my_unphased_variants.bcf --map my.map.gz --use-PS 0.0001 --sequencing --scaffold panel_and_read_phased_variants.bcf --output shapeit4_all_variants_phased_read_informed.bcf 3) note this is a slight change from before to fix something I just noticed Filter out singletons and rare variants from shapeit4_all_variants_phased_read_informed.bcf. bcftools view --min-af 0.001:nonmajor --min-ac 2 shapeit4_all_variants_phased_read_informed.bcf > shapeit4_common_variants_read_informed.bcf 4) Use shapeit5_rare to phase rare variants and singletons onto the read-informed shapeit4 common variant scaffold using shapeit5's superior rare variant/singleton phasing approach: shapeit5_rare --input my_unphased_variants.bcf --map my.map.gz --scaffold shapeit4_common_variants_read_informed.bcf --input-region $chrom:$start-$end --scaffold-region $chrom:$start-$end --output shapeit5_all_variants_phased.bcf 5) Finally, for maximum accuracy, write a script to iterate through the rare sites in shapeit5_all_variants_phased.bcf. Compare the rare genotype calls to shapeit4_all_variants_phased_read_informed.bcf. If the two files disagree on the heterozygous site's phase, use shapeit4's output if whatshap had assigned the genotype to a phase set. Otherwise, use shapeit5's output.

I should also mention that this is a very hacky method, it is not using the software as intended, it is overly complicated for most uses, and is designed to squeeze as much accuracy as possible out of your phasing without regard to the time or computational resources required. But, it seems to work in my hands. I do not endorse this method without significant further debugging/analysis.

hangsuUNC commented 10 months ago

Hi Joseph,

Thank you so much for your reply and this detailed instruction! I really appreciate your kindness and help!

I'll try this on my data and hope to talk with you more in the near future.

Hang