How to get validated and switched information for each variant?

koido commented 11 months ago

Hi Rick,

I'm sorry for interrupting you again. I have one question.

In the Fig 2' legend in your preprint, you showed that:

singletons that have been rephased by the SAPPHIRE method (43% of all singletons)

The rephased variants are validated and switched variants, right? Can we get this information for each variant from output binary files or log files?

Many thank.

Best, Masaru

rwk-unil commented 11 months ago

Hello, You are right, the rephased variants are either validated by the reads (unchanged) or the reads show they are flipped so they are switched e.g., 0|1 becomes 1|0.

The output log files don't provide this information, but it can be extracted from the binary files or the final BCF.

You can use the bin_diff tool with a command like :

./bin_diff --vcf <bcf generated from pp_extract> --samples <samples file generated from pp_extract> --binary1 <original binary generated by pp_extract> --binary2 <rephased binary from phase_caller> --more --extra-info --ac --ac-threshold 1

This will output in csv format

Sample name, PP, Switched, Num PIR, is SNP, AC, GT, VCF line
HGXXX,0.5,0,0,1,1,1|0,chr21   9867672 chr21:9867672:SG        T       G       .       .       AC=1
...

The --ac-threshold <N> option will do that only AC <= N are output. Switched is if switched or not. Num PIR is number of phase informative reads, if 0, it has not been validated/switched by reads.

Let me know if you have any issue with this or if this answers your question.

Edit : Then to check the % of all singletons you have to compare the singletons with num PIR > 0 with all the singletons in the original BCF file (you can count them with a filter on AC=1).

Best regards, Rick

koido commented 11 months ago

Hi Rick,

Thank you very much.

It worked when I used --vcf-file instead of --vcf. (My version is de7f952) Anyway, I'm very surprised at the very high speed! Thanks again.

Can I ask additional question related to this information? In the phase_caller analysis, we sometimes encounter a message like "No reads mapped to chr19:XXXXXX". This message may be raised because some genotypes were initially missing (no mapped read) but imputed in the phasing steps in SHAPEIT5. Do such variants not appear in the output from bin_diff?

Best, Masaru

rwk-unil commented 11 months ago

Hello, you are right it is --vcf-file (or -f), I wrote it wrong in the example command, the help prompt of the program shows the right option.

The "No reads mapped to chr19:XXXXXX" message happens when the pile-up could not link reads to the SNP (no reads for the alt allele, no reads for the ref allele), this can happen for several reasons.

The SNP might be imputed but not there in the sequencing data. (but this should still give sequencing reads for the reference SNP, unless there are no reads mapped at that location).
The phase caller uses a very strict filter for the read quality so it might be that the reads at this region simply didn't pass, this might happen for complex regions (where there are structural variations, e.g., near the edge of a large insertion/deletion/inversion/duplication). It might also be that there are no reads mapped to that region (also because the sample differs too much from the reference at that region).

The pile-up filter function is defined here : https://github.com/rwk-unil/pp/blob/main/phase_caller/phase_caller.cpp#L351 It is very strict, it will not use reads that are of low quality (might depend on the read alignment program settings that generated the CRAM), will not use reads where one in the read pair is unmapped (e.g., at the edge of a large de novo insertion), it will not use reads that are both on the same strand (e.g., at the edge of an inversion) etc.

This was chosen as to not degrade the quality of the phasing with poor quality sequencing data.

If you want to investigate I'd recommend the following : Open the CRAM with IGV (https://igv.org/doc/desktop/) and enter the location chr19:XXXXXXXX and have a look. This will show exactly if there are reads there, if they carry the alt SNP, and if they are well mapped etc.

I know it is not possible to manually check for all variants but this will show you exactly what is happening at that specific location. You can even script IGV to automatically take screenshots of all the regions from the logs. https://igv.org/doc/desktop/#UserGuide/tools/batch/ If you write a script for IGV that creates a screenshot for every variant where there is something strange, you have hard evidence about what is going on there (also helps for debugging).

Something like :

new
genome hg38 1k/GATK
load /<path to>/cram_file.cram
viewaspairs
snapshotDirectory ~/snap
goto chr1:108186076-108188076
snapshot
goto chr1:108187479-108189479
snapshot
goto chr1:108187875-108189875
snapshot

(In the script the locations are -1000bp +1000bp around the variant loci, you can choose other ranges to zoom in or out).

The SAPPHIRE phase_caller is very strict and will not try to resolve complex or ambiguous cases, that was a necessary choice to make it very fast, as it was made to rephrase the 200k of the UKB and next the 500k. If when you check the file in IGV and the reads seem well mapped and the SNP is there, check the MAPQ / BASEQ (click on the reads). phase_caller only looks at MAPQ >= 50 and BASEQ >= 30, if these values are too high for your sequencing data let me know, I will add them as options to phase_caller so you can choose other values.

Do such variants not appear in the output from bin_diff?

They should because they are in the binary file (otherwise phase_caller would not have looked at them). They will appear as not switched and will have num PIR at zero. You can search the output for the location on the chromosome as it should be printed.

I appreciate the feedback, your input is very valuable for future upgrades to the tools.

Edit1 : typo Edit2 : If you want to see all variants with bin_diff remove the --ac-threshold 1 option, which will only show the singletons.

Best regards, Rick

koido commented 11 months ago

Hi Rick,

Many thanks for the detailed, insightful comments. All of my questions have been resolved! I will continue to enjoy your nice tool.

I appreciate your hard work.

Best, Masaru

rwk-unil / sapphire

How to get validated and switched information for each variant? #2