Closed koido closed 11 months ago
Hello,
You are right, the rephased variants are either validated by the reads (unchanged) or the reads show they are flipped so they are switched e.g., 0|1
becomes 1|0
.
The output log files don't provide this information, but it can be extracted from the binary files or the final BCF.
You can use the bin_diff
tool with a command like :
./bin_diff --vcf <bcf generated from pp_extract> --samples <samples file generated from pp_extract> --binary1 <original binary generated by pp_extract> --binary2 <rephased binary from phase_caller> --more --extra-info --ac --ac-threshold 1
This will output in csv format
Sample name, PP, Switched, Num PIR, is SNP, AC, GT, VCF line
HGXXX,0.5,0,0,1,1,1|0,chr21 9867672 chr21:9867672:SG T G . . AC=1
...
The --ac-threshold <N>
option will do that only AC <= N
are output.
Switched
is if switched or not.
Num PIR
is number of phase informative reads, if 0, it has not been validated/switched by reads.
Let me know if you have any issue with this or if this answers your question.
Edit : Then to check the % of all singletons you have to compare the singletons with num PIR > 0
with all the singletons in the original BCF file (you can count them with a filter on AC=1
).
Best regards, Rick
Hi Rick,
Thank you very much.
It worked when I used --vcf-file
instead of --vcf
. (My version is de7f952)
Anyway, I'm very surprised at the very high speed! Thanks again.
Can I ask additional question related to this information?
In the phase_caller
analysis, we sometimes encounter a message like "No reads mapped to chr19:XXXXXX".
This message may be raised because some genotypes were initially missing (no mapped read) but imputed in the phasing steps in SHAPEIT5.
Do such variants not appear in the output from bin_diff
?
Best, Masaru
Hello,
you are right it is --vcf-file
(or -f
), I wrote it wrong in the example command, the help prompt of the program shows the right option.
The "No reads mapped to chr19:XXXXXX" message happens when the pile-up could not link reads to the SNP (no reads for the alt allele, no reads for the ref allele), this can happen for several reasons.
The pile-up filter function is defined here : https://github.com/rwk-unil/pp/blob/main/phase_caller/phase_caller.cpp#L351 It is very strict, it will not use reads that are of low quality (might depend on the read alignment program settings that generated the CRAM), will not use reads where one in the read pair is unmapped (e.g., at the edge of a large de novo insertion), it will not use reads that are both on the same strand (e.g., at the edge of an inversion) etc.
This was chosen as to not degrade the quality of the phasing with poor quality sequencing data.
If you want to investigate I'd recommend the following : Open the CRAM with IGV (https://igv.org/doc/desktop/) and enter the location chr19:XXXXXXXX and have a look. This will show exactly if there are reads there, if they carry the alt SNP, and if they are well mapped etc.
I know it is not possible to manually check for all variants but this will show you exactly what is happening at that specific location. You can even script IGV to automatically take screenshots of all the regions from the logs. https://igv.org/doc/desktop/#UserGuide/tools/batch/ If you write a script for IGV that creates a screenshot for every variant where there is something strange, you have hard evidence about what is going on there (also helps for debugging).
Something like :
new
genome hg38 1k/GATK
load /<path to>/cram_file.cram
viewaspairs
snapshotDirectory ~/snap
goto chr1:108186076-108188076
snapshot
goto chr1:108187479-108189479
snapshot
goto chr1:108187875-108189875
snapshot
(In the script the locations are -1000bp +1000bp around the variant loci, you can choose other ranges to zoom in or out).
The SAPPHIRE phase_caller
is very strict and will not try to resolve complex or ambiguous cases, that was a necessary choice to make it very fast, as it was made to rephrase the 200k of the UKB and next the 500k. If when you check the file in IGV and the reads seem well mapped and the SNP is there, check the MAPQ / BASEQ (click on the reads). phase_caller
only looks at MAPQ >= 50 and BASEQ >= 30, if these values are too high for your sequencing data let me know, I will add them as options to phase_caller
so you can choose other values.
Do such variants not appear in the output from
bin_diff
?
They should because they are in the binary file (otherwise phase_caller
would not have looked at them). They will appear as not switched and will have num PIR at zero. You can search the output for the location on the chromosome as it should be printed.
I appreciate the feedback, your input is very valuable for future upgrades to the tools.
Edit1 : typo
Edit2 : If you want to see all variants with bin_diff
remove the --ac-threshold 1
option, which will only show the singletons.
Best regards, Rick
Hi Rick,
Many thanks for the detailed, insightful comments. All of my questions have been resolved! I will continue to enjoy your nice tool.
I appreciate your hard work.
Best, Masaru
Hi Rick,
I'm sorry for interrupting you again. I have one question.
In the Fig 2' legend in your preprint, you showed that:
The rephased variants are validated and switched variants, right? Can we get this information for each variant from output binary files or log files?
Many thank.
Best, Masaru