Closed davidyuyuan closed 5 months ago
I don't understand what is the question exactly? Indels often have ambiguous representation, the same event can be expressed in different ways. The process of normalization left-aligns the variants and trims them to their most parsimonious representation. However, when calling consequences, the transcript strand should be taken into account, this is why the option bcftools norm -g, --gff-annot
was added, see also https://github.com/samtools/bcftools/issues/1929.
I hope this answers the question!
I did not ask the question the right way. My issue is that I assumed incorrectly that Homo_sapiens.GRCh38.111.gff3.gz is the one the bcftools is expecting. It does not seem the case. I am unable to find out which GFF is working with bcftools. Can you please comment on that?
The command bcftools norm
accepts GFF files in the format accepted by http://samtools.github.io/bcftools/bcftools.html#csq. I did not test with the version Homo_sapiens.GRCh38.111.gff3.gz specifically, but I would hope it works.
When you say it is not working, can you be more specific? Is it just that the variant ends up left-aligned or does the program complain?
In the command line you showed above, the bcftools norm -g
option is not used.
Thank you for following up, Petr. I did a manual test and forgot to post the results here. I ran the following command and the summary showed that nothing was realigned:
jovyan@pgxsts-0:~/work$ "${HOME}/bcftools/bcftools" norm -m +any -f ~/work/GRCh38/GRCh38_full_analysis_set_plus_decoy_hla.fa -g ~/work/Homo_sapiens.GRCh38.111.gff3.gz ~/work/sunnybrook/results/NA19207/CYP3A5/actionable/GRCh38/NA19207.GRCh38_noalt.deepvariant.haplotagged.bam.vcf.gz.filtered.bcf -Ov -o ~/work/hgvs.vcf
Parsing /home/jovyan/work/Homo_sapiens.GRCh38.111.gff3.gz ...
Note: truncated transcript ENST00000624652 with incomplete CDS (this is very common)
Indexed 194122 transcripts, 1429327 exons, 886891 CDSs, 381720 UTRs
Ignored the following biotypes:
4x .. vault_RNA
1141x .. TEC
57722x .. lncRNA
Warning: 16903 warnings were supressed, run with `--verbose 2` to see them all
Lines total/split/joined/realigned/skipped: 28/0/0/0/0
In particular, I was hoping to seechr7:99652770 T -> TA
turned to chr7:99652771 A -> AA
so that I an annotate it with the ID of rs41303343 with a VCF that I generated with VEP. The output of hgvs.vcf still contains the original left-aligned position.
jovyan@pgxsts-0:~/work$ tail -40 hgvs.vcf
##bcftools_normVersion=1.19+htslib-1.19
##bcftools_normCommand=norm -m +any -f /home/jovyan/work/GRCh38/GRCh38_full_analysis_set_plus_decoy_hla.fa -g /home/jovyan/work/Homo_sapiens.GRCh38.111.gff3.gz -Oz -o /home/jovyan/work/sunnybrook/results/NA19207/CYP3A5/actionable/GRCh38/NA19207.GRCh38_noalt.deepvariant.haplotagged.bam.unphased.vcf.gz --write-index /home/jovyan/work/sunnybrook/results/NA19207/CYP3A5/actionable/GRCh38/NA19207.GRCh38_noalt.deepvariant.haplotagged.bam.abnorm.vcf.gz; Date=Sun Apr 28 10:15:27 2024
##HiPhase_version="1.4.0-7bf2743"
##HiPhase_command="/opt/conda/bin/hiphase --bam /home/jovyan/work/sunnybrook/GRCh38/TwistPGx/NA19207.GRCh38_noalt.deepvariant.haplotagged.bam --vcf /home/jovyan/work/sunnybrook/results/NA19207/CYP3A5/actionable/GRCh38/NA19207.GRCh38_noalt.deepvariant.haplotagged.bam.unphased.vcf.gz --output-vcf /home/jovyan/work/sunnybrook/results/NA19207/CYP3A5/actionable/GRCh38/NA19207.GRCh38_noalt.deepvariant.haplotagged.bam.vcf.gz --reference /home/jovyan/work/GRCh38/GRCh38_full_analysis_set_plus_decoy_hla.fa --threads 16"
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">
##FORMAT=<ID=PF,Number=1,Type=String,Description="Phasing flag">
##bcftools_normCommand=norm -m +any -f /home/jovyan/work/GRCh38/GRCh38_full_analysis_set_plus_decoy_hla.fa -Ou /home/jovyan/work/sunnybrook/results/NA19207/CYP3A5/actionable/GRCh38/NA19207.GRCh38_noalt.deepvariant.haplotagged.bam.vcf.gz; Date=Sun Apr 28 10:15:58 2024
##FILTER=<ID=LowQual,Description="Set if true: QUAL<100 || (RPBZ<0.1 && QUAL<150) || (AC<2 && QUAL<150)">
##bcftools_filterVersion=1.19+htslib-1.19
##bcftools_filterCommand=filter -s LowQual -e 'QUAL<100 || (RPBZ<0.1 && QUAL<150) || (AC<2 && QUAL<150)' -Ob -o /home/jovyan/work/sunnybrook/results/NA19207/CYP3A5/actionable/GRCh38/NA19207.GRCh38_noalt.deepvariant.haplotagged.bam.vcf.gz.filtered.bcf --write-index; Date=Sun Apr 28 10:15:58 2024
##bcftools_normCommand=norm -m +any -f /home/jovyan/work/GRCh38/GRCh38_full_analysis_set_plus_decoy_hla.fa -g /home/jovyan/work/Homo_sapiens.GRCh38.111.gff3.gz -Ov -o /home/jovyan/work/hgvs.vcf /home/jovyan/work/sunnybrook/results/NA19207/CYP3A5/actionable/GRCh38/NA19207.GRCh38_noalt.deepvariant.haplotagged.bam.vcf.gz.filtered.bcf; Date=Mon Apr 29 09:04:31 2024
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA19207
chr7 99648291 . A G 222.368 PASS DP=107;VDB=0.00482988;SGB=-0.693147;RPBZ=-2.20546;MQBZ=0;MQSBZ=0;BQBZ=0.480742;SCBZ=0;MQ0F=0;AC=1;AN=2;DP4=31,28,27,21;MQ=60 GT:PL:PS 0|1:255,0,255:99648291
chr7 99648664 . G GT 43.9587 LowQual INDEL;IDV=58;IMF=0.491525;DP=118;VDB=0.00370827;SGB=-0.693147;RPBZ=0.19383;MQBZ=0;MQSBZ=0;BQBZ=0.378647;SCBZ=0;MQ0F=0;AC=1;AN=2;DP4=27,33,33,21;MQ=60 GT:PL 0/1:78,0,231
chr7 99652376 . G A 222.403 PASS DP=105;VDB=0.762351;SGB=-0.693147;RPBZ=1.11905;MQBZ=0;MQSBZ=0;BQBZ=-0.844702;SCBZ=0;MQ0F=0;AC=1;AN=2;DP4=26,28,30,21;MQ=60 GT:PL:PS 0|1:255,0,255:99648291
chr7 99652770 . T TA 222.366 PASS INDEL;IDV=44;IMF=0.44898;DP=98;VDB=0.99321;SGB=-0.693146;RPBZ=-0.0357195;MQBZ=0;MQSBZ=0;BQBZ=0.297792;SCBZ=0;MQ0F=0;AC=1;AN=2;DP4=23,31,27,17;MQ=60 GT:PL:PS 0|1:255,0,255:99648291
chr7 99653450 . C CAAAT,CAAATAAAT 195.247 PASS INDEL;IDV=72;IMF=0.878049;DP=82;VDB=0.945056;SGB=-0.693147;RPBZ=-0.878944;MQBZ=0;MQSBZ=0;BQBZ=0.194831;SCBZ=-1.61738;MQ0F=0;AC=1,1;AN=2;DP4=7,3,37,35;MQ=60 GT:PL:PS 1|2:255,230,255,255,0,255:99648291
BTW, I also tried to run the command with --verbose 2
but was told that it was an unrecognized option.
jovyan@pgxsts-0:~/work$ "${HOME}/bcftools/bcftools" norm -m +any -f ~/work/GRCh38/GRCh38_full_analysis_set_plus_decoy_hla.fa -g ~/work/Homo_sapiens.GRCh38.111.gff3.gz ~/work/sunnybrook/results/NA19207/CYP3A5/actionable/GRCh38/NA19207.GRCh38_noalt.deepvariant.haplotagged.bam.vcf.gz.filtered.bcf -Ov -o ~/work/hgvs.vcf --verbose 2
norm: unrecognized option '--verbose'
The --verbose
option is missing, you are right, added it just now. The warnings are not essential here.
I checked the transcript and it is on the -
strand, therefore the program does left-align the variants as requested by the HGVS 3'rule. Note there is the undocumented option --right-align
which is intended for debugging, if you wish to use that to enforce right alignment.
I hope this resolves the issue!
Thanks Petr. I am happy to try --right-align today or tomorrow. May I assume that Homo_sapiens.GRCh38.111.gff3.gz is no longer needed if I use the --right-align option?
Yes, it is not needed. Note that ALL indels will be right-aligned with this option, it really was intended only for debugging. I guess you could combine with -r
, but....
Thank you for the further clarification. Much appreciated!
The bcftool called chr7:99652770 TA -> TAA without norm, and chr7:99652770 T -> TA with norm.
Here is a code snippet used in this test
GenBank and EnsEMBL chr7:99652771 A -> AA duplication GenBank: https://www.ncbi.nlm.nih.gov/snp/rs41303343 EnsEMBL: http://www.ensembl.org/Homo_sapiens/Variation/Explore?db=core;r=7:99652271-99653271;v=rs41303343;vdb=variation;vf=481290091
Does this have something to do with the HGVS 3'rule? I should use -g and provide Homo_sapiens.GRCh38.111.gff3.gz to help right-align variants in transcripts on the forward strand?