samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
662 stars 240 forks source link

rs41303343 is defined as chr7:99652771 A -> AA duplication in GenBank and EnsEMBL #2172

Closed davidyuyuan closed 5 months ago

davidyuyuan commented 5 months ago

The bcftool called chr7:99652770 TA -> TAA without norm, and chr7:99652770 T -> TA with norm.

chr7    99652770    .   TA  TAA 222.366 .   INDEL;IDV=44;IMF=0.44898;DP=98;VDB=0.99321;SGB=-0.693146;RPBZ=-0.0357195;MQBZ=0;MQSBZ=0;BQBZ=0.297792;SCBZ=0;MQ0F=0;AC=1;AN=2;DP4=23,31,27,17;MQ=60 GT:PL   0/1:255,0,255
chr7    99652770    .   T   TA  222.366 .   INDEL;IDV=44;IMF=0.44898;DP=98;VDB=0.99321;SGB=-0.693146;RPBZ=-0.0357195;MQBZ=0;MQSBZ=0;BQBZ=0.297792;SCBZ=0;MQ0F=0;AC=1;AN=2;DP4=23,31,27,17;MQ=60 GT:PL   0/1:255,0,255

Here is a code snippet used in this test

"${HOME}/bcftools/bcftools" mpileup -B -r "${regions}" -Ou -f "${ref_genome}" "${bam_file}" | \
  "${HOME}/bcftools/bcftools" call --ploidy "${ploidy}" -Oz -mv -o "${output_dir}/${file}.abnorm.vcf.gz" --write-index
"${HOME}/bcftools/bcftools" norm -m +any -f "${ref_genome}" -Oz -o "${output_dir}/${file}.unphased.vcf.gz" "${output_dir}/${file}.abnorm.vcf.gz" --write-index

GenBank and EnsEMBL chr7:99652771 A -> AA duplication GenBank: https://www.ncbi.nlm.nih.gov/snp/rs41303343 EnsEMBL: http://www.ensembl.org/Homo_sapiens/Variation/Explore?db=core;r=7:99652271-99653271;v=rs41303343;vdb=variation;vf=481290091

Does this have something to do with the HGVS 3'rule? I should use -g and provide Homo_sapiens.GRCh38.111.gff3.gz to help right-align variants in transcripts on the forward strand?

pd3 commented 5 months ago

I don't understand what is the question exactly? Indels often have ambiguous representation, the same event can be expressed in different ways. The process of normalization left-aligns the variants and trims them to their most parsimonious representation. However, when calling consequences, the transcript strand should be taken into account, this is why the option bcftools norm -g, --gff-annot was added, see also https://github.com/samtools/bcftools/issues/1929.

I hope this answers the question!

davidyuyuan commented 5 months ago

I did not ask the question the right way. My issue is that I assumed incorrectly that Homo_sapiens.GRCh38.111.gff3.gz is the one the bcftools is expecting. It does not seem the case. I am unable to find out which GFF is working with bcftools. Can you please comment on that?

pd3 commented 5 months ago

The command bcftools norm accepts GFF files in the format accepted by http://samtools.github.io/bcftools/bcftools.html#csq. I did not test with the version Homo_sapiens.GRCh38.111.gff3.gz specifically, but I would hope it works.

When you say it is not working, can you be more specific? Is it just that the variant ends up left-aligned or does the program complain?

In the command line you showed above, the bcftools norm -g option is not used.

davidyuyuan commented 5 months ago

Thank you for following up, Petr. I did a manual test and forgot to post the results here. I ran the following command and the summary showed that nothing was realigned:

jovyan@pgxsts-0:~/work$ "${HOME}/bcftools/bcftools" norm -m +any -f ~/work/GRCh38/GRCh38_full_analysis_set_plus_decoy_hla.fa -g ~/work/Homo_sapiens.GRCh38.111.gff3.gz ~/work/sunnybrook/results/NA19207/CYP3A5/actionable/GRCh38/NA19207.GRCh38_noalt.deepvariant.haplotagged.bam.vcf.gz.filtered.bcf -Ov -o ~/work/hgvs.vcf
Parsing /home/jovyan/work/Homo_sapiens.GRCh38.111.gff3.gz ...
Note: truncated transcript ENST00000624652 with incomplete CDS (this is very common)
Indexed 194122 transcripts, 1429327 exons, 886891 CDSs, 381720 UTRs
Ignored the following biotypes:
        4x      .. vault_RNA
        1141x   .. TEC
        57722x  .. lncRNA
Warning: 16903 warnings were supressed, run with `--verbose 2` to see them all
Lines   total/split/joined/realigned/skipped:   28/0/0/0/0

In particular, I was hoping to seechr7:99652770 T -> TA turned to chr7:99652771 A -> AA so that I an annotate it with the ID of rs41303343 with a VCF that I generated with VEP. The output of hgvs.vcf still contains the original left-aligned position.

jovyan@pgxsts-0:~/work$ tail -40 hgvs.vcf 
##bcftools_normVersion=1.19+htslib-1.19
##bcftools_normCommand=norm -m +any -f /home/jovyan/work/GRCh38/GRCh38_full_analysis_set_plus_decoy_hla.fa -g /home/jovyan/work/Homo_sapiens.GRCh38.111.gff3.gz -Oz -o /home/jovyan/work/sunnybrook/results/NA19207/CYP3A5/actionable/GRCh38/NA19207.GRCh38_noalt.deepvariant.haplotagged.bam.unphased.vcf.gz --write-index /home/jovyan/work/sunnybrook/results/NA19207/CYP3A5/actionable/GRCh38/NA19207.GRCh38_noalt.deepvariant.haplotagged.bam.abnorm.vcf.gz; Date=Sun Apr 28 10:15:27 2024
##HiPhase_version="1.4.0-7bf2743"
##HiPhase_command="/opt/conda/bin/hiphase --bam /home/jovyan/work/sunnybrook/GRCh38/TwistPGx/NA19207.GRCh38_noalt.deepvariant.haplotagged.bam --vcf /home/jovyan/work/sunnybrook/results/NA19207/CYP3A5/actionable/GRCh38/NA19207.GRCh38_noalt.deepvariant.haplotagged.bam.unphased.vcf.gz --output-vcf /home/jovyan/work/sunnybrook/results/NA19207/CYP3A5/actionable/GRCh38/NA19207.GRCh38_noalt.deepvariant.haplotagged.bam.vcf.gz --reference /home/jovyan/work/GRCh38/GRCh38_full_analysis_set_plus_decoy_hla.fa --threads 16"
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">
##FORMAT=<ID=PF,Number=1,Type=String,Description="Phasing flag">
##bcftools_normCommand=norm -m +any -f /home/jovyan/work/GRCh38/GRCh38_full_analysis_set_plus_decoy_hla.fa -Ou /home/jovyan/work/sunnybrook/results/NA19207/CYP3A5/actionable/GRCh38/NA19207.GRCh38_noalt.deepvariant.haplotagged.bam.vcf.gz; Date=Sun Apr 28 10:15:58 2024
##FILTER=<ID=LowQual,Description="Set if true: QUAL<100 || (RPBZ<0.1 && QUAL<150) || (AC<2 && QUAL<150)">
##bcftools_filterVersion=1.19+htslib-1.19
##bcftools_filterCommand=filter -s LowQual -e 'QUAL<100 || (RPBZ<0.1 && QUAL<150) || (AC<2 && QUAL<150)' -Ob -o /home/jovyan/work/sunnybrook/results/NA19207/CYP3A5/actionable/GRCh38/NA19207.GRCh38_noalt.deepvariant.haplotagged.bam.vcf.gz.filtered.bcf --write-index; Date=Sun Apr 28 10:15:58 2024
##bcftools_normCommand=norm -m +any -f /home/jovyan/work/GRCh38/GRCh38_full_analysis_set_plus_decoy_hla.fa -g /home/jovyan/work/Homo_sapiens.GRCh38.111.gff3.gz -Ov -o /home/jovyan/work/hgvs.vcf /home/jovyan/work/sunnybrook/results/NA19207/CYP3A5/actionable/GRCh38/NA19207.GRCh38_noalt.deepvariant.haplotagged.bam.vcf.gz.filtered.bcf; Date=Mon Apr 29 09:04:31 2024
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA19207
chr7    99648291        .       A       G       222.368 PASS    DP=107;VDB=0.00482988;SGB=-0.693147;RPBZ=-2.20546;MQBZ=0;MQSBZ=0;BQBZ=0.480742;SCBZ=0;MQ0F=0;AC=1;AN=2;DP4=31,28,27,21;MQ=60    GT:PL:PS        0|1:255,0,255:99648291
chr7    99648664        .       G       GT      43.9587 LowQual INDEL;IDV=58;IMF=0.491525;DP=118;VDB=0.00370827;SGB=-0.693147;RPBZ=0.19383;MQBZ=0;MQSBZ=0;BQBZ=0.378647;SCBZ=0;MQ0F=0;AC=1;AN=2;DP4=27,33,33,21;MQ=60   GT:PL   0/1:78,0,231
chr7    99652376        .       G       A       222.403 PASS    DP=105;VDB=0.762351;SGB=-0.693147;RPBZ=1.11905;MQBZ=0;MQSBZ=0;BQBZ=-0.844702;SCBZ=0;MQ0F=0;AC=1;AN=2;DP4=26,28,30,21;MQ=60      GT:PL:PS        0|1:255,0,255:99648291
chr7    99652770        .       T       TA      222.366 PASS    INDEL;IDV=44;IMF=0.44898;DP=98;VDB=0.99321;SGB=-0.693146;RPBZ=-0.0357195;MQBZ=0;MQSBZ=0;BQBZ=0.297792;SCBZ=0;MQ0F=0;AC=1;AN=2;DP4=23,31,27,17;MQ=60     GT:PL:PS        0|1:255,0,255:99648291
chr7    99653450        .       C       CAAAT,CAAATAAAT 195.247 PASS    INDEL;IDV=72;IMF=0.878049;DP=82;VDB=0.945056;SGB=-0.693147;RPBZ=-0.878944;MQBZ=0;MQSBZ=0;BQBZ=0.194831;SCBZ=-1.61738;MQ0F=0;AC=1,1;AN=2;DP4=7,3,37,35;MQ=60     GT:PL:PS        1|2:255,230,255,255,0,255:99648291

BTW, I also tried to run the command with --verbose 2 but was told that it was an unrecognized option.

jovyan@pgxsts-0:~/work$ "${HOME}/bcftools/bcftools" norm -m +any -f ~/work/GRCh38/GRCh38_full_analysis_set_plus_decoy_hla.fa -g ~/work/Homo_sapiens.GRCh38.111.gff3.gz ~/work/sunnybrook/results/NA19207/CYP3A5/actionable/GRCh38/NA19207.GRCh38_noalt.deepvariant.haplotagged.bam.vcf.gz.filtered.bcf -Ov -o ~/work/hgvs.vcf --verbose 2
norm: unrecognized option '--verbose'
pd3 commented 5 months ago

The --verbose option is missing, you are right, added it just now. The warnings are not essential here.

I checked the transcript and it is on the - strand, therefore the program does left-align the variants as requested by the HGVS 3'rule. Note there is the undocumented option --right-align which is intended for debugging, if you wish to use that to enforce right alignment.

I hope this resolves the issue!

davidyuyuan commented 5 months ago

Thanks Petr. I am happy to try --right-align today or tomorrow. May I assume that Homo_sapiens.GRCh38.111.gff3.gz is no longer needed if I use the --right-align option?

pd3 commented 5 months ago

Yes, it is not needed. Note that ALL indels will be right-aligned with this option, it really was intended only for debugging. I guess you could combine with -r, but....

davidyuyuan commented 5 months ago

Thank you for the further clarification. Much appreciated!