samtools / bcftools

This is the official development repository for BCFtools. See installation instructions and other documentation here http://samtools.github.io/bcftools/howtos/install.html
http://samtools.github.io/bcftools/
Other
679 stars 239 forks source link

Normalize complex variants to meet the requirements of HGVS #1929

Closed user-tq closed 1 year ago

user-tq commented 1 year ago

Recently, I encountered a complex variant with an IGV image of image Simply put, this mutation is

GGAATTAAGAGAAG
G---------GAAC

I called it using mutet2, and ultimately it was represented as

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
7   55242465    .   GGAATTAAGA  G   .   .   .
7   55242478    .   G   C   .   .   .

This seems to follow the left alignment rule

But according to the 3'rule of hgvs

(I confirmed the direction of my focus on transcript NM_005228) this mutation is

GGAATTAAGAGAAG
GGAA---------C
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
7   55242469    .   TTAAGAGAAG  C   .   .   .

I also annotated the final results using VEP. Obviously, VEP does not merge p.hgvs variants.

7   55242465    .   GGAATTAAGA  G   .   .   CSQ=-|inframe_deletion|MODERATE|EGFR|1956|Transcript|NM_001346897.2|protein_coding|18/26||NM_001346897.2:c.2104_2112del|NP_001333826.1:p.Leu702_Glu704del|2362-2370/3848|2101-2109/3276|701-703/1091|ELR/-|GAATTAAGA/-|rs121913436|1||1|||deletion|EntrezGene|||||NP_001333826.1||||||||||||3||||||||||||||||||drug_response||1||||||,-|inframe_deletion|MODERATE|EGFR|1956|Transcript|NM_001346898.2|protein_coding|19/27||NM_001346898.2:c.2239_2247del|NP_001333827.1:p.Leu747_Glu749del|2497-2505/3983|2236-2244/3411|746-748/1136|ELR/-|GAATTAAGA/-|rs121913436|1||1|||deletion|EntrezGene|||||NP_001333827.1||||||||||||3||||||||||||||||||drug_response||1||||||,-|inframe_deletion|MODERATE|EGFR|1956|Transcript|NM_001346899.2|protein_coding|18/27||NM_001346899.2:c.2104_2112del|NP_001333828.1:p.Leu702_Glu704del|2362-2370/9770|2101-2109/3498|701-703/1165|ELR/-|GAATTAAGA/-|rs121913436|1||1|||deletion|EntrezGene|||||NP_001333828.1||||||||||||3||||||||||||||||||drug_response||1||||||,-|inframe_deletion|MODERATE|EGFR|1956|Transcript|NM_001346900.2|protein_coding|19/28||NM_001346900.2:c.2080_2088del|NP_001333829.1:p.Leu694_Glu696del|2268-2276/9676|2077-2085/3474|693-695/1157|ELR/-|GAATTAAGA/-|rs121913436|1||1|||deletion|EntrezGene|||||NP_001333829.1||||||||||||3||||||||||||||||||drug_response||1||||||,-|inframe_deletion|MODERATE|EGFR|1956|Transcript|NM_001346941.2|protein_coding|13/22||NM_001346941.2:c.1438_1446del|NP_001333870.1:p.Leu480_Glu482del|1696-1704/9104|1435-1443/2832|479-481/943|ELR/-|GAATTAAGA/-|rs121913436|1||1|||deletion|EntrezGene|||||NP_001333870.1||||||||||||3||||||||||||||||||drug_response||1||||||,-|inframe_deletion|MODERATE|EGFR|1956|Transcript|NM_005228.5|protein_coding|19/28||NM_005228.5:c.2239_2247del|NP_005219.2:p.Leu747_Glu749del|2497-2505/9905|2236-2244/3633|746-748/1210|ELR/-|GAATTAAGA/-|rs121913436|1||1||1|deletion|EntrezGene||YES|||NP_005219.2||||||||||||3||||||||||||||||||drug_response||1||||||,-|downstream_gene_variant|MODIFIER|EGFR|1956|Transcript|NM_201284.2|protein_coding||||||||||rs121913436|1|3736|1|||deletion|EntrezGene|||||NP_958441.1||||||||||||||||||||||||||||||drug_response||1||||||,-|downstream_gene_variant|MODIFIER|EGFR-AS1|100507500|Transcript|NR_047551.1|lncRNA||||||||||rs121913436|1|4969|-1|||deletion|EntrezGene||YES|||||||||||||||||||||||||||||||||drug_response||1||||||
7   55242478    .   G   C   .   .   CSQ=C|missense_variant|MODERATE|EGFR|1956|Transcript|NM_001346897.2|protein_coding|18/26||NM_001346897.2:c.2113G>C|NP_001333826.1:p.Ala705Pro|2374/3848|2113/3276|705/1091|A/P|Gca/Cca|rs121913229&COSV51830861|1||1|||SNV|EntrezGene|||||NP_001333826.1|||||||||tolerated(0.24)|probably_damaging(0.942)||||||||||||||||||||likely_pathogenic&uncertain_significance|0&1|1&1|25157968&15737014&22622260|||||,C|missense_variant|MODERATE|EGFR|1956|Transcript|NM_001346898.2|protein_coding|19/27||NM_001346898.2:c.2248G>C|NP_001333827.1:p.Ala750Pro|2509/3983|2248/3411|750/1136|A/P|Gca/Cca|rs121913229&COSV51830861|1||1|||SNV|EntrezGene|||||NP_001333827.1|||||||||tolerated(0.26)|probably_damaging(0.995)||||||||||||||||||||likely_pathogenic&uncertain_significance|0&1|1&1|25157968&15737014&22622260|||||,C|missense_variant|MODERATE|EGFR|1956|Transcript|NM_001346899.2|protein_coding|18/27||NM_001346899.2:c.2113G>C|NP_001333828.1:p.Ala705Pro|2374/9770|2113/3498|705/1165|A/P|Gca/Cca|rs121913229&COSV51830861|1||1|||SNV|EntrezGene|||||NP_001333828.1|||||||||tolerated(0.19)|probably_damaging(0.998)||||||||||||||||||||likely_pathogenic&uncertain_significance|0&1|1&1|25157968&15737014&22622260|||||,C|missense_variant|MODERATE|EGFR|1956|Transcript|NM_001346900.2|protein_coding|19/28||NM_001346900.2:c.2089G>C|NP_001333829.1:p.Ala697Pro|2280/9676|2089/3474|697/1157|A/P|Gca/Cca|rs121913229&COSV51830861|1||1|||SNV|EntrezGene|||||NP_001333829.1|||||||||tolerated(0.21)|probably_damaging(0.942)||||||||||||||||||||likely_pathogenic&uncertain_significance|0&1|1&1|25157968&15737014&22622260|||||,C|missense_variant|MODERATE|EGFR|1956|Transcript|NM_001346941.2|protein_coding|13/22||NM_001346941.2:c.1447G>C|NP_001333870.1:p.Ala483Pro|1708/9104|1447/2832|483/943|A/P|Gca/Cca|rs121913229&COSV51830861|1||1|||SNV|EntrezGene|||||NP_001333870.1|||||||||tolerated(0.2)|probably_damaging(0.986)||||||||||||||||||||likely_pathogenic&uncertain_significance|0&1|1&1|25157968&15737014&22622260|||||,C|missense_variant|MODERATE|EGFR|1956|Transcript|NM_005228.5|protein_coding|19/28||NM_005228.5:c.2248G>C|NP_005219.2:p.Ala750Pro|2509/9905|2248/3633|750/1210|A/P|Gca/Cca|rs121913229&COSV51830861|1||1||1|SNV|EntrezGene||YES|||NP_005219.2|||||||||tolerated(0.18)|probably_damaging(0.942)||||||||||||||||||||likely_pathogenic&uncertain_significance|0&1|1&1|25157968&15737014&22622260|||||,C|downstream_gene_variant|MODIFIER|EGFR|1956|Transcript|NM_201284.2|protein_coding||||||||||rs121913229&COSV51830861|1|3748|1|||SNV|EntrezGene|||||NP_958441.1||||||||||||||||||||||||||||||likely_pathogenic&uncertain_significance|0&1|1&1|25157968&15737014&22622260|||||,C|downstream_gene_variant|MODIFIER|EGFR-AS1|100507500|Transcript|NR_047551.1|lncRNA||||||||||rs121913229&COSV51830861|1|4965|-1|||SNV|EntrezGene||YES|||||||||||||||||||||||||||||||||likely_pathogenic&uncertain_significance|0&1|1&1|25157968&15737014&22622260|||||
7   55242469    .   TTAAGAGAAG  C   .   .   CSQ=C|protein_altering_variant|MODERATE|EGFR|1956|Transcript|NM_001346897.2|protein_coding|18/26||NM_001346897.2:c.2104_2113delinsC|NP_001333826.1:p.Leu702_Ala705delinsPro|2365-2374/3848|2104-2113/3276|702-705/1091|LREA/P|TTAAGAGAAGca/Cca|rs727504278&COSV51765099|1||1|||indel|EntrezGene|||||NP_001333826.1||||||||||||||||||||||||||||||drug_response|0&1|1&1|24033266|||||,C|protein_altering_variant|MODERATE|EGFR|1956|Transcript|NM_001346898.2|protein_coding|19/27||NM_001346898.2:c.2239_2248delinsC|NP_001333827.1:p.Leu747_Ala750delinsPro|2500-2509/3983|2239-2248/3411|747-750/1136|LREA/P|TTAAGAGAAGca/Cca|rs727504278&COSV51765099|1||1|||indel|EntrezGene|||||NP_001333827.1||||||||||||||||||||||||||||||drug_response|0&1|1&1|24033266|||||,C|protein_altering_variant|MODERATE|EGFR|1956|Transcript|NM_001346899.2|protein_coding|18/27||NM_001346899.2:c.2104_2113delinsC|NP_001333828.1:p.Leu702_Ala705delinsPro|2365-2374/9770|2104-2113/3498|702-705/1165|LREA/P|TTAAGAGAAGca/Cca|rs727504278&COSV51765099|1||1|||indel|EntrezGene|||||NP_001333828.1||||||||||||||||||||||||||||||drug_response|0&1|1&1|24033266|||||,C|protein_altering_variant|MODERATE|EGFR|1956|Transcript|NM_001346900.2|protein_coding|19/28||NM_001346900.2:c.2080_2089delinsC|NP_001333829.1:p.Leu694_Ala697delinsPro|2271-2280/9676|2080-2089/3474|694-697/1157|LREA/P|TTAAGAGAAGca/Cca|rs727504278&COSV51765099|1||1|||indel|EntrezGene|||||NP_001333829.1||||||||||||||||||||||||||||||drug_response|0&1|1&1|24033266|||||,C|protein_altering_variant|MODERATE|EGFR|1956|Transcript|NM_001346941.2|protein_coding|13/22||NM_001346941.2:c.1438_1447delinsC|NP_001333870.1:p.Leu480_Ala483delinsPro|1699-1708/9104|1438-1447/2832|480-483/943|LREA/P|TTAAGAGAAGca/Cca|rs727504278&COSV51765099|1||1|||indel|EntrezGene|||||NP_001333870.1||||||||||||||||||||||||||||||drug_response|0&1|1&1|24033266|||||,C|protein_altering_variant|MODERATE|EGFR|1956|Transcript|NM_005228.5|protein_coding|19/28||NM_005228.5:c.2239_2248delinsC|NP_005219.2:p.Leu747_Ala750delinsPro|2500-2509/9905|2239-2248/3633|747-750/1210|LREA/P|TTAAGAGAAGca/Cca|rs727504278&COSV51765099|1||1||1|indel|EntrezGene||YES|||NP_005219.2||||||||||||||||||||||||||||||drug_response|0&1|1&1|24033266|||||,C|downstream_gene_variant|MODIFIER|EGFR|1956|Transcript|NM_201284.2|protein_coding||||||||||rs727504278&COSV51765099|1|3739|1|||indel|EntrezGene|||||NP_958441.1||||||||||||||||||||||||||||||drug_response|0&1|1&1|24033266|||||,C|downstream_gene_variant|MODIFIER|EGFR-AS1|100507500|Transcript|NR_047551.1|lncRNA||||||||||rs727504278&COSV51765099|1|4965|-1|||indel|EntrezGene||YES|||||||||||||||||||||||||||||||||drug_response|0&1|1&1|24033266|||||

I am looking for tools to adjust the location of del occurrence based on the direction of the transcript of interest to meet the final HGVS requirements. It seems that bcftool cannot achieve this either, Can corresponding functions be added? Thank you for any suggestions

pd3 commented 1 year ago

Currently we have only left-normalizing functionality, in bcftools norm. The consequence annotation in bcftools csq is able right-align internally and give a consequence based on that when close to exon boundaries, but only to a limited extent, you may want to try that.

I can imagine extending bcftools norm to support flexible left vs right alignment for cases where there is no overlapping transcipt on the opposite strand.

pd3 commented 1 year ago

I checked that in this particular case bcftools csq gives the same prediction for both left- and right- aligned indel.

After checking the HGVS recommendations I am confused though: your example is in the gene EGFR which is on the plus strand, therefore this specific case should be left-aligned, no?

user-tq commented 1 year ago

Sorry for taking so long to reply, I think I need to clarify my question again,If there are any issues with my understanding, please point out. Direction of NM_005228 is + ,i get it from this.

GeneID:1956 ENSG00000146648.21  HGNC:3236   EGFR    epidermal growth factor receptor    NM_005228.5 NP_005219.2 ENST00000275493.7   ENSP00000275493.2   MANE Select 7   55019017    55211628    +

(I understand + as the direction of the transcript is the same as that of ref, Because ref is also ‘+’ strand of chromosomes)

left----------------ref------------------right
5’--------------transcript----------------3‘

now, bwa(or other alignment tool)say:the del should be the left of ref. hgvs say: the del should be the 3' pos of transcript. In the end, a different representation emerged。

GGAATTAAGAGAAG
G---------GAAC
GGAA---------C

Of course, they are actually the same thing。 As a result, different VCF displays have also emerged。

#  left- aligned
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
7   55242465    .   GGAATTAAGA  G   .   .   .
7   55242478    .   G   C   .   .   .

and

#hgvs
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
7   55242469    .   TTAAGAGAAG  C   .   .   .
pd3 commented 1 year ago

You are right, my mistake. Variants on + strand transcripts should be right-aligned while - strand transcript variants should be left-aligned.

pd3 commented 1 year ago

I just added support for this via a new option -g, --gff-annot. Please try it out and let me know if you spot anything odd.