openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0
67 stars 21 forks source link

Variants that lift over into a different variant at hg38. #202

Open ifokkema opened 4 years ago

ifokkema commented 4 years ago

Hi Pete,

I have an interesting case for you. Perhaps an annoying one. I accidentally ran into some variants that suddenly make me doubt the quality of the hg38 mappings using UTA.

When providing an hg19 description, VV also generates the hg38 description and checks it. However, what it checks, is the description that is returned by UTA based on simple mapping calculations. VV doesn't seem to check if the involved sequence is the same. This makes VV vulnerable for the same mistakes that Mutalyzer makes when mapping and lifting over. A duplication of an "A" on hg19 should map to a duplication of an "A" on hg38, unless that's the wildtype sequence of hg38. It should not map to an insertion of some other random base, but apparently, this does seem to happen sometimes. And a G>A substitution on hg19 should not be able to map to a C>T substitution on hg38, right?

The following variant got my attention simply because of the dup/ins issue discussed in #191, but then I happened to notice the ref/alt fields. I normally don't use them, and normally discard them, but it looks like they serve an interesting purpose now.

NC_000001.10:g.145509010_145509016dup

Reduced output, note the VCF fields:

{
  "NC_000001.10:g.145509010_145509016dup": {
    "NC_000001.10:g.145509010_145509016dup": {
      "g_hgvs": "NC_000001.10:g.145509010_145509016dup",
      "hgvs_t_and_p": {
        "NM_005105.3": {
          "primary_assembly_loci": {
            "grch37": {
              "NC_000001.10": {
                "hgvs_genomic_description": "NC_000001.10:g.145509010_145509016dup",
                "vcf": {
                  "alt": "CGTTGACT",
                  "ref": "C"
                }
              }
            },
            "grch38": {
              "NC_000001.11": {
                "hgvs_genomic_description": "NC_000001.11:g.145926083_145926084insGTCAACA",
                "vcf": {
                  "alt": "AAGTCAAC",
                  "ref": "A"
                }
              }
            }
          },
          "t_hgvs": "NM_005105.3:c.437_443dup",
        },
        "NM_005105.4": {
          "primary_assembly_loci": {
            "grch37": {
              "NC_000001.10": {
                "hgvs_genomic_description": "NC_000001.10:g.145509010_145509016dup",
                "vcf": {
                  "alt": "CGTTGACT",
                  "ref": "C"
                }
              }
            },
            "grch38": {
              "NC_000001.11": {
                "hgvs_genomic_description": "NC_000001.11:g.145926077_145926083dup",
                "vcf": {
                  "alt": "CCAGTCAA",
                  "ref": "C"
                }
              }
            }
          },
          "t_hgvs": "NM_005105.4:c.437_443dup",
        }
      }
    }
  }
}

This variant, a duplication of "GTTGACT" on hg19, turns to either an insertion of "GTCAACA" or a duplication of "CAGTCAA" on hg38, depending on which transcript's mapping is being used. The two hg38 descriptions are similar; of course a duplication can be shifted differently if the surrounding sequence is different, and the inserted sequence can therefore look different, but it should at least contain the same bases, as is the case here. It just shows that the insertion happened at a different location due to mapping differences in the transcripts. But the hg19 description versus the hg38 description is completely different. It inserts a completely different sequence.

I thought this was maybe due to severe differences in sequence between the two builds around these position, so I decided to ask VV to predict the liftover of the variant's original surrounding sequence; NC_000001.10:g.145509000_145509020= shows that WT hg19 equals a WT hg38 in that location.

I'm at a loss how this can occur.


A similar case is NC_000001.10:g.150069206_150069207del; you can see in the reduced JSON below that multiple transcripts provide multiple descriptions (or no hg38 descriptions), and the REF fields indicate a different sequence has been deleted. Based on the VCF fields, I would say NR_103998.1's mapping is the correct one, and the others are all wrong.

When sending this variant's surrounding sequence to VT, I got a combination of WT hg38 (again, NR_103998.1 and some transcripts), while other transcripts returned a delins.

{
  "NC_000001.10:g.150069206_150069207del": {
    "NC_000001.10:g.150069206_150069207del": {
      "g_hgvs": "NC_000001.10:g.150069206_150069207del",
      "hgvs_t_and_p": {
        "NM_001279353.1": {
          "primary_assembly_loci": {
            "grch37": {
              "NC_000001.10": {
                "hgvs_genomic_description": "NC_000001.10:g.150069206_150069207del",
                "vcf": {
                  "alt": "C",
                  "ref": "CTT"
                }
              }
            },
            "grch38": {
              "NC_000001.11": {
                "hgvs_genomic_description": "NC_000001.11:g.150097110_150097111del",
                "vcf": {
                  "alt": "G",
                  "ref": "GAC"
                }
              }
            }
          },
          "t_hgvs": "NM_001279353.1:c.1178+3462_1178+3463del",
        },
        "NM_001279354.1": {
          "primary_assembly_loci": {
            "grch37": {
              "NC_000001.10": {
                "hgvs_genomic_description": "NC_000001.10:g.150069206_150069207del",
                "vcf": {
                  "alt": "C",
                  "ref": "CTT"
                }
              }
            },
            "grch38": {
              "NC_000001.11": {
                "hgvs_genomic_description": "NC_000001.11:g.150097110_150097111del",
                "vcf": {
                  "alt": "G",
                  "ref": "GAC"
                }
              }
            }
          },
          "t_hgvs": "NM_001279354.1:c.1385+3462_1385+3463del",
        },
        "NM_001279355.1": {
          "primary_assembly_loci": {
            "grch37": {
              "NC_000001.10": {
                "hgvs_genomic_description": "NC_000001.10:g.150069206_150069207del",
                "vcf": {
                  "alt": "C",
                  "ref": "CTT"
                }
              }
            },
            "grch38": {
              "NC_000001.11": {
                "hgvs_genomic_description": "NC_000001.11:g.150097110_150097111del",
                "vcf": {
                  "alt": "G",
                  "ref": "GAC"
                }
              }
            }
          },
          "t_hgvs": "NM_001279355.1:c.1124+3462_1124+3463del",
        },
        "NM_007259.3": {
          "primary_assembly_loci": {
            "grch37": {
              "NC_000001.10": {
                "hgvs_genomic_description": "NC_000001.10:g.150069206_150069207del",
                "vcf": {
                  "alt": "C",
                  "ref": "CTT"
                }
              }
            },
            "grch38": {}
          },
          "t_hgvs": "NM_007259.3:c.1493+3462_1493+3463del",
        },
        "NM_007259.4": {
          "primary_assembly_loci": {
            "grch37": {
              "NC_000001.10": {
                "hgvs_genomic_description": "NC_000001.10:g.150069206_150069207del",
                "vcf": {
                  "alt": "C",
                  "ref": "CTT"
                }
              }
            },
            "grch38": {
              "NC_000001.11": {
                "hgvs_genomic_description": "NC_000001.11:g.150097110_150097111del",
                "vcf": {
                  "alt": "G",
                  "ref": "GAC"
                }
              }
            }
          },
          "t_hgvs": "NM_007259.4:c.1493+3462_1493+3463del",
        },
        "NM_007259.5": {
          "primary_assembly_loci": {
            "grch37": {
              "NC_000001.10": {
                "hgvs_genomic_description": "NC_000001.10:g.150069206_150069207del",
                "vcf": {
                  "alt": "C",
                  "ref": "CTT"
                }
              }
            },
            "grch38": {}
          },
          "t_hgvs": "NM_007259.5:c.1493+3462_1493+3463del",
        },
        "NR_103998.1": {
          "primary_assembly_loci": {
            "grch37": {
              "NC_000001.10": {
                "hgvs_genomic_description": "NC_000001.10:g.150069206_150069207del",
                "vcf": {
                  "alt": "C",
                  "ref": "CTT"
                }
              }
            },
            "grch38": {
              "NC_000001.11": {
                "hgvs_genomic_description": "NC_000001.11:g.150097104_150097105del",
                "vcf": {
                  "alt": "C",
                  "ref": "CTT"
                }
              }
            }
          },
          "t_hgvs": "NR_103998.1:n.1477-1746_1477-1745del",
        }
      }
    }
  }
}

Finally, here's NC_000001.10:g.145473384G>A, mapping to two different C>T substitutions on hg38.

{
  "NC_000001.10:g.145473384G>A": {
    "NC_000001.10:g.145473384G>A": {
      "g_hgvs": "NC_000001.10:g.145473384G>A",
      "hgvs_t_and_p": {
        "NM_001039888.2": {
          "primary_assembly_loci": {
            "grch37": {
              "NC_000001.10": {
                "hgvs_genomic_description": "NC_000001.10:g.145473384G>A",
                "vcf": {
                  "alt": "A",
                  "ref": "G"
                }
              }
            },
            "grch38": {
              "NC_000001.11": {
                "hgvs_genomic_description": "NC_000001.11:g.145961702C>T",
                "vcf": {
                  "alt": "T",
                  "ref": "C"
                }
              }
            }
          },
          "t_hgvs": "NM_001039888.2:c.56G>A",
        },
        "NM_001039888.3": {
          "primary_assembly_loci": {
            "grch37": {
              "NC_000001.10": {
                "hgvs_genomic_description": "NC_000001.10:g.145473384G>A",
                "vcf": {
                  "alt": "A",
                  "ref": "G"
                }
              }
            },
            "grch38": {
              "NC_000001.11": {
                "hgvs_genomic_description": "NC_000001.11:g.145961704C>T",
                "vcf": {
                  "alt": "T",
                  "ref": "C"
                }
              }
            }
          },
          "t_hgvs": "NM_001039888.3:c.56G>A",
        }
      }
    }
  }
}

I'm guessing UTA says these transcripts map on the reverse strand in hg38.

Peter-J-Freeman commented 4 years ago

I need to look at this. It is possibly some sort of structural/sequential difference between GRCh37 and GRCh38. Might have to run some alignments!!!!

ifokkema commented 2 years ago

Hi Pete, is the server perhaps not updated yet? The provided links still provide incorrect output. (I've updated the links to use rest.variantvalidator.org again)

ifokkema commented 1 year ago

Hi Pete, it seems this never got fixed. Could you reopen this one, please?

Peter-J-Freeman commented 1 year ago

Yes, good time to reopen it because we have the new version of VVTA built ready for testing

Peter-J-Freeman commented 1 year ago

right @ifokkema . Need to try resolve this quickly. Not sure why it got closed.

So, for the input description NC_000001.10:g.145509010_145509016dup

What I am seeing in the initial post is that depending on the transcript selected, the c. descriptions are

NM_005105.3:c.437_443dup NM_005105.4:c.437_443dup

So the transcripts are equally aligned to GRCh37, but mapping to GRCh38 is suspect

NM_005105.3:c.437_443dup > NC_000001.11:g.145926083_145926084insGTCAACA NM_005105.4:c.437_443dup > NC_000001.11:g.145926077_145926083dup

Going to add some working using the latest builds of VV, VF and the databases

Peter-J-Freeman commented 1 year ago

First, what does VV do?

import json
import VariantValidator
vval = VariantValidator.Validator()
variant = 'NC_000001.10:g.145509010_145509016dup' # variant 1
genome_build = 'GRCh37'
select_transcripts = 'all'
transcript_set = 'refseq'
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
{
    "NM_005105.3:c.437_443dup": {
        "alt_genomic_loci": [],
        "annotations": {
            "chromosome": "1",
            "db_xref": {
                "CCDS": "CCDS916.1",
                "HPRD": "05609",
                "ensemblgene": null,
                "hgnc": "HGNC:9905",
                "ncbigene": "9939",
                "select": false
            },
            "ensembl_select": false,
            "mane_plus_clinical": false,
            "mane_select": false,
            "map": "1q21.1",
            "note": "RNA binding motif protein 8A",
            "refseq_select": false,
            "variant": "0"
        },
        "gene_ids": {
            "ccds_ids": [
                "CCDS72872"
            ],
            "ensembl_gene_id": "ENSG00000265241",
            "entrez_gene_id": "9939",
            "hgnc_id": "HGNC:9905",
            "omim_id": [
                "605313"
            ],
            "ucsc_id": "uc031uto.3"
        },
        "gene_symbol": "RBM8A",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "LRG_574p1:p.(W148Cfs*2)",
            "lrg_tlr": "LRG_574p1:p.(Trp148CysfsTer2)",
            "slr": "NP_005096.1:p.(W148Cfs*2)",
            "tlr": "NP_005096.1:p.(Trp148CysfsTer2)"
        },
        "hgvs_refseqgene_variant": "NG_032654.1:g.6454_6460dup",
        "hgvs_transcript_variant": "NM_005105.3:c.437_443dup",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000001.10:g.145509010_145509016dup",
                "vcf": {
                    "alt": "CGTTGACT",
                    "chr": "1",
                    "pos": "145509008",
                    "ref": "C"
                }
            },
            "grch38": {
                "hgvs_genomic_description": "NC_000001.11:g.145926083_145926084insGTCAACG",   # OK, here it is for the .3 version
                "vcf": {
                    "alt": "AGTCAACG",
                    "chr": "1",
                    "pos": "145926083",
                    "ref": "A"
                }
            },
            "hg19": {
                "hgvs_genomic_description": "NC_000001.10:g.145509010_145509016dup",
                "vcf": {
                    "alt": "CGTTGACT",
                    "chr": "chr1",
                    "pos": "145509008",
                    "ref": "C"
                }
            },
            "hg38": {
                "hgvs_genomic_description": "NC_000001.11:g.145926083_145926084insGTCAACG",
                "vcf": {
                    "alt": "AGTCAACG",
                    "chr": "chr1",
                    "pos": "145926083",
                    "ref": "A"
                }
            }
        },
        "reference_sequence_records": {
            "protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_005096.1",
            "refseqgene": "https://www.ncbi.nlm.nih.gov/nuccore/NG_032654.1",
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_005105.3"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh37",
        "submitted_variant": "NC_000001.10:g.145509010_145509016dup",
        "transcript_description": "Homo sapiens RNA binding motif protein 8A (RBM8A), mRNA",
        "validation_warnings": [
            "A more recent version of the selected reference sequence NM_005105.3 is available (NM_005105.5): NM_005105.5:c.437_443dup MUST be fully validated prior to use in reports: select_variants=NM_005105.5:c.437_443dup"
        ],
        "variant_exonic_positions": {
            "NC_000001.10": {
                "end_exon": "5",
                "start_exon": "5"
            },
            "NG_032654.1": {
                "end_exon": "5",
                "start_exon": "5"
            }
        }
    },
    "NM_005105.4:c.437_443dup": {
        "alt_genomic_loci": [
            {
                "grch37": {
                    "hgvs_genomic_description": "NW_003871055.3:g.2741491_2741497dup",
                    "vcf": {
                        "alt": "CCAGTCAA",
                        "chr": "HG1287_PATCH",
                        "pos": "2741489",
                        "ref": "C"
                    }
                }
            },
            {
                "hg19": {
                    "hgvs_genomic_description": "NW_003871055.3:g.2741491_2741497dup",
                    "vcf": {
                        "alt": "CCAGTCAA",
                        "chr": "NW_003871055.3",
                        "pos": "2741489",
                        "ref": "C"
                    }
                }
            }
        ],
        "annotations": {
            "chromosome": "1",
            "db_xref": {
                "CCDS": "CCDS72872.1",
                "ensemblgene": null,
                "hgnc": "HGNC:9905",
                "ncbigene": "9939",
                "select": "RefSeq"
            },
            "ensembl_select": false,
            "mane_plus_clinical": false,
            "mane_select": false,
            "map": "1q21.1",
            "note": "RNA binding motif protein 8A",
            "refseq_select": true,
            "variant": "0"
        },
        "gene_ids": {
            "ccds_ids": [
                "CCDS72872"
            ],
            "ensembl_gene_id": "ENSG00000265241",
            "entrez_gene_id": "9939",
            "hgnc_id": "HGNC:9905",
            "omim_id": [
                "605313"
            ],
            "ucsc_id": "uc031uto.3"
        },
        "gene_symbol": "RBM8A",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "LRG_574t1:c.437_443dup",
        "hgvs_lrg_variant": "LRG_574:g.6454_6460dup",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "LRG_574p1:p.(W148Cfs*2)",
            "lrg_tlr": "LRG_574p1:p.(Trp148CysfsTer2)",
            "slr": "NP_005096.1:p.(W148Cfs*2)",
            "tlr": "NP_005096.1:p.(Trp148CysfsTer2)"
        },
        "hgvs_refseqgene_variant": "NG_032654.2:g.6454_6460dup",
        "hgvs_transcript_variant": "NM_005105.4:c.437_443dup",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000001.10:g.145509010_145509016dup",
                "vcf": {
                    "alt": "CGTTGACT",
                    "chr": "1",
                    "pos": "145509008",
                    "ref": "C"
                }
            },
            "grch38": {
                "hgvs_genomic_description": "NC_000001.11:g.145926078_145926084dup",  # the .4 version looks OK
                "vcf": {
                    "alt": "CCAGTCAA",
                    "chr": "1",
                    "pos": "145926076",
                    "ref": "C"
                }
            },
            "hg19": {
                "hgvs_genomic_description": "NC_000001.10:g.145509010_145509016dup",
                "vcf": {
                    "alt": "CGTTGACT",
                    "chr": "chr1",
                    "pos": "145509008",
                    "ref": "C"
                }
            },
            "hg38": {
                "hgvs_genomic_description": "NC_000001.11:g.145926078_145926084dup",
                "vcf": {
                    "alt": "CCAGTCAA",
                    "chr": "chr1",
                    "pos": "145926076",
                    "ref": "C"
                }
            }
        },
        "reference_sequence_records": {
            "lrg": "http://ftp.ebi.ac.uk/pub/databases/lrgex/LRG_574.xml",
            "protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_005096.1",
            "refseqgene": "https://www.ncbi.nlm.nih.gov/nuccore/NG_032654.2",
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_005105.4"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh37",
        "submitted_variant": "NC_000001.10:g.145509010_145509016dup",
        "transcript_description": "Homo sapiens RNA binding motif protein 8A (RBM8A), mRNA",
        "validation_warnings": [
            "A more recent version of the selected reference sequence NM_005105.4 is available (NM_005105.5): NM_005105.5:c.437_443dup MUST be fully validated prior to use in reports: select_variants=NM_005105.5:c.437_443dup"
        ],
        "variant_exonic_positions": {
            "NC_000001.10": {
                "end_exon": "5",
                "start_exon": "5"
            },
            "NC_000001.11": {
                "end_exon": "5",
                "start_exon": "5"
            },
            "NG_032654.2": {
                "end_exon": "5",
                "start_exon": "5"
            }
        }
    },
    "NM_005105.5:c.437_443dup": {
        "alt_genomic_loci": [
            {
                "grch37": {
                    "hgvs_genomic_description": "NW_003871055.3:g.2741491_2741497dup",
                    "vcf": {
                        "alt": "CCAGTCAA",
                        "chr": "HG1287_PATCH",
                        "pos": "2741489",
                        "ref": "C"
                    }
                }
            },
            {
                "hg19": {
                    "hgvs_genomic_description": "NW_003871055.3:g.2741491_2741497dup",
                    "vcf": {
                        "alt": "CCAGTCAA",
                        "chr": "NW_003871055.3",
                        "pos": "2741489",
                        "ref": "C"
                    }
                }
            }
        ],
        "annotations": {
            "chromosome": "1",
            "db_xref": {
                "CCDS": "CCDS72872.1",
                "ensemblgene": null,
                "hgnc": "HGNC:9905",
                "ncbigene": "9939",
                "select": "MANE"
            },
            "ensembl_select": false,
            "mane_plus_clinical": false,
            "mane_select": true,  # Here is a clue. MANE Select. I wonder if there has been a sequence change to match the transcript to the genome. Next I'm going to check the CIGAR. 
            "map": "1q21.1",
            "note": "RNA binding motif protein 8A",
            "refseq_select": true,
            "variant": "0"
        },
        "gene_ids": {
            "ccds_ids": [
                "CCDS72872"
            ],
            "ensembl_gene_id": "ENSG00000265241",
            "entrez_gene_id": "9939",
            "hgnc_id": "HGNC:9905",
            "omim_id": [
                "605313"
            ],
            "ucsc_id": "uc031uto.3"
        },
        "gene_symbol": "RBM8A",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "LRG_574p1:p.(W148Cfs*2)",
            "lrg_tlr": "LRG_574p1:p.(Trp148CysfsTer2)",
            "slr": "NP_005096.1:p.(W148Cfs*2)",
            "tlr": "NP_005096.1:p.(Trp148CysfsTer2)"
        },
        "hgvs_refseqgene_variant": "",
        "hgvs_transcript_variant": "NM_005105.5:c.437_443dup",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000001.10:g.145509010_145509016dup",
                "vcf": {
                    "alt": "CGTTGACT",
                    "chr": "1",
                    "pos": "145509008",
                    "ref": "C"
                }
            },
            "grch38": {
                "hgvs_genomic_description": "NC_000001.11:g.145926078_145926084dup",  # This is OK again for the .5 version
                "vcf": {
                    "alt": "CCAGTCAA",
                    "chr": "1",
                    "pos": "145926076",
                    "ref": "C"
                }
            },
            "hg19": {
                "hgvs_genomic_description": "NC_000001.10:g.145509010_145509016dup",
                "vcf": {
                    "alt": "CGTTGACT",
                    "chr": "chr1",
                    "pos": "145509008",
                    "ref": "C"
                }
            },
            "hg38": {
                "hgvs_genomic_description": "NC_000001.11:g.145926078_145926084dup",
                "vcf": {
                    "alt": "CCAGTCAA",
                    "chr": "chr1",
                    "pos": "145926076",
                    "ref": "C"
                }
            }
        },
        "reference_sequence_records": {
            "protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_005096.1",
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_005105.5"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh37",
        "submitted_variant": "NC_000001.10:g.145509010_145509016dup",
        "transcript_description": "Homo sapiens RNA binding motif protein 8A (RBM8A), mRNA",
        "validation_warnings": [],
        "variant_exonic_positions": {
            "NC_000001.10": {
                "end_exon": "5",
                "start_exon": "5"
            },
            "NC_000001.11": {
                "end_exon": "5",
                "start_exon": "5"
            }
        }
    },
    "flag": "gene_variant",
    "metadata": {
        "variantvalidator_hgvs_version": "2.0.2.dev5+g69b1a7c",
        "variantvalidator_version": "2.1.1.dev53+g44ed2dd.d20221121",
        "vvdb_version": "vvdb_2022_11",
        "vvseqrepo_db": "VV_SR_2022_11/master",
        "vvta_version": "vvta_2022_11"
    }
}
Peter-J-Freeman commented 1 year ago
import json
import VariantValidator
vval = VariantValidator.Validator()
gene = 'RBM8A'
g_and_t = vval.gene2transcripts(gene)
print(json.dumps(g_and_t, sort_keys=True, indent=4, separators=(',', ': ')))
{
    "current_name": "RNA binding motif protein 8A",
    "current_symbol": "RBM8A",
    "hgnc": "HGNC:9905",
    "previous_symbol": "RBM8",
    "transcripts": [
        {
            "annotations": {
                "chromosome": "1",
                "db_xref": {
                    "CCDS": "CCDS72872.1",
                    "ensemblgene": null,
                    "hgnc": "HGNC:9905",
                    "ncbigene": "9939",
                    "select": "RefSeq"
                },
                "ensembl_select": false,
                "mane_plus_clinical": false,
                "mane_select": false,
                "map": "1q21.1",
                "note": "RNA binding motif protein 8A",
                "refseq_select": true,
                "variant": "0"
            },
            "coding_end": 635,
            "coding_start": 111,
            "description": "Homo sapiens RNA binding motif protein 8A (RBM8A), mRNA",
            "genomic_spans": {
                "NC_000001.10": {
                    "end_position": 145513536,
                    "exon_structure": [
                        {
                            "cigar": "177=",
                            "exon_number": 1,
                            "genomic_end": 145507733,
                            "genomic_start": 145507557,
                            "transcript_end": 177,
                            "transcript_start": 1
                        },
                        {
                            "cigar": "60=",
                            "exon_number": 2,
                            "genomic_end": 145508075,
                            "genomic_start": 145508016,
                            "transcript_end": 237,
                            "transcript_start": 178
                        },
                        {
                            "cigar": "78=",
                            "exon_number": 3,
                            "genomic_end": 145508284,
                            "genomic_start": 145508207,
                            "transcript_end": 315,
                            "transcript_start": 238
                        },
                        {
                            "cigar": "137=",
                            "exon_number": 4,
                            "genomic_end": 145508611,
                            "genomic_start": 145508475,
                            "transcript_end": 452,
                            "transcript_start": 316
                        },
                        {
                            "cigar": "137=",
                            "exon_number": 5,
                            "genomic_end": 145509052,
                            "genomic_start": 145508916,
                            "transcript_end": 589,
                            "transcript_start": 453
                        },
                        {
                            "cigar": "1210=1X1420=1D1740=", # Messy
                            "exon_number": 6,
                            "genomic_end": 145513536,
                            "genomic_start": 145509166,
                            "transcript_end": 4961,
                            "transcript_start": 590
                        }
                    ],
                    "orientation": 1,
                    "start_position": 145507557,
                    "total_exons": 6
                },
                "NC_000001.11": {  # GRCh38 for NM_005105.4
                    "end_position": 145927536,
                    "exon_structure": [
                        {
                            "cigar": "177=",
                            "exon_number": 1,
                            "genomic_end": 145927536,
                            "genomic_start": 145927360,
                            "transcript_end": 177,
                            "transcript_start": 1
                        },
                        {
                            "cigar": "60=",
                            "exon_number": 2,
                            "genomic_end": 145927077,
                            "genomic_start": 145927018,
                            "transcript_end": 237,
                            "transcript_start": 178
                        },
                        {
                            "cigar": "78=",
                            "exon_number": 3,
                            "genomic_end": 145926886,
                            "genomic_start": 145926809,
                            "transcript_end": 315,
                            "transcript_start": 238
                        },
                        {
                            "cigar": "137=",
                            "exon_number": 4,
                            "genomic_end": 145926618,
                            "genomic_start": 145926482,
                            "transcript_end": 452,
                            "transcript_start": 316
                        },
                        {
                            "cigar": "137=",
                            "exon_number": 5,
                            "genomic_end": 145926177,
                            "genomic_start": 145926041,
                            "transcript_end": 589,
                            "transcript_start": 453
                        },
                        {
                            "cigar": "4372=",
                            "exon_number": 6,
                            "genomic_end": 145925927,
                            "genomic_start": 145921556,
                            "transcript_end": 4961,
                            "transcript_start": 590
                        }
                    ],
                    "orientation": -1,
                    "start_position": 145921556,
                    "total_exons": 6
                },
                "NG_032654.2": {
                    "end_position": 10981,
                    "exon_structure": [
                        {
                            "cigar": "177=",
                            "exon_number": 1,
                            "genomic_end": 5177,
                            "genomic_start": 5001,
                            "transcript_end": 177,
                            "transcript_start": 1
                        },
                        {
                            "cigar": "60=",
                            "exon_number": 2,
                            "genomic_end": 5519,
                            "genomic_start": 5460,
                            "transcript_end": 237,
                            "transcript_start": 178
                        },
                        {
                            "cigar": "78=",
                            "exon_number": 3,
                            "genomic_end": 5728,
                            "genomic_start": 5651,
                            "transcript_end": 315,
                            "transcript_start": 238
                        },
                        {
                            "cigar": "137=",
                            "exon_number": 4,
                            "genomic_end": 6055,
                            "genomic_start": 5919,
                            "transcript_end": 452,
                            "transcript_start": 316
                        },
                        {
                            "cigar": "137=",
                            "exon_number": 5,
                            "genomic_end": 6496,
                            "genomic_start": 6360,
                            "transcript_end": 589,
                            "transcript_start": 453
                        },
                        {
                            "cigar": "4372=",
                            "exon_number": 6,
                            "genomic_end": 10981,
                            "genomic_start": 6610,
                            "transcript_end": 4961,
                            "transcript_start": 590
                        }
                    ],
                    "orientation": 1,
                    "start_position": 5001,
                    "total_exons": 6
                }
            },
            "length": 4961,
            "reference": "NM_005105.4",
            "translation": "NP_005096.1"
        },
        {
            "annotations": {
                "chromosome": "1",
                "db_xref": {
                    "CCDS": "CCDS72872.1",
                    "ensemblgene": null,
                    "hgnc": "HGNC:9905",
                    "ncbigene": "9939",
                    "select": "RefSeq"
                },
                "ensembl_select": false,
                "mane_plus_clinical": false,
                "mane_select": false,
                "map": "1q21.1",
                "note": "RNA binding motif protein 8A",
                "refseq_select": true,
                "variant": "0"
            },
            "coding_end": 635,
            "coding_start": 111,
            "description": "Homo sapiens RNA binding motif protein 8A (RBM8A), mRNA",
            "genomic_spans": {
                "LRG_574": {
                    "end_position": 10981,
                    "exon_structure": [
                        {
                            "cigar": "177=",
                            "exon_number": 1,
                            "genomic_end": 5177,
                            "genomic_start": 5001,
                            "transcript_end": 177,
                            "transcript_start": 1
                        },
                        {
                            "cigar": "60=",
                            "exon_number": 2,
                            "genomic_end": 5519,
                            "genomic_start": 5460,
                            "transcript_end": 237,
                            "transcript_start": 178
                        },
                        {
                            "cigar": "78=",
                            "exon_number": 3,
                            "genomic_end": 5728,
                            "genomic_start": 5651,
                            "transcript_end": 315,
                            "transcript_start": 238
                        },
                        {
                            "cigar": "137=",
                            "exon_number": 4,
                            "genomic_end": 6055,
                            "genomic_start": 5919,
                            "transcript_end": 452,
                            "transcript_start": 316
                        },
                        {
                            "cigar": "137=",
                            "exon_number": 5,
                            "genomic_end": 6496,
                            "genomic_start": 6360,
                            "transcript_end": 589,
                            "transcript_start": 453
                        },
                        {
                            "cigar": "4372=",
                            "exon_number": 6,
                            "genomic_end": 10981,
                            "genomic_start": 6610,
                            "transcript_end": 4961,
                            "transcript_start": 590
                        }
                    ],
                    "orientation": 1,
                    "start_position": 5001,
                    "total_exons": 6
                },
                "NG_032654.2": {
                    "end_position": 10981,
                    "exon_structure": [
                        {
                            "cigar": "177=",
                            "exon_number": 1,
                            "genomic_end": 5177,
                            "genomic_start": 5001,
                            "transcript_end": 177,
                            "transcript_start": 1
                        },
                        {
                            "cigar": "60=",
                            "exon_number": 2,
                            "genomic_end": 5519,
                            "genomic_start": 5460,
                            "transcript_end": 237,
                            "transcript_start": 178
                        },
                        {
                            "cigar": "78=",
                            "exon_number": 3,
                            "genomic_end": 5728,
                            "genomic_start": 5651,
                            "transcript_end": 315,
                            "transcript_start": 238
                        },
                        {
                            "cigar": "137=",
                            "exon_number": 4,
                            "genomic_end": 6055,
                            "genomic_start": 5919,
                            "transcript_end": 452,
                            "transcript_start": 316
                        },
                        {
                            "cigar": "137=",
                            "exon_number": 5,
                            "genomic_end": 6496,
                            "genomic_start": 6360,
                            "transcript_end": 589,
                            "transcript_start": 453
                        },
                        {
                            "cigar": "4372=",
                            "exon_number": 6,
                            "genomic_end": 10981,
                            "genomic_start": 6610,
                            "transcript_end": 4961,
                            "transcript_start": 590
                        }
                    ],
                    "orientation": 1,
                    "start_position": 5001,
                    "total_exons": 6
                }
            },
            "length": 4961,
            "reference": "LRG_574t1",
            "translation": "LRG_574p1"
        },
        {
            "annotations": {
                "chromosome": "1",
                "db_xref": {
                    "CCDS": "CCDS916.1",
                    "HPRD": "05609",
                    "ensemblgene": null,
                    "hgnc": "HGNC:9905",
                    "ncbigene": "9939",
                    "select": false
                },
                "ensembl_select": false,
                "mane_plus_clinical": false,
                "mane_select": false,
                "map": "1q21.1",
                "note": "RNA binding motif protein 8A",
                "refseq_select": false,
                "variant": "0"
            },
            "coding_end": 635,
            "coding_start": 111,
            "description": "Homo sapiens RNA binding motif protein 8A (RBM8A), mRNA",
            "genomic_spans": {
                "NC_000001.10": {
                    "end_position": 145513536,
                    "exon_structure": [
                        {
                            "cigar": "177=",
                            "exon_number": 1,
                            "genomic_end": 145507733,
                            "genomic_start": 145507557,
                            "transcript_end": 177,
                            "transcript_start": 1
                        },
                        {
                            "cigar": "60=",
                            "exon_number": 2,
                            "genomic_end": 145508075,
                            "genomic_start": 145508016,
                            "transcript_end": 237,
                            "transcript_start": 178
                        },
                        {
                            "cigar": "78=",
                            "exon_number": 3,
                            "genomic_end": 145508284,
                            "genomic_start": 145508207,
                            "transcript_end": 315,
                            "transcript_start": 238
                        },
                        {
                            "cigar": "137=",
                            "exon_number": 4,
                            "genomic_end": 145508611,
                            "genomic_start": 145508475,
                            "transcript_end": 452,
                            "transcript_start": 316
                        },
                        {
                            "cigar": "137=",
                            "exon_number": 5,
                            "genomic_end": 145509052,
                            "genomic_start": 145508916,
                            "transcript_end": 589,
                            "transcript_start": 453
                        },
                        {
                            "cigar": "4371=",
                            "exon_number": 6,
                            "genomic_end": 145513536,
                            "genomic_start": 145509166,
                            "transcript_end": 4960,
                            "transcript_start": 590
                        }
                    ],
                    "orientation": 1,
                    "start_position": 145507557,
                    "total_exons": 6
                },
                "NG_032654.1": {
                    "end_position": 10980,
                    "exon_structure": [
                        {
                            "cigar": "177=",
                            "exon_number": 1,
                            "genomic_end": 5177,
                            "genomic_start": 5001,
                            "transcript_end": 177,
                            "transcript_start": 1
                        },
                        {
                            "cigar": "60=",
                            "exon_number": 2,
                            "genomic_end": 5519,
                            "genomic_start": 5460,
                            "transcript_end": 237,
                            "transcript_start": 178
                        },
                        {
                            "cigar": "78=",
                            "exon_number": 3,
                            "genomic_end": 5728,
                            "genomic_start": 5651,
                            "transcript_end": 315,
                            "transcript_start": 238
                        },
                        {
                            "cigar": "137=",
                            "exon_number": 4,
                            "genomic_end": 6055,
                            "genomic_start": 5919,
                            "transcript_end": 452,
                            "transcript_start": 316
                        },
                        {
                            "cigar": "137=",
                            "exon_number": 5,
                            "genomic_end": 6496,
                            "genomic_start": 6360,
                            "transcript_end": 589,
                            "transcript_start": 453
                        },
                        {
                            "cigar": "4371=",
                            "exon_number": 6,
                            "genomic_end": 10980,
                            "genomic_start": 6610,
                            "transcript_end": 4960,
                            "transcript_start": 590
                        }
                    ],
                    "orientation": 1,
                    "start_position": 5001,
                    "total_exons": 6
                }
            },
            "length": 4973,
            "reference": "NM_005105.3",
            "translation": "NP_005096.1"
        },
        {
            "annotations": {
                "chromosome": "1",
                "db_xref": {
                    "CCDS": "CCDS72872.1",
                    "ensemblgene": null,
                    "hgnc": "HGNC:9905",
                    "ncbigene": "9939",
                    "select": "MANE"
                },
                "ensembl_select": false,
                "mane_plus_clinical": false,
                "mane_select": true,
                "map": "1q21.1",
                "note": "RNA binding motif protein 8A",
                "refseq_select": true,
                "variant": "0"
            },
            "coding_end": 583,
            "coding_start": 59,
            "description": "Homo sapiens RNA binding motif protein 8A (RBM8A), mRNA",
            "genomic_spans": {
                "NC_000001.10": {
                    "end_position": 145513536,
                    "exon_structure": [
                        {
                            "cigar": "125=",
                            "exon_number": 1,
                            "genomic_end": 145507733,
                            "genomic_start": 145507609,
                            "transcript_end": 125,
                            "transcript_start": 1
                        },
                        {
                            "cigar": "60=",
                            "exon_number": 2,
                            "genomic_end": 145508075,
                            "genomic_start": 145508016,
                            "transcript_end": 185,
                            "transcript_start": 126
                        },
                        {
                            "cigar": "78=",
                            "exon_number": 3,
                            "genomic_end": 145508284,
                            "genomic_start": 145508207,
                            "transcript_end": 263,
                            "transcript_start": 186
                        },
                        {
                            "cigar": "137=",
                            "exon_number": 4,
                            "genomic_end": 145508611,
                            "genomic_start": 145508475,
                            "transcript_end": 400,
                            "transcript_start": 264
                        },
                        {
                            "cigar": "137=",
                            "exon_number": 5,
                            "genomic_end": 145509052,
                            "genomic_start": 145508916,
                            "transcript_end": 537,
                            "transcript_start": 401
                        },
                        {
                            "cigar": "1210=1X1420=1D1740=",   # Messy
                            "exon_number": 6,
                            "genomic_end": 145513536,
                            "genomic_start": 145509166,
                            "transcript_end": 4909,
                            "transcript_start": 538
                        }
                    ],
                    "orientation": 1,
                    "start_position": 145507609,
                    "total_exons": 6
                },
                "NC_000001.11": { # GRCh38 for NM_005105.5
                    "end_position": 145927484,
                    "exon_structure": [
                        {
                            "cigar": "125=",
                            "exon_number": 1,
                            "genomic_end": 145927484,
                            "genomic_start": 145927360,
                            "transcript_end": 125,
                            "transcript_start": 1
                        },
                        {
                            "cigar": "60=",
                            "exon_number": 2,
                            "genomic_end": 145927077,
                            "genomic_start": 145927018,
                            "transcript_end": 185,
                            "transcript_start": 126
                        },
                        {
                            "cigar": "78=",
                            "exon_number": 3,
                            "genomic_end": 145926886,
                            "genomic_start": 145926809,
                            "transcript_end": 263,
                            "transcript_start": 186
                        },
                        {
                            "cigar": "137=",
                            "exon_number": 4,
                            "genomic_end": 145926618,
                            "genomic_start": 145926482,
                            "transcript_end": 400,
                            "transcript_start": 264
                        },
                        {
                            "cigar": "137=",
                            "exon_number": 5,
                            "genomic_end": 145926177,
                            "genomic_start": 145926041,
                            "transcript_end": 537,
                            "transcript_start": 401
                        },
                        {
                            "cigar": "4372=",
                            "exon_number": 6,
                            "genomic_end": 145925927,
                            "genomic_start": 145921556,
                            "transcript_end": 4909,
                            "transcript_start": 538
                        }
                    ],
                    "orientation": -1,
                    "start_position": 145921556,
                    "total_exons": 6
                }
            },
            "length": 4909,
            "reference": "NM_005105.5",
            "translation": "NP_005096.1"
        }
    ]
}

So there is a sequence disparity between GRCh37 and GRCh38 by the look of things, so next step is to compare the .3, .4 and .5 versions of the transcript

Peter-J-Freeman commented 1 year ago

Can't see any sequence difference between .4 and 5, or gaps. This seems to agree with the expectation

image

WILL EDIT THIS TOMORROW> NEED TO LEAVE FOR NOW!

Also confirmed with

NC_000001.10:g.145509010_145509016dup

NM_005105.3:c.437_443= > NC_000001.11:g.145926083_145926087delinsGTCAA NM_005105.4:c.437_443= > NC_000001.11:g.145926077_145926083= NM_005105.5:c.437_443= > NC_000001.11:g.145926077_145926083=

So the sequence between NC_000001.10 and all transcripts is equal but something is odd about the alignment for the .3 version of the transcript

ifokkema commented 1 year ago

Thank you for diving into it! Is it possible that somewhere internally, VV copies positions over and doesn't look at the sequence? I'm sure there are alignment issues, but that this causes a dup to change to an ins doesn't worry me. What worries me is that the ALT is completely different. It just creates a whole new variant. Just like the NC_000001.10:g.145473384G>A lifting over to a C>T substitution on hg38.

Peter-J-Freeman commented 1 year ago

Thanks @ifokkema. So you are saying that the alternate sequence of the insertion in NC_000001.11:g.145926083_145926084insGTCAACG may also be incorrect??? I am adding this to the list of things to check.

If so, I suspect a strange blend of aligned sequence difference between NM_005105.3 and NC_000001.11 combined with describing the variant at its most 3 prime position (normalization). This has bitten me before. Will look again mon morning

Peter-J-Freeman commented 1 year ago

OK going to check the reference sequences for NM_005105.3:c.437_443= and NC_000001.11:g.145926083_145926087delinsGTCAA hgvs2reference tool on the API

NM_005105.3:c.437_443=

{
  "end_position": "443",
  "error": "",
  "sequence": "TTGACTG",
  "start_position": "437",
  "variant": "NM_005105.3:c.437_443=",
  "warning": ""
}

NC_000001.11:g.145926083_145926087delinsGTCAA

{
  "end_position": "145926087",
  "error": "",
  "sequence": "ACGCT",
  "start_position": "145926083",
  "variant": "NC_000001.11:g.145926083_145926087delinsGTCAA",
  "warning": ""
}

NC_000001.10:g.145509010_145509016dup

{
  "end_position": "145509016",
  "error": "",
  "sequence": "TTGACTG",
  "start_position": "145509010",
  "variant": "NC_000001.10:g.145509010_145509016dup",
  "warning": ""
}

NM_005105.3:c.437_443dup

{
  "end_position": "443",
  "error": "",
  "sequence": "TTGACTG",
  "start_position": "437",
  "variant": "NM_005105.3:c.437_443dup",
  "warning": ""
}

NC_000001.11:g.145926083_145926084insGTCAACG

{
  "end_position": "145926084",
  "error": "",
  "sequence": "AC",
  "start_position": "145926083",
  "variant": "NC_000001.11:g.145926083_145926084insGTCAACG",
  "warning": ""
}
Peter-J-Freeman commented 1 year ago

Right, I have a feeling that what is happening here might be that the ucsc liftover tool is trying to do a direct lift between NC_000001.10 and NC_000001.11 rather than lifting via a transcript. NM_005105.3 does not align to NC_000001.11.

This was something that I was worried about when we were asked to implement direct liftover even if the selected transcript does not map to the alternative build. Note, @ifokkema , you were one user who asked for it ;P lol.

Going to check this theory now. If it is the case, this is useful so we can filter out the occasions where things go a little werid. Hang fire

Peter-J-Freeman commented 1 year ago

OK, this is complex. The reason is that this is a direct lift over between genome builds where the region of interest in GRCh37 is inverted compared to GRCh38. Its causing me headaches already!

leicray commented 1 year ago

NM_005105.3 does not align to NC_000001.11.

What do you mean by "does not align"? I was expecting that NM_005105.3 must be seriously different in sequence compared with NM_005105.5. However, pairwise alignment of the two versions shows that they differ only by one single-nucleotide substitution and by one single-nucleotide gap.

Peter-J-Freeman commented 1 year ago

There is no mapping for GRCh38 for NM_005105.3

Peter-J-Freeman commented 1 year ago

The issue is not to do with transcript to genome alignment, it is genome to genome alignment

leicray commented 1 year ago

The lack of a mapping is probably an omission rather than an alignment being an impossibility.

Peter-J-Freeman commented 1 year ago

Yes, that is correct, but because it is not available we cannot use it so we have to use direct liftover. We only ever use official alignments. It is not oversight as such it is RefSeq Policy

Peter-J-Freeman commented 1 year ago

They do not align old transcripts to GRCh38

Peter-J-Freeman commented 1 year ago

OK, after looking at this all day, I think the only safe thing to do is to simply not return liftovers in instances where the direct lift from genome to genome hits an inverted sequence. It is likely possible that it could be done, but would take a long time and many variants to get it right, and to be honest I do not think it is sufficiently common to warrant the effort.

Peter-J-Freeman commented 1 year ago

This will now look as follows

import json
import VariantValidator
vval = VariantValidator.Validator()
# variant = 'NC_000001.10:g.145509010_145509016dup' # variant 1
variant = 'NC_000001.10:g.145509010_145509016dup'
genome_build = 'GRCh37'
select_transcripts = 'NM_005105.3'
transcript_set = 'refseq'
validate = vval.validate(variant, genome_build, select_transcripts, transcript_set)
validation = validate.format_as_dict(with_meta=True)
print(json.dumps(validation, sort_keys=True, indent=4, separators=(',', ': ')))
{
    "NM_005105.3:c.437_443dup": {
        "alt_genomic_loci": [],
        "annotations": {
            "chromosome": "1",
            "db_xref": {
                "CCDS": "CCDS916.1",
                "HPRD": "05609",
                "ensemblgene": null,
                "hgnc": "HGNC:9905",
                "ncbigene": "9939",
                "select": false
            },
            "ensembl_select": false,
            "mane_plus_clinical": false,
            "mane_select": false,
            "map": "1q21.1",
            "note": "RNA binding motif protein 8A",
            "refseq_select": false,
            "variant": "0"
        },
        "gene_ids": {
            "ccds_ids": [
                "CCDS72872"
            ],
            "ensembl_gene_id": "ENSG00000265241",
            "entrez_gene_id": "9939",
            "hgnc_id": "HGNC:9905",
            "omim_id": [
                "605313"
            ],
            "ucsc_id": "uc031uto.3"
        },
        "gene_symbol": "RBM8A",
        "genome_context_intronic_sequence": "",
        "hgvs_lrg_transcript_variant": "",
        "hgvs_lrg_variant": "",
        "hgvs_predicted_protein_consequence": {
            "lrg_slr": "LRG_574p1:p.(W148Cfs*2)",
            "lrg_tlr": "LRG_574p1:p.(Trp148CysfsTer2)",
            "slr": "NP_005096.1:p.(W148Cfs*2)",
            "tlr": "NP_005096.1:p.(Trp148CysfsTer2)"
        },
        "hgvs_refseqgene_variant": "NG_032654.1:g.6454_6460dup",
        "hgvs_transcript_variant": "NM_005105.3:c.437_443dup",
        "primary_assembly_loci": {
            "grch37": {
                "hgvs_genomic_description": "NC_000001.10:g.145509010_145509016dup",
                "vcf": {
                    "alt": "CGTTGACT",
                    "chr": "1",
                    "pos": "145509008",
                    "ref": "C"
                }
            },
            "hg19": {
                "hgvs_genomic_description": "NC_000001.10:g.145509010_145509016dup",
                "vcf": {
                    "alt": "CGTTGACT",
                    "chr": "chr1",
                    "pos": "145509008",
                    "ref": "C"
                }
            }
        },
        "reference_sequence_records": {
            "protein": "https://www.ncbi.nlm.nih.gov/nuccore/NP_005096.1",
            "refseqgene": "https://www.ncbi.nlm.nih.gov/nuccore/NG_032654.1",
            "transcript": "https://www.ncbi.nlm.nih.gov/nuccore/NM_005105.3"
        },
        "refseqgene_context_intronic_sequence": "",
        "rna_variant_descriptions": null,
        "selected_assembly": "GRCh37",
        "submitted_variant": "NC_000001.10:g.145509010_145509016dup",
        "transcript_description": "Homo sapiens RNA binding motif protein 8A (RBM8A), mRNA",
        "validation_warnings": [
            "A more recent version of the selected reference sequence NM_005105.3 is available (NM_005105.5): NM_005105.5:c.437_443dup MUST be fully validated prior to use in reports: select_variants=NM_005105.5:c.437_443dup"
        ],
        "variant_exonic_positions": {
            "NC_000001.10": {
                "end_exon": "5",
                "start_exon": "5"
            },
            "NG_032654.1": {
                "end_exon": "5",
                "start_exon": "5"
            }
        }
    },
    "flag": "gene_variant",
    "metadata": {
        "variantvalidator_hgvs_version": "2.0.2.dev5+g69b1a7c",
        "variantvalidator_version": "2.1.1.dev53+g44ed2dd.d20221121",
        "vvdb_version": "vvdb_2022_11",
        "vvseqrepo_db": "VV_SR_2022_11/master",
        "vvta_version": "vvta_2022_11"
    }
}
ifokkema commented 1 year ago

This was something that I was worried about when we were asked to implement direct liftover even if the selected transcript does not map to the alternative build. Note, @ifokkema , you were one user who asked for it ;P lol.

Haha, I wouldn't be surprised, we'd definitely need that feature. But that doesn't mean the sequence shouldn't be checked, Mutalyer2-style :laughing:

OK, after looking at this all day, I think the only safe thing to do is to simply not return liftovers in instances where the direct lift from genome to genome hits an inverted sequence. It is likely possible that it could be done, but would take a long time and many variants to get it right, and to be honest I do not think it is sufficiently common to warrant the effort.

I'm wondering if perhaps we could automatically determine that sequence differences between different versions of transcripts are minor or even absent, so we could copy the alignment data from newer versions back to older versions that are missing an alignment? In this case, this transcript is the only one we have for this gene. Do you have any suggestions on how we could determine the hg38 descriptions for our variants? Perhaps otherwise by first enforcing a manual switch to a higher version of the transcript? Or attempting to map all hg19 descriptions to the new transcript, and if the DNA descriptions are the same as we currently have, just update the transcript automatically? :thinking:

Peter-J-Freeman commented 1 year ago

I'm wondering if perhaps we could automatically determine that sequence differences between different versions of transcripts are minor or even absent, so we could copy the alignment data from newer versions back to older versions that are missing an alignment? In this case, this transcript is the only one we have for this gene. Do you have any suggestions on how we could determine the hg38 descriptions for our variants? Perhaps otherwise by first enforcing a manual switch to a higher version of the transcript? Or attempting to map all hg19 descriptions to the new transcript, and if the DNA descriptions are the same as we currently have, just update the transcript automatically? šŸ¤”

I do not particularly want to use internal alignment strategies, but rather stick to official alignments from NIH as we do now. That being said, it might be possible to identify instances where the genome to genome lift-over is aligned within an inverted sequence, to go from transcript to genome to higher transcript, then lift over. However, the caveat is that the alignment would come from the higher version transcript, but that is at least an official alignment. So for this particular variant, the workflow would be

NC_000001.10:g.145509010_145509016dup > NM_005105.3:c.437_443dup > NC_000001.10:g.145509010_145509016dup > NM_005105.4:c.437_443dup > NC_000001.11:g.145926077_145926083dup

This assumes the input description is the genomic description. If transcript is is

NM_005105.3:c.437_443dup > NC_000001.10:g.145509010_145509016dup > NM_005105.4:c.437_443dup > NC_000001.11:g.145926077_145926083dup

FYI, it is not that the sequence is not checked. The sequence is checked and was accurate. However, because of the technical difficulties with inversed alignments, the code that put together the sequence got it wrong šŸ˜†. This is the first time I have seen this and it is way more complicated than you would think. So, the above solution is most likely to be the one that works.

I will re-open and wait for @ifokkema and @leicray to say if this solution is acceptable. If so, I can likely get it done pretty quickly

leicray commented 1 year ago

I agree with the idea that we should stick with official alignments from NIH. That way, nobody can argue with how the results are generated. A VV-specific solution might be insufficiently universal or robust and might potentially harm VV's well-earned reputation.

Another thing to consider is the span of the inversion. The example being discussed here is presumably an inversion which has breakpoints outside of the gene. In other words, the entire gene is inverted relative to its true orientation. In principle, a relatively simple solution could be devised to handle this.

However, might there be non-simple inversions in GRCh37 for which the breakpoints are not tidily outside the gene? The IL12RB2 gene had an intra-genic inversion in an early draft version:

IL12RB2

It's perhaps instructive to look at the three versions of NM_001559 at https://www.ncbi.nlm.nih.gov/nuccore/NM_001559.3

I'm not suggesting that such glaring mis-assemblies persist in GRCh37, but minor comparable issues might persist that will be difficult to handle in a simple fashion.

Peter-J-Freeman commented 1 year ago

Agreed. There are certainly 2 ways to go, but I will have to admit that just trying to figure out how we might lift from genome to genome in these instances via a g.hgvs > vcf > vcf > g.hgvs conversion and update the sequence (which is more tricky than you might think because we go 3-prime normalization > 5 prime normalization > 5 prime normalization > 3-prime normalization as well)........

Yeah, it was mind boggling. I think it will be safer and more reliable to go g.hgvs > c.hgvs > g.hgvs > c.hgvs > g.hgvs-lifted. Also, going this way, the sequence is much easier to test and validate each time as it used internal functions rather than the ucsc liftover.

ifokkema commented 1 year ago

NC_000001.10:g.145509010_145509016dup > NM_005105.3:c.437_443dup > NC_000001.10:g.145509010_145509016dup > NM_005105.4:c.437_443dup > NC_000001.11:g.145926077_145926083dup

This assumes the input description is the genomic description. If transcript is is

NM_005105.3:c.437_443dup > NC_000001.10:g.145509010_145509016dup > NM_005105.4:c.437_443dup > NC_000001.11:g.145926077_145926083dup

That's excellent!

I think it will be safer and more reliable to go g.hgvs > c.hgvs > g.hgvs > c.hgvs > g.hgvs-lifted. Also, going this way, the sequence is much easier to test and validate each time as it used internal functions rather than the ucsc liftover.

Totally agree!

Peter-J-Freeman commented 1 year ago

Won't be this next release. But will do this ASAP. Bit slammed at the moment

ifokkema commented 1 year ago

No worries! I'm not waiting for it! Not even this year, to be honest. Plenty of other stuff on my list, too...