openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0
67 stars 21 forks source link

Exon numbering - Gather user requirements #251

Closed i3hsInnovation closed 2 years ago

i3hsInnovation commented 3 years ago

We have had several requests to provide exon numbering. I have always avoided this because it is a total mine field

Although this is a seemingly trivial thing to do but it is actually hugely complicated because of how people like to number exons. For example,

I would like to gather user requirements around this.

vidboda commented 3 years ago

Hi,

First, what do you mean by "Classical numbering"?

And between the 2 others, I strongly believe that the useful one would be "Exon in the context of individual transcripts", at least for the medical practice, which must be a major use case of VariantValidator. At MobiDetails, we would be very very happy to get this one from VV.

i3hsInnovation commented 3 years ago

I would argue in counter that numbering in the context of individual transcripts is the most confusing and potentially the most dangerous and must be handled very carefully. The reason is that for many genes an exon at the chromosomal (DNA) level will be differentially used in different transcripts. Therefore it could be exon 4 in the context of transcript A and exon 2 in the context of Transcript B. Clinicians have limited time to look at reposts and I have seen this mistake lead to confusion.

I agree that transcript by transcript exon numbering is useful but in my opinion we also need a gene-based numbering system where each exon in the gene at the DNA is numbered to prevent confusion.

I would also like for each transcript to have exon 2 of 20 or something so we know how many exons are remaining. This might also feed into tooling that predicts whether NMD is likely or not.

Sorry for the idea spurge. Busy day!

i3hsInnovation commented 3 years ago

Raymond I'm sure con comment on classic numbering of exons

vidboda commented 3 years ago

Therefore it could be exon 4 in the context of transcript A and exon 2 in the context of Transcript B. Clinicians have limited time to look at reposts and I have seen this mistake lead to confusion.

Of course and that is the main point of "transcript numbering of exons". If the transcript number is provided, I tend ton consider it as less confusing than "gene numbering of exons". For example in this case you can encounter an exon 25 in a transcript which contains less than 25 exons (say if the gene has 50 exons, and transcript B begins at exon 20 and finishes at exon 30). How would you report exon 25 of 50 for a transcript that contains 10 exons?

"Transcript numbering of exons" is more widely used and understood, in my opinion (UCSC for example).

And the idea of providing the total number of exons (of the transcript!! :) ) is excellent.

i3hsInnovation commented 3 years ago

At the transcript level I entirely agree, but I have seen many mistakes and instances of confusion made by healthcare trainees and clinical geneticists alike who cannot appreciate that gene based exon numbering and transcript based exon numbering gives different results. Rather than ignoring this, I would like to think about tackling it.

I'm thinking we should display an exon number based in the location in the transcript. I agree that the exon x of y would also be very useful information at this level. So a structure like

{"exonic_location": 
    {"transcript_level": 
        {"exon_intron_number": "X",
         "total_exons": "Y",
         "total_introns: "Z"
         },
     "genetic_level": 
        {"exon_intron_number": "X",
         "total_exons": "Y",
         "total_introns: "Z"
         }
    }
}

I think this will avoid confusion and help with navigation. Certainly in my experience as a trainer this would be useful. For example, BRCA transcript variants have been described as delta9-11. This means a transcript that lacks exons 9, 10 and 11 at the DNA level. Since a lot of this data still persists in the literature, it would be great to have a way of helping users navigate this. Including me :)

i3hsInnovation commented 3 years ago

Although exon numbering at the DNA level can be tricky especially if you use Ensembl data which we intend to support soon. Definately needs some thought. Perhaps I'm over-complicating it.

Any other thoughts please.

ifokkema commented 3 years ago

Hi guys,

My 2 cents:

  • Exon in the context of individual transcripts

This is the most common.

  • Exon in the context of the whole gene

This is rarely used.

  • Classical numbering

This should be abolished :wink:

(...) I strongly believe that the useful one would be "Exon in the context of individual transcripts"

Agreed.

I would argue in counter that numbering in the context of individual transcripts is the most confusing and potentially the most dangerous and must be handled very carefully. The reason is that for many genes an exon at the chromosomal (DNA) level will be differentially used in different transcripts. Therefore it could be exon 4 in the context of transcript A and exon 2 in the context of Transcript B. Clinicians have limited time to look at reposts and I have seen this mistake lead to confusion.

I would say the opposite is true; in biology, rarely it's this simple. Exons may have the same start but not the same end in every transcript, or the same end but not the same start. If you find a new transcript, you might need to relabel all the exons, which adds to even more confusion (and what kind of proof would you need before you will consider this transcript? and will you consider all RefSeq or also Ensembl transcripts?) You would need very clear rules for this naming, and honestly, clinicians who would use this numbering, apparently don't know which transcripts are relevant, and thus wouldn't (or shouldn't?) really use the exon numbering anyway.

I agree that transcript by transcript exon numbering is useful but in my opinion we also need a gene-based numbering system where each exon in the gene at the DNA is numbered to prevent confusion.

I doubt it'll prevent confusion, it'll just add confusion to the far majority not using this numbering.

Therefore it could be exon 4 in the context of transcript A and exon 2 in the context of Transcript B. Clinicians have limited time to look at reposts and I have seen this mistake lead to confusion.

Of course and that is the main point of "transcript numbering of exons". If the transcript number is provided, I tend to consider it as less confusing than "gene numbering of exons". For example in this case you can encounter an exon 25 in a transcript that contains less than 25 exons (say if the gene has 50 exons, and transcript B begins at exon 20 and finishes at exon 30). How would you report exon 25 of 50 for a transcript that contains 10 exons?

And worse, this may suddenly become 26/51 because some new transcript has been found, while the transcript that's relevant isn't changing at all.

"Transcript numbering of exons" is more widely used and understood, in my opinion (UCSC for example).

I totally agree.

At the transcript level I entirely agree, but I have seen many mistakes and instances of confusion made by healthcare trainees and clinical geneticists alike who cannot appreciate that gene based exon numbering and transcript based exon numbering gives different results.

I think their training needs to be improved rather than the existing experts confused.

{"exonic_location": 
    {"transcript_level": 
        {"exon_intron_number": "X",
         "total_exons": "Y",
         "total_introns: "Z"
         },
     "genetic_level": 
        {"exon_intron_number": "X",
         "total_exons": "Y",
         "total_introns: "Z"
         }
    }
}

Total introns is always total exons minus 1, so doesn't need to be stored separately. We do need to think about how to indicate whether the variant is in the exon or the intron. Some LOVD curators use "4i" to indicate "intron 4", which lies after exon 4.

For example, BRCA transcript variants have been described as delta9-11. This means a transcript that lacks exons 9, 10 and 11 at the DNA level. Since a lot of this data still persists in the literature, it would be great to have a way of helping users navigate this. Including me :)

Sure, but do we have these descriptions using gene-based numbering, which completely depends on which transcripts were known at the time? You can't even interpret these if they're gene-based.

i3hsInnovation commented 3 years ago

Haha. 2 for 1 in favour of transcript level numbering only. Perhaps I am being a little over cautious ;P

I think that the argument I tripped myself on, which @ifokkema expanded on is that at the DNA level the exon number seems to be in constant flux, so would be entirely dependant on database version.

From my point of view, dropping gene level numbering would be much easier :)

I think their training needs to be improved rather than the existing experts confused

Supposed experts are always confused in my experience and have you ever tried training one??? Its a nightmare

i3hsInnovation commented 3 years ago

To clarify the structure, we seem to be leaning towards

{"exonic_location": 
        {"exon_number": "X", # or ("Xi" - Which I think is neat)
         "total_exons": "Y",
         },
}
ifokkema commented 3 years ago

From my point of view, dropping gene level numbering would be much easier :)

Let's :wink:

I think their training needs to be improved rather than the existing experts confused

Supposed experts are always confused in my experience and have you ever tried training one??? Its a nightmare

Hahaha :laughing: I'm sure you have interesting examples - I only know the "pleaaaaase stop using your spreadsheets and just work directly in the database" and the "please drop your legacy exon numbering... :roll_eyes: " discussions :laughing:

To clarify the structure, we seem to be leaning towards

{"exonic_location": 
        {"exon_number": "X", # or ("Xi" - Which I think is neat)
         "total_exons": "Y",
         },
}

If you like the "Xi", we could then even say:

{
    "exonic_location": "X/Y"
},
{
    "exonic_location": "Xi/Y"
}

unless some tools really rather prefer making their own formatting and don't want to parse this output (@beboche ?). I'd drop everything from the / onwards I think, which is easy enough to do.

i3hsInnovation commented 3 years ago

I'm OK with either of these structures. @beboche , any objections to either?

vidboda commented 3 years ago

unless some tools really rather prefer making their own formatting and don't want to parse this output (@beboche ?). I'd drop everything from the / onwards I think, which is easy enough to do.

No problem for me to parse this. What about 5 and 3'UTRs?

vidboda commented 3 years ago

no pb with the proposed structures but maybe the most rigourous would look like

{"gene_region": 
        {
         "region_type": "exon/intron/(UTR?)",
         "region_number": "X",
         "total_exons": "Y",
         },
}
i3hsInnovation commented 3 years ago

These are just parts of exons, in my opinion. I have hopefully got some students building in some Sequence Ontology terms though that would define this -. i.e. something like 5prime_UTR_Variant (or whatever this ought to be)

One thought, I assume we take the start position of the variant?

What I mean is that is a deletion starts in exon 1 and finishes in exon 2 or intron 1, do we simply say exon 1 or do we create a start and end dictionary ????

ifokkema commented 3 years ago

No problem for me to parse this. What about 5 and 3'UTRs?

UTRs are exons. So to specially annotate UTRs and the CDS, will make this a different discussion. Then we'd also get close to protein domains, but that's a whole different field.

no pb with the proposed structures but maybe the most rigourous would look like

{"gene_region": 
        {
         "region_type": "exon/intron/(UTR?)",
         "region_number": "X",
         "total_exons": "Y",
         },
}

True (but without the UTR, I'd say), here everybody can choose to create their own format, with a bit more processing on our side.

One thought, I assume we take the start position of the variant?

Ah, good point, totally forgot about that.

What I mean is that is a deletion starts in exon 1 and finishes in exon 2 or intron 1, do we simply say exon 1 or do we create a start and end dictionary ????

Better split it, indeed! LOVD would use 5_6 for a variant spanning exons 5 through 6. A deletion of exons 5 and 6 will then be (strange, perhaps?) 4i_6i because the variant starts in the 4th intron and ends in the 6th intron.

vidboda commented 3 years ago

What I mean is that is a deletion starts in exon 1 and finishes in exon 2 or intron 1, do we simply say exon 1 or do we create a start and end dictionary ????

Right, the start and end dictionnary looks the better way.

i3hsInnovation commented 3 years ago

Great. I agree that UTRs are exons and we will tackle these with the Sequence Ontology terms then if that OK with all.

So, I think the final structure should be something like

Exonic

{"exonic_location": 
    {"start": "X/Y",
     "end": "X/Y"}
}

intronic

{"exonic_location": 
    {"start": "Xi/Y",
     "end": "Xi/Y"}
}

Think this is simple parsable and informative ?

vidboda commented 3 years ago

Think this is simple parsable and informative ?

Absolutely - can't wait for it!

i3hsInnovation commented 3 years ago

The good news is I may have a couple of volunteers wanting to put together the prototype v1 code for me as part of an assignment, so hopefully we can get it up and running pretty quickly.

Keep an eye on this thread in case they need to ask questions. Thanks for working this through @beboche and @ifokkema

ifokkema commented 3 years ago

Think this is simple parsable and informative ?

Perfect!

John-F-Wagstaff commented 3 years ago

Just to note that we are still going to have issues with alt mappings, even if we go with the above per transcript notation. The same transcript as mapped to differing alt versions of reference locations can produce different ideas of the exon positions, and possibly even differing ideas on the total number of exons. So I think we may need to include the target sequence for the alignment as well, or otherwise account for this.

To give an example NM_001113182.3 has 4 different exon position sets, in the current VariantValidatior Transcript Archive (VVTA) data, these come from 14 different target sequences from both GRCh37 and GRCh38, even the total number of exons is not the same, with the span of 268-1601 being either just exon 2 or 2 of 268-689 followed by 3 of 689-1601.

ifokkema commented 3 years ago

To give an example NM_001113182.3 has 4 different exon position sets, in the current VariantValidatior Transcript Archive (VVTA) data, these come from 14 different target sequences from both GRCh37 and GRCh38, even the total number of exons is not the same, with the span of 268-1601 being either just exon 2 or 2 of 268-689 followed by 3 of 689-1601.

That's really interesting! What is the source of these locations? The UCSC genome browser indicates only one location in both hg19 and hg38, and Variant Validator currently has only one position on hg19 and hg38 for this transcript's predecessor, NM_001113182.2. I would (carefully) suggest that we don't want these "alternative" locations?

i3hsInnovation commented 3 years ago

Good point. Thanks for also providing an example transcript.

I think the easy solution will be that we provide the exon numbering in the context of a genomic reference sequence NC_ for the selected genome build where possible.

Need to think about what to do in cases where the transcript only maps to an ALT.

In terms of creating the actual module, my recommendation would be to require a transcript id and a genomic id (which can be Chromosome NC, RefSeqGene NG, Alt NT or patch NW. I will then have to think about how to prioritise this when I plug it in.

In the mean time, keep the actual module simple.

i3hsInnovation commented 3 years ago

The UCSC genome browser indicates only one location in both hg19 and hg38

Not quite true. You can find the Alt loci etc on UCSC genome browser if you know their name for the scaffold.

All our data come directly from source. NCBI or Ensembl.

Some genes only map to ALT assemblies HLA-DRB4

Others like SHANK3 are aligned to GRCh37 and GRCh38 but these builds have exons missing, so it only fully aligns with an ALT.

I agree to ignore ALTS for most genes, but I'm expecting annoying edge cases during future revisions :)

i3hsInnovation commented 3 years ago

This is why we love genomics right?????

i3hsInnovation commented 3 years ago

And dare I say it that the Ensembl method for creating transcripts is more robust in this setting. One transcript version only ever has 1 alignment to 1 genomic reference. Period!!!

John-F-Wagstaff commented 3 years ago

Yes, and thanks Pete for the follow up. For this transcript the versions with the higher exon counts map to NT_167249.2 (or NT167249.1 for GRCh37) it would also take a bit more digging but I could probably get a list of things with different exon positions between NC sequences ie between GRCh38 and 37 if it would be particularly useful. I saw a few when working on this bit but filtering them out is a bit of a chore.

i3hsInnovation commented 3 years ago

Any example please dump here. Thanks for this one John. This is a really useful one to consider. I can't exactly remember how VV behaves in these instances.

What I will say is that I want to update the data structure

{"exonic_location": 
    {"transcript_id": "NM_......",
     "genomic_id": "NC_.......",
     "start": "X/Y",
     "end": "X/Y"}
}

May seem a little anal, but it will prevent any miss-identification of data and can be ignored by other platforms

ifokkema commented 3 years ago

I think the easy solution will be that we provide the exon numbering in the context of a genomic reference sequence NC_ for the selected genome build where possible.

Are we sure (although this might be pushing it) that all transcripts never map twice to the same NC? :stuck_out_tongue_closed_eyes:

Need to think about what to do in cases where the transcript only maps to an ALT.

In my case, I don't need these. I can't (nor want to) store them anyway.

The UCSC genome browser indicates only one location in both hg19 and hg38

Not quite true. You can find the Alt loci etc on UCSC genome browser if you know their name for the scaffold.

I saw those, but I'm so used to ignoring them that I didn't even think anymore they were part of the genome builds. But yeah, I guess they are!

I agree to ignore ALTS for most genes, but I'm expecting annoying edge cases during future revisions :)

LOVD will always ignore them - we only focus on the actual chromosomes.

And dare I say it that the Ensembl method for creating transcripts is more robust in this setting. One transcript version only ever has 1 alignment to 1 genomic reference. Period!!!

So all PAR genes will align only to X and you won't know where to Y they align?

{"exonic_location": 
    {"transcript_id": "NM_......",
     "genomic_id": "NC_.......",
     "start": "X/Y",
     "end": "X/Y"}
}

Where will this go? I suppose in NC_...g.... > hgvs_t_and_p > NM_.... > exonic_location? In that case, transcript_id is very redundant. Should genomic_id be a key instead? That's more in line with the rest of the API output.

i3hsInnovation commented 3 years ago

Are we sure (although this might be pushing it) that all transcripts never map twice to the same NC? ๐Ÿ˜

Ah, I was kind of stating this because they do map differently sometimes between GRCh37 and GRCh38 so we do need to know the genome build so we know which NC_to pass into the function. Hope that is clearer. Thanks

So all PAR genes will align only to X and you won't know where to Y they align?

Yet another complication. For X and Y we will need to make 2 dictionaries. HEEEEEEEELP!!!!!!!

For Ensembl, last I looked they create a separate gene entry to allow alignments to different chromosomes. So a PAR gene will I believe have 2 gene ENSG gene entries in Ensembl and each will have its own set of transcripts. That make sense?

i3hsInnovation commented 3 years ago

Agreed Where will this go? I suppose in NC_...g.... > hgvs_t_andp > NM.... > exonic_location? In that case, transcript_id is very redundant. Should genomic_id be a key instead? That's more in line with the rest of the API output.

Actually the way round this might be to embed the dictionary into section

    "primary_assembly_loci": {
      "grch37": {
        "hgvs_genomic_description": "NC_000017.10:g.48275363C>A",
        "vcf": {
          "alt": "A",
          "chr": "17",
          "pos": "48275363",
          "ref": "C"
        }
      },
      "grch38": {
        "hgvs_genomic_description": "NC_000017.11:g.50198002C>A",
        "vcf": {
          "alt": "A",
          "chr": "17",
          "pos": "50198002",
          "ref": "C"
        }
      },
      "hg19": {
        "hgvs_genomic_description": "NC_000017.10:g.48275363C>A",
        "vcf": {
          "alt": "A",
          "chr": "chr17",
          "pos": "48275363",
          "ref": "C"
        }
      },
      "hg38": {
        "hgvs_genomic_description": "NC_000017.11:g.50198002C>A",
        "vcf": {
          "alt": "A",
          "chr": "chr17",
          "pos": "50198002",
          "ref": "C"
        }
      }

Then under each we can just add the simple

{"exonic_location": 
    {"start": "X/Y",
     "end": "X/Y"}
}

dictionary.

This seems like a nice approach as well because we can quickly identify alignment differences between GRCh37 and 38. I will also add to section

"alt_genomic_loci": [],

This way all the data are available and can be used if one chooses to

i3hsInnovation commented 3 years ago

Example

    "primary_assembly_loci": {
      "grch37": {
        "hgvs_genomic_description": "NC_000017.10:g.48275363C>A",
        "vcf": {
          "alt": "A",
          "chr": "17",
          "pos": "48275363",
          "ref": "C"
        },
        "exonic_location": 
            {"start": "X/Y",
             "end": "X/Y"
        }, 

..... REST of JSON
ifokkema commented 3 years ago

Are we sure (although this might be pushing it) that all transcripts never map twice to the same NC? stuck_out_tongue_closed_eyes

Ah, I was kind of stating this because they do map differently sometimes between GRCh37 and GRCh38 so we do need to know the genome build so we know which NC_to pass into the function. Hope that is clearer. Thanks

I know, but if that one transcript had 14 places to align to, are we sure that's never on the same NC? I know that would be crazy, but well... we've seen crazy things already :sweat_smile:

So all PAR genes will align only to X and you won't know where to Y they align?

Yet another complication. For X and Y we will need to make 2 dictionaries. HEEEEEEEELP!!!!!!!

Haha :) But some of the VV endpoints already did this, right?

For Ensembl, last I looked they create a separate gene entry to allow alignments to different chromosomes. So a PAR gene will I believe have 2 gene ENSG gene entries in Ensembl and each will have its own set of transcripts. That make sense?

That would be data duplication... annoying :disappointed:

Example

Great! I guess we'll always have mapping to the genome, right? We don't have endpoints processing NMs without mapping? If so, you could fallback using the annotation from the NM itself (which also contains the exons annotated), right?

i3hsInnovation commented 3 years ago

I know, but if that one transcript had 14 places to align to, are we sure that's never on the same NC? I know that would be crazy, but well... we've seen crazy things already ๐Ÿ˜…

Expect the unexpected

Haha :) But some of the VV endpoints already did this, right?

Yes, so thats why I'll add the exon numbers to the ALT loci as well because I believe VV adds chrX to the Primary assembly and chrY to the ALT alignments. This way we are providing the exon numbers for both simultaneously which thankfully neatly wraps up the PAR issue

Great! I guess we'll always have mapping to the genome, right? We don't have endpoints processing NMs without mapping? If so, you could fallback using the annotation from the NM itself (which also contains the exons annotated), right?

Exactly, all NM have a genomic mapping (I did once find a transcript with no GRCh37 or 38 alignments, just a RefSeqGene NG). So for gene variants (i.e. have hit a transcript) the dictionary will be

    "primary_assembly_loci": {
      "grch37": {
        "hgvs_genomic_description": "NC_000017.10:g.48275363C>A",
        "vcf": {
          "alt": "A",
          "chr": "17",
          "pos": "48275363",
          "ref": "C"
        },
        "exonic_location": 
            {"start": "X/Y",
             "end": "X/Y"
        }, 

..... REST of JSON

For intergenic

    "primary_assembly_loci": {
      "grch37": {
        "hgvs_genomic_description": "NC_000017.10:g.48275363C>A",
        "vcf": {
          "alt": "A",
          "chr": "17",
          "pos": "48275363",
          "ref": "C"
        },
        "exonic_location": {}, 

..... REST of JSON

In my mind this has now wrapped up all possibilities except for data duplication which is not my problem ;)

Think this works???

i3hsInnovation commented 3 years ago

Also, @John-F-Wagstaff can possibly answer, but I think from Reece Harts comments in his work, some genes/transcripts could have more than one alignment in a single chromosome. Might be an old issue or my old brain playing up though

John-F-Wagstaff commented 3 years ago

You can't have more than one, transcript id, target id, mapping method, set the same, but if you have both blat and splign mappings for the same RefSeq transcript then yes you can. I believe that you only get results for one method at a time for some of the queries, which for n<->g mappings in hgvs gets set usually by vvhgvs.global_config.mapping.alt_aln_method. As ensemble data has it's mapping method set to genebuild instead we will have to handle this somehow, if we want queries for both. If RefSeq alters the settings or otherwise modifies the alignment process then the mapping of the same transcript can change, then we have to chose between keeping the old possibly bad data, or dumping it to replace, silently changing the data for the end users, fun.

ifokkema commented 3 years ago

Haha :) But some of the VV endpoints already did this, right?

Yes, so thats why I'll add the exon numbers to the ALT loci as well because I believe VV adds chrX to the Primary assembly and chrY to the ALT alignments. This way we are providing the exon numbers for both simultaneously which thankfully neatly wraps up the PAR issue

VV does indeed, but the LOVD endpoint considers them primary (which is more logical to me, so please leave it :smile: ). Although it doesn't seem to fully work currently... as hg19 and GRCh37 doesn't give me mappings for chrY for SHOX even though gene2transcripts does... :raised_eyebrow:

Think this works???

Great!

(...) If RefSeq alters the settings or otherwise modifies the alignment process then the mapping of the same transcript can change, then we have to choose between keeping the old possibly bad data, or dumping it to replace, silently changing the data for the end-users, fun.

Ugghhh... can we perhaps detect these and hold off the update, so we can then decide on whether to keep, replace, or perhaps manually curate in case it's not that many entries?

i3hsInnovation commented 3 years ago

VV does indeed, but the LOVD endpoint considers them primary (which is more logical to me, so please leave it ๐Ÿ˜„ ). Although it doesn't seem to fully work currently... as hg19 and GRCh37 doesn't give me mappings for chrY for SHOX even though gene2transcripts does... ๐Ÿคจ

Thats not quite correct. The LOVD end-point does not consider the genomic descriptions to be anything (does not categorise anything). The LOVD endpoint is as a straight forward genomic to transcript to protein mapper so does not lift over onto all available genomic reference sequences like VV does.

The reason the VV endpoints puts Y into ALTS is because of dictionary structures. We can only have 1 primary assembly GRCh37 or 38. By putting Y into ALT we can provide the genomic HGVS for both X and Y PAR in the output. What you have to remember is that the PAR regions break/bend the idea of a primary assembly. Especially yesterday when we discussed a single transcript only mapping once to a single chromosome. The PAR kind of twists this up because it causes 2 mappings to the primary assembly and not 1. So to me placing the Y into the ALT loci, Well "its Logic Jim but not as we know it .... " ๐Ÿ˜„๐Ÿ˜„๐Ÿ˜„

On the up side none of the proposed changes will be in the LOVD endpoint because it does not categorise. ๐Ÿ‘ ๐Ÿ˜„

Thanks for the info about chrY for SHOX. Have you got any examples so I can take a look and debug? Clearly the mappings are there siince genes2transcripts does indeed have them.

Ugghhh... can we perhaps detect these and hold off the update, so we can then decide on whether to keep, replace, or perhaps manually curate in case it's not that many entries?

We really need to update this database like a couple of years ago. We are getting a lot of grief from users about missing data (updated transcripts etc). John, any idea how many transcripts this affects and which genes. What I can do is temporarily keep two dev servers running (if ever IT figure out why I cant spin up the new ones) and for the genes affected Ivo could point LOVD at the old version running on UTA and all other genes at the new version running VVTA.

We could also run LOVD against both versions perhaps and see what happens to the interpretations. This might be a really nice paper showing that RefSeq need to stop this really bad practice of silently updating alignments.

i3hsInnovation commented 3 years ago

Guys, we are starting to sound really geeky here!!!! Think we need to go to the pub and invent something!

John-F-Wagstaff commented 3 years ago

@ifokkema As I said we get to chose, hold off on changing when there is a change in the alignment spans, or overwrite. In the case when we hold off we get a list of the skipped transcripts, which we could then use to manually check and re-load if we wish, so in principal yes. Either way though the change (or lack of change) is not easy to make visible to the users.

@PeteCausey-Freeman yes I can do a conservative database version if it would help, I did not keep my logs of the last full load which I set to skip and log, so you will have to wait a bit to get the affected list though.

i3hsInnovation commented 3 years ago

Very thorough work @John-F-Wagstaff. Great stuff. So we could update the whole database but keep the old alignments intact for now?

We really need to add all the missing data i.e. new transcripts to VVTA yesterday. However, I'm happy enough to keep the current alignments for the transcripts that have been silently updated for now if this works for @ifokkema.

I then propose we log the genes affected. Spin up a dev server running a version on VVTA with all data including alignments updated and run LOVD against this server to see how much data are affected. No one has looked at the actual impacts that these silent changes can have on clinical data, so since Ivo will need to update everything anyway to accommodate the new alignments, to me it makes sense that we do something useful with it.

@John-F-Wagstaff. How much additional work is it to crate a VVTA with the old alignments maintained and also a version with all alignments updated? If its not too much work, perhaps generate both and put both onto the database server???

Thoughts please

ifokkema commented 3 years ago

VV does indeed, but the LOVD endpoint considers them primary (which is more logical to me, so please leave it :smile: ). Although it doesn't seem to fully work currently... as hg19 and GRCh37 doesn't give me mappings for chrY for SHOX even though gene2transcripts does... raised_eyebrow

That's not quite correct. The LOVD end-point does not consider the genomic descriptions to be anything (does not categorise anything).

There's an empty alt_genomic_loci and a populated primary_assembly_loci, so...? :shrug: I may be using the wrong terms here, but depending on the endpoint, I have to look for chrY mappings in either alt_genomic_loci or primary_assembly_loci, which is inconsistent. As long as I know, it doesn't really matter since I can adapt my VV object to parse both depending on the endpoint, and fetch what I need anyway, and new users might not use the LOVD endpoint so might not notice the inconsistency.

The reason the VV endpoints puts Y into ALTS is because of dictionary structures. We can only have 1 primary assembly GRCh37 or 38. By putting Y into ALT we can provide the genomic HGVS for both X and Y PAR in the output.

I see now the structure is slightly different indeed; is there a reason for this?

LOVD, simplified:

"primary_assembly_loci": {
    "grch38": {
        "NC_000023.11": {
            "hgvs_genomic_description": "NC_000023.11:g.630997del"
        },
        "NC_000024.10": {
            "hgvs_genomic_description": "NC_000024.10:g.630997del"
        }
    }
}

VV, simplified:

"alt_genomic_loci": [
    {
        "grch38": {
            "hgvs_genomic_description": "NC_000024.10:g.630997del"
        }
    }
],
"primary_assembly_loci": {
    "grch38": {
        "hgvs_genomic_description": "NC_000023.11:g.630997del"
    }
}

Well "it's Logic Jim but not as we know it .... "

:joy:

Thanks for the info about chrY for SHOX. Have you got any examples so I can take a look and debug? Clearly, the mappings are there since genes2transcripts does indeed have them.

Sure! Fetching transcripts, getting locations on X and Y: https://www35.lamp.le.ac.uk/VariantValidator/tools/gene2transcripts/SHOX?content-type=application%2Fjson

LOVD deleting a base inside this region on X: https://www35.lamp.le.ac.uk/LOVD/lovd/hg19/NC_000023.10%3Ag.591732del/all/all/tx/True?content-type=application%2Fjson

hg19 only presents a mapping on X, hg38 on both X and Y. My guess is that perhaps the LOVD endpoint doesn't map on the build that I am already using.

Ugghhh... can we perhaps detect these and hold off the update, so we can then decide on whether to keep, replace, or perhaps manually curate in case it's not that many entries?

We really need to update this database like a couple of years ago.

Oh, I didn't mean to hold off the entire update! Like many, I'm desperately waiting for it :joy: I just meant that specific transcript.

(...) Ivo could point LOVD at the old version running on UTA and all other genes at the new version running VVTA.

That's not going to work as in part I won't even know the gene when I call the endpoint (using genomic variants). But don't worry, I don't even want to try this - I'd rather have an updated database with perhaps some strange things than no update.

Guys, we are starting to sound really geeky here!!!! Think we need to go to the pub and invent something!

... once they open! :joy:

In the case when we hold off we get a list of the skipped transcripts, which we could then use to manually check and re-load if we wish, so in principle yes.

Cool!

Either way though the change (or lack of change) is not easy to make visible to the users.

True, but I'm not sure if we need to. Showing differences in a certain way sounds like a project in itself.

We really need to add all the missing data i.e. new transcripts to VVTA yesterday. However, I'm happy enough to keep the current alignments for the transcripts that have been silently updated for now if this works for @ifokkema.

Well, it's not so much that I require the old alignments, I'm just curious what the updates are. Perhaps after checking a few manually, we'll see that it's all improvements, and we can just overwrite all differences. Perhaps the other way around. Not knowing where these updates come from just makes me prefer a conservative approach.

(...) run LOVD against this server to see how much data are affected.

Do you want to check all variants in LOVD for these genes? That would definitely be a separate project :joy: (well, OK, depending on how many genes pop up)

(...) so since Ivo will need to update everything anyway to accommodate the new alignments (...)

I actually haven't thought about this yet - it depends on the differences; I really need to find a way to "tag" which variant is the input variant - genomic or on the transcript level? Otherwise, I wouldn't know what to update.

I need to get some deep work done so I'm going to have to ignore this discussion for the rest of the day probably, sorry guys! :sweat_smile:

i3hsInnovation commented 3 years ago

There's an empty alt_genomic_loci and a populated primary_assembly_loci, so...? ๐Ÿคท I may be using the wrong terms here, but depending on the endpoint, I have to look for chrY mappings in either alt_genomic_loci or primary_assembly_loci, which is inconsistent. As long as I know, it doesn't really matter since I can adapt my VV object to parse both depending on the endpoint, and fetch what I need anyway, and new users might not use the LOVD endpoint so might not notice the inconsistency.

My bad. I forgot to turn on the lift over function!!! I need coffee!!

I see now the structure is slightly different indeed; is there a reason for this?

Yes, the logic used to build the VV data structure pre-dates our conversations with you and pre-dates us realising the PAR would be an issue. The structure VV uses means the interactive website does not break because it only ever expects 1 primary assembly mapping per genome build per transcript. I could look into making the API display as it does for the LOVD endpoint but also push Y to the ALT so we don't have to change the interactive interface. If you want this function @ifokkema please open a new issue ๐Ÿ˜„

OK, I have looked at the data structures for the VV Primary assembly output and the LOVD primary_assembly output. Trying to change VV now would cause us a lot of issues, however, for Ivo I suspect only minor inconvenience and frustration. I'm going to leave it for now because it would break our system and also impact on David and others because we would need to implement a brand new API version. One for the future when we get round to API versioning, but currently no big push to change this. Sadly its legacy thinking gone bad but a work-around has been identified

Thanks for the examples. Transferring to new issue

Do you want to check all variants in LOVD for these genes? That would definitely be a separate project ๐Ÿ˜‚ (well, OK, depending on how many genes pop up)

We need an MSc student!!!!

Have fun working. Think we have done this to death.

@John-F-Wagstaff. I think all we can do for now is build VVTA with the old alignments for affected transcripts for now. Pass @ifokkema the list of genes that will update and information on the new alignments, then on the next future release update the alignments as well. This way we give LOVD time to assess the upcoming carnage ๐Ÿ‘

John-F-Wagstaff commented 3 years ago

@i3hsInnovation OK I will stick to skip on exon difference, for the first official build, and get/keep the list of transcripts affected to attach here.

i3hsInnovation commented 3 years ago

Perfect. Thanks @John-F-Wagstaff

John-F-Wagstaff commented 3 years ago

So, I have now extracted the data on the transcripts that show exon count difference between main build sequences of GRCh37 and GRCh38. There are quite a few, numerically, though proportionally it is not so much, with 277 in the current dev build. These are mostly normal between build changes, but sometimes these are quite large, see below for NM_000442.5 as an example. In addition to these however there are also transcripts mapped to both X and Y with the same sort of change, which I presume are PAR transcripts, and a few where the chromosome to which the transcript has been mapped has changed, eg. NM_001079809.1 which was mapped to chr1 with 7 putative exons and is now mapped to chr3 with only 1. Since the uta/vvta both have no internal idea of genome build, only target sequence, we will have to do something manual to get rid of the older of the multi chromosome mappings, if we decide to.

Sample of the exon number difference data - transcript Target_1 Max_exon_ord target_2 Max_exon_ord_2 NM_000442.4 NC_000017.10 1 NC_000017.11 15 NM_000442.5 NC_000017.10 1 NC_000017.11 15 NM_000500.7 NC_000006.11 10 NC_000006.12 9 NM_001005513.1 NC_000011.10 0 NC_000011.9 1 NM_001008401.3 NC_000019.10 5 NC_000019.9 6 NM_001012288.2 NC_000023.10 4 NC_000023.11 6 NM_001012288.2 NC_000023.10 4 NC_000024.10 6 NM_001012288.2 NC_000023.11 6 NC_000024.9 4 NM_001012288.2 NC_000024.10 6 NC_000024.9 4 NM_001017915.3 NC_000002.11 25 NC_000002.12 26

Sample input alignment data derived from RefSeq's official alignment files. GRCh37 - refseq NM_000442.5 PECAM1 HGNC:8823 195,2412 0,17;17,4718 NC_000017.10 splign -1 62404838,62404856;62396774,62401205 9=1I8=;270D4431= refseq NM_000442.5 PECAM1 HGNC:8823 195,2412 0,259;259,286;286,580;580,886;886,1162;1162,1411;1411,1687;1687,1975;1975,2083;2083,2111;2111,2185;2185,2239;2239,2302;2302,2359;2359,2382;2382,6813 NW_003315947.1 splign -1 194448,194707;194335,194362;181670,181964;178897,179203;173596,173872;166995,167244;163986,164262;159957,160245;158779,158887;157337,157365;156236,156310;154226,154280;152106,152169;145480,145537;133546,133569;123261,127692 259=;27=;294=;306=;276=;249=;276=;288=;108=;28=;74=;54=;63=;57=;23=;4431= GRCh38 refseq NM_000442.5 PECAM1 HGNC:8823 195,2412 0,259;259,286;286,580;580,886;886,1162;1162,1411;1411,1687;1687,1975;1975,2083;2083,2111;2111,2185;2185,2239;2239,2302;2302,2359;2359,2382;2382,6813 NC_000017.11 splign -1 64390601,64390860;64390488,64390515;64377823,64378117;64375050,64375356;64369749,64370025;64363148,64363397;64360139,64360415;64356110,64356398;64354932,64355040;64353490,64353518;64352389,64352463;64350379,64350433;64348259,64348322;64341633,64341690;64329699,64329722;64319414,64323845 259=;27=;294=;306=;276=;249=;276=;288=;108=;28=;74=;54=;63=;57=;23=;4431=

Hope this helps

vidboda commented 3 years ago

wow guys impressive discussion.

May I add another enquiry? Would it be possible to add regions (exon/intron) boundaries within the VV object for each transcript (genomic coordinates)? I mean the boundaries of the region containing the variant. This may allow me to simplify the MobiDetails database and perhaps treat more than the 18,500 current genes available.

something like:

"primary_assembly_loci": {
      "grch37": {
        "hgvs_genomic_description": "NC_000017.10:g.48275363C>A",
        "vcf": {
          "alt": "A",
          "chr": "17",
          "pos": "48275363",
          "ref": "C"
        },
        "exonic_location": 
            {"start": "X/Y",
             "end": "X/Y",
             "genomic_start1": "XXXXXX",
             "genomic_end1": "YYYYYY",
             "genomic_start2": "ZZZZZZ",
             "genomic_end2": "AAAAAA",
        }, 

..... REST of JSON

requires many additional fields, but useful from my point of view.

Peter-J-Freeman commented 3 years ago

Hi @beboche. I think this might become a second iteration since the initial build is being created by a couple of very keen NHS scientist students. However, these guys always end up surprising us so it could end up being an iteration 2 as far as we are concerned, but we may well be able to finish it and deploy since they will be pulling in this data to create the initial json anyway.

There will be a slight amendment because you forgot transcript coordinates as well as genomic so I think this is more likely

"primary_assembly_loci": {
      "grch37": {
        "hgvs_genomic_description": "NC_000017.10:g.48275363C>A",
        "vcf": {
          "alt": "A",
          "chr": "17",
          "pos": "48275363",
          "ref": "C"
        },
        "genomic_span": {
          "end_position": xxxxxxxxxxxx,
          "start_position": yyyyyyyyyy
        },
        "exonic_structure": 
            [[tx_start, tx_end, geno_start, geno_end], [............]], 

Where tx_start will be the absolute position in the transcript and geno_start will be the matching genomic alignment. I think the list of lists will increase in gene order rather than genomic order so geno_end will for antisense genes be < geno_start. Pretty standard practice in alignment files.

This sound OK?

vidboda commented 3 years ago

This sounds ok, thanks @PeteCausey-Freeman

ifokkema commented 3 years ago

I mean the boundaries of the region containing the variant. This may allow me to simplify the MobiDetails database and perhaps treat more than the 18,500 current genes available.

Wouldn't the gene2transcripts endpoint be a better place for this? The region overlapping the variant can then easily be determined by comparing the variant's position with the exon list from gene2transcripts. Having this data would actually also allow us to calculate the exon/intron number ourselves, but oh well.