openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0
69 stars 21 forks source link

variantValidator is not returning a correct gNomen #627

Closed RanWei1987 closed 2 months ago

RanWei1987 commented 3 months ago

On variant validator website (https://variantvalidator.org/service/validate/), when typing in the variant in vcf form:

chr1:100340225:G:GTCTTTTCTTTCTTTTAGAAAATAGTGACAGTTTTAATCTCTTTGTAGATATTTGCATTTAAGGTATCA

and select "GRCh37 (hg19)", the website returns correct cNomen:

NM_000642.3:c.965_966insTGACAGTTTTAATCTCTTTGTAGATATTTGCATTTAAGGTATCATCTTTTCTTTCTTTTAGAAAATAG

but incorrect gNomen:

NC_000001.10:g.100336426_100340225delinsAAAATAGTGACAGTTTTAATCTCTTTGTAGATATTTGCATTTAAGGTATCA

In addition, the VCF descrition returned is changed to an extremely long and definitely wrong one. I am not listing it here.

Could anyone look into this? The correct gNomen should be

Chr1(GRCh37):g.100340249_100340250ins68

leicray commented 3 months ago

I presume that you are a user of Alamut Visual Plus as the terms "cNomen" and "gNomen" are, as far as I know, peculiar to that sequence analysis package. They are not approved HGVS terms. The genomic description Chr1(GRCh37):g.100340249_100340250ins68 is not HGVS-compliant and looks like it was generated in part by Alamut Visual Plus.

We will investigate this issue, but it may take some time as our lead developer is currently away.

RanWei1987 commented 3 months ago

I presume that you are a user of Alamut Visual Plus as the terms "cNomen" and "gNomen" are, as far as I know, peculiar to that sequence analysis package. They are not approved HGVS terms. The genomic description Chr1(GRCh37):g.100340249_100340250ins68 is not HGVS-compliant and looks like it was generated in part by Alamut Visual Plus.

We will investigate this issue, but it may take some time as our lead developer is currently away.

Yes, I use Alamut and variant validator to cross-validate complex variants. Sometimes Alamut makes mistake, but this time variant validator has an issue. As you pointed out, the genomic description given by Alamut is not HGVS-compliant but VV gives a rediculous desription - it should not be a ~4000bp delins.

John-F-Wagstaff commented 3 months ago

Thank you for reporting this issue.

This looks like it may be a problem with a variant that maps exactly at the intron/exon boundaries. Specifically it looks like your variant maps exactly at the boundary of the 7th intron of NM_000642.3. As you can imagine these locations make mapping to and from transcripts much more complex, and getting this right is an ongoing source of problems.

I will dig more into how exactly this is going wrong, and see about fixing it if I can, but pushing the fix to the website may take longer than normal due to the issues @leicray mentioned above. if you do not object then we will also add your problematic variant to the list of test variants, in order to prevent this problem from re-occurring.

I will update you with more information when I have tracked the exact issue down and, if possible, built a preliminary fix.

leicray commented 3 months ago

I have had a look at this in Alamut Visual Plus and, AFAICS, the variant does not map precisely to an intron exon boundary, but just into the start of exon 8. This might be an issue with regard to where the actual boundary does map: see attached screenshot from Alamut Visual Plus.

AGL GRCh37 exon 8 start

It looks like the 3' boundary of the upstream intron could potentially be mapped to the highlighted G at g.100340249 rather than at the displayed position seven nucleotides upstream. Either location might work as a splice acceptor site, but the real one would need to be determined empirically.

John-F-Wagstaff commented 3 months ago

Ah sorry I was simplifying perhaps a bit too much there.

The hgvs standard requires us to normalise all indels in a 5' direction if this is possible, in order to stop the same variant being reported as if it was different, and we have code to do this. I was actually suspecting the confluence of our code to push in/dels in a 5' direction and the exon edge mapping issues, since the output goes from being a 63bp insertion 17 bp past the end of the 7th intron to spanning the entirety of the 7th intron.

This is not a part of the code that I have done much work on, so I am just speculating from the previous problems I have seen in the area, hence the "may be", not "is". I have put a few tracers into my local copy to try and find the issue, but I need to dedicate a fair bit of time to track this kind of complex bug down, depending on where the problem is "hiding".

RanWei1987 commented 3 months ago

Thank you for reporting this issue.

This looks like it may be a problem with a variant that maps exactly at the intron/exon boundaries. Specifically it looks like your variant maps exactly at the boundary of the 7th intron of NM_000642.3. As you can imagine these locations make mapping to and from transcripts much more complex, and getting this right is an ongoing source of problems.

I will dig more into how exactly this is going wrong, and see about fixing it if I can, but pushing the fix to the website may take longer than normal due to the issues @leicray mentioned above. if you do not object then we will also add your problematic variant to the list of test variants, in order to prevent this problem from re-occurring.

I will update you with more information when I have tracked the exact issue down and, if possible, built a preliminary fix.

Hi John, feel free to add this variant to the test list. I am glad this would help improve VV performance.

John-F-Wagstaff commented 3 months ago

Hello RanWei1987 thank you for your permission to use this variant in tests.

The current (bad) state

I have finished going through the code, with your variant, and we seem to have the following process going on internally - Firstly chr1:100340225:G:GTCTTTTCTTTCTTTTAGAAAATAGTGACAGTTTTAATCTCTTTGTAGATATTTGCATTTAAGGTATCA in GRCh37 gets converted to NC_000001.10:g.100340249_100340250insTGACAGTTTTAATCTCTTTGTAGATATTTGCATTTAAGGTATCATCTTTTCTTTCTTTTAGAAAATAG which is basically what you want and expect.

This is then mapped to transcripts, NM_000642.3 in your case, as NM_000642.3:c.965_966insTGACAGTTTTAATCTCTTTGTAGATATTTGCATTTAAGGTATCATCTTTTCTTTCTTTTAGAAAATAG which we then verify by mapping back to the genome as NC_000001.10:g.100340249_100340250insTGACAGTTTTAATCTCTTTGTAGATATTTGCATTTAAGGTATCATCTTTTCTTTCTTTTAGAAAAT

However this is where 5' shifting comes into play. Since we always want the same real variant to be reported in the same way the hgvs standard specifies shifting any in/del type variant as 5' as possible within transcripts, which makes NM_000642.3:c.958_959insAAAATAGTGACAGTTTTAATCTCTTTGTAGATATTTGCATTTAAGGTATCATCTTTTCTTTCTTTTAG the official form of the variant within the transcript, and this is where we run into problems.

Most variant definitions for hgvs are based on whole base count coordinates, but this does not work for insertions and so the ins is defined as between two bases - as NM_000642.3:c.958_959ins, unfortunatly 958 and 959 are the 2 bases either side of the intron and so we have problems deciding what to do. At the moment this then causes the genomic mapping of NC_000001.10:g.100336426_100340225delinsAAAATAGTGACAGTTTTAATCTCTTTGTAGATATTTGCATTTAAGGTATCA but, because of the problems re-mapping this to the transcript the reported transcript variant gets reverted to the original. Unfortunately the genomic mapping does not get properly reverted too, which then leads to the output you received.

Solutions and limitations

I could put in code to revert the genomic mapping too at this point, and might do this, for now, instead of a better fix. Although, without my colleague currently available to review my code before submission, even this would take a while to get onto the website. This however would leave us unable to round trip this type of variants from the genome to the transcript or in order to verify their correctness.

Although we can work out from the input, in your case, that the insertion happens at the end of the intron, the genomic mapping code itself only has access to the specific variant being mapped at this point, and so can not tell whether we need - exon-intron-ins-exon exon-ins-intron-exon exon-ins-exon (i.e. the intron got deleted and replaced with a small ins at the break-point, which is what the current apparent output looks like) or even a different variant in the intron that moves the splice site.

As such any fix will only work for originally genomic variants descriptions. Even so I have a kludge added to my local version, that may be refineable into proper error handling code, that can work out that the original genomic mapping points at one of the ends (100340225 in your case, which would then point back at NC_000001.10:g.100340249_100340250ins). This would fix the problem of getting exon-ins-exon back for either exon-intron-ins-exon or exon-ins-intron-exon variants, when we start with a pre existing genomic variant call. This is not perfect but would be a better way of handling issues like yours. Unfortunately, besides tidying this code we will need to put in a fair bit more work to

  1. get this fixed for any similar genomic inputs that make sense, not just the current test input
  2. make sure that this does not produce inappropriate results in other circumstances,
  3. give appropriate warnings when we can't catch issues, and
  4. add sufficient tests to prevent this functionality from degrading in the future.

I will also need to consult on some of the details too, so it might be a while before a full fix is pushed to the server.

I will update you further on my progress when my colleague who is more in tune with this area of the code is available to go through this with me and verify my code structure and intended output.

Peter-J-Freeman commented 2 months ago

Hi all

Fixed this. Turned out to be a 1-liner :)

Will be rolled out soon. Want to upodate the servers this week

John-F-Wagstaff commented 2 months ago

ugh, I was trying to do a full fix in myvm_t_to_g, which ended up being a lot more complex, and might have required other changes elsewhere too. That is a lot simpler. Thank you.

Peter-J-Freeman commented 2 months ago

Haha. @John-F-Wagstaff , I need to teach you my bug finding technique. It's slow, but often leads to a simple solution :)