openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0
69 stars, 21 forks

Collecting recently failed variants as a list. please add #545

Open Peter-J-Freeman opened 1 year ago

Peter-J-Freeman commented 1 year ago

chr5:112839840_112839842delGGCinsTGA b38

Peter-J-Freeman commented 1 year ago

19-40397933-ATCT-A b38
11-118505219-TTC-T b38
5-177248182-G-A b38
chr10:g.102360218C>G b38

All seem to be the same error

Traceback (most recent call last):
  File "/local/miniconda3/envs/vvweb_v2/lib/python3.10/site-packages/VariantValidator/modules/vvMixinCore.py", line 739, in validate
    toskip = self._get_transcript_info(my_variant)
  File "/local/miniconda3/envs/vvweb_v2/lib/python3.10/site-packages/VariantValidator/modules/vvMixinCore.py", line 1611, in _get_transcript_info
    variant.gene_symbol = entry['hgnc_symbol']

NOTE: These are now fixed

leicray commented 12 months ago

The attached text file contains a long list of variants that have triggered ERROR messages from the interactive validation tool since the start of September this year.

Some of these might now be handled correctly since the recent patches.

variants that trigger error messages.txt

GRCh37 variants fixed

leicray commented 12 months ago

It looks like a user is trying to validate NM_024496.4:c.369_374del which does validate correctly in the interactive tool.

However, the error message says:

Internal Server Error: /bed/

TypeError at /bed/ create_bed_file() missing 5 required positional arguments: 'chromosome', 'build', 'genomic', 'vcf', and 'version'

That looks like the vcf2hgvs tool is being used. However, that would require the user to place the variant in a text file and then upload that file to the vcf2hgvs tool. Possible, but unlikely.

Peter-J-Freeman commented 11 months ago

Variant: 1-156138613-C-T

Hello, I'm having a problem validating the synonymous variant in LMNA (ClinVar ID 14500) - NM_170707.4(LMNA):c.1824C>T p.(Gly608=). I tried different ways, including chr1(GRCh38):g.156138613C>T and 1-156138613-C-T. Message error: Unable to validate the submitted variant against the GRCh38 assembly Thank you in advance.

Peter-J-Freeman commented 11 months ago

It looks like a user is trying to validate NM_024496.4:c.369_374del which does validate correctly in the interactive tool.

However, the error message says:

Internal Server Error: /bed/

TypeError at /bed/ create_bed_file() missing 5 required positional arguments: 'chromosome', 'build', 'genomic', 'vcf', and 'version'

That looks like the vcf2hgvs tool is being used. However, that would require the user to place the variant in a text file and then upload that file to the vcf2hgvs tool. Possible, but unlikely.

This is the code trying to create a UCSC link I believe. Not VCF. Thanks for logging it

leicray commented 11 months ago

Here is another one that ought not to trip up the system: NM_000179.3:c.4083dup

It generates error messages from the interactive service and submission to the batch tools also fails. The reference sequence is the MANE Select transcript for the MSH6 gene.

The traceback message for failure to validate via the batch tool is:

Traceback (most recent call last):
  File "/local/py3Repos/variantValidator/VariantValidator/modules/vvMixinCore.py", line 752, in validate
    toskip = mappers.transcripts_to_gene(my_variant, self, select_transcripts_dict_plus_version)
  File "/local/py3Repos/variantValidator/VariantValidator/modules/mappers.py", line 643, in transcripts_to_gene
    protein_dict = validator.myc_to_p(hgvs_coding, variant.evm, re_to_p=False, hn=variant.hn)
  File "/local/py3Repos/variantValidator/VariantValidator/modules/vvMixinInit.py", line 535, in myc_to_p
    start_aa = utils.one_to_three(aa_seq[0])
IndexError: string index out of range

In addition, this triggers a further exception:

Traceback (most recent call last):
  File "/local/miniconda3/envs/vvweb_v2/lib/python3.10/site-packages/celery/app/trace.py", line 412, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/local/miniconda3/envs/vvweb_v2/lib/python3.10/site-packages/celery/app/trace.py", line 704, in __protected_call__
    return self.run(*args, **kwargs)
  File "/local/VVweb/web/tasks.py", line 60, in batch_validate
    output = validator.validate(variant, genome, transcripts)
  File "/local/py3Repos/variantValidator/VariantValidator/modules/vvMixinCore.py", line 1462, in validate
    raise fn.VariantValidatorError('Validation error')
VariantValidator.modules.utils.VariantValidatorError: Validation error

Peter-J-Freeman commented 11 months ago

Thanks.

I think we have an issue open for debugging. Can you please add it? I want to do some debugging in a couple of weeks to release a new build.

leicray commented 11 months ago

What do you mean by "add it"? This is the report.

Peter-J-Freeman commented 11 months ago

Sorry, I meant to the open git issue. You already collated a few variants that fail processing I believe??

Grant coming well. Should be able on time.

leicray commented 11 months ago

Here is another one that trips up the interactive and batch validators:

11:2587692del (GRCh38)

Peter-J-Freeman commented 11 months ago

Thanks @leicray. Realised it's a git email this time. I'm gonna do a little debugging now. Need time away from grant writing

leicray commented 11 months ago

And another one:

chr5:g.125887814C>T (GRCh37)

Peter-J-Freeman commented 11 months ago

Will come back to this one NG_059281.1:g.4962G>C (GRCh38). It's a database issue. Missing records

Peter-J-Freeman commented 11 months ago

This one too NG_061374.1:g.11229T>C (b38)

Peter-J-Freeman commented 11 months ago

So, the issue was that RefSeq are not maintaining RefSeqGene lookup tables. I added code to get the data from the API on fails. These variants are now fixed, but will not be fixed live until I do a new database build

Peter-J-Freeman commented 11 months ago

or at least do an interim update on the live servers, which may be quicker for now.
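The "get the data from the API on fails" approach described above can be sketched roughly as follows. This is only an illustration of the fallback pattern: `lookup_refseqgene`, `local_table` and `fetch_remote` are hypothetical names, not VariantValidator's actual functions, and the remote call is passed in as a plain callable so that nothing about the real API is assumed.

```python
def lookup_refseqgene(accession, local_table, fetch_remote):
    """Return the record for an accession, falling back to a remote source.

    local_table  : dict standing in for the local RefSeqGene lookup table
    fetch_remote : hypothetical callable that wraps an API request and
                   returns a record, or None when nothing is found
    """
    record = local_table.get(accession)
    if record is None:
        # Local lookup failed (e.g. the table is out of date), so ask the API.
        record = fetch_remote(accession)
    return record
```

With this shape the remote source is only contacted when the local lookup misses, so variants whose records are already in the database behave exactly as before.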

Peter-J-Freeman commented 11 months ago

Here is another one that trips up the interactive and batch validators:

11:2587692del (GRCh38)

I don't know if I have the words.

leicray commented 11 months ago

I did wonder about that one. However, there is a genome build provided, a chromosome, a nucleotide number, and the nature of the change to that nucleotide. In a sense, it's little different from chr17:50198002C>A. What am I missing?

Peter-J-Freeman commented 11 months ago

It's not that simple sadly. I will need to figure out where to put a regex to catch it. I'm sure it'll fit, hopefully with the code that allows chr17:50198002C>A. The difference is that chr17:50198002C>A is derived as part of pseudo-VCF re-formatting. The description 11:2587692del is a bit different because 50198002C>A comes from 50198002:C:A, whereas 11:2587692del should be derived from something like 50198002:CC:C, not "del". Hopefully it's a quick tweak though. Fun times! At least you came up with a reasonable explanation as to where the description came from
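To make the distinction above concrete, here is a minimal sketch of pseudo-VCF re-formatting with an extra pattern for the bare `11:2587692del` form. The function and pattern names are hypothetical and this is not the code used in VariantValidator; it handles only substitutions and simple anchored deletions, and it does no reference checking or 3' normalisation.

```python
import re

# chrom-pos-ref-alt or chrom:pos:ref:alt, with an optional "chr" prefix
PSEUDO_VCF = re.compile(
    r"^(?:chr)?([0-9]+|[XYM])[-:](\d+)[-:]([ACGT]+)[-:]([ACGT]+)$", re.I)
# chrom:pos followed directly by "del", e.g. 11:2587692del
BARE_DEL = re.compile(r"^(?:chr)?([0-9]+|[XYM]):(\d+)del$", re.I)


def pseudo_vcf_to_g(text):
    """Return a chrN:g. style description, or None if nothing matches."""
    m = PSEUDO_VCF.match(text)
    if m:
        chrom, pos = m.group(1).upper(), int(m.group(2))
        ref, alt = m.group(3).upper(), m.group(4).upper()
        if len(ref) == 1 and len(alt) == 1:
            # 50198002:C:A becomes 50198002C>A
            return f"chr{chrom}:g.{pos}{ref}>{alt}"
        if len(alt) == 1 and ref.startswith(alt):
            # VCF-style deletion: the first base is an anchor, the rest is deleted
            start, end = pos + 1, pos + len(ref) - 1
            span = f"{start}" if start == end else f"{start}_{end}"
            return f"chr{chrom}:g.{span}del"
        return None  # other indels are out of scope for this sketch
    m = BARE_DEL.match(text)
    if m:
        # No ref/alt bases supplied: the position itself is taken as deleted
        return f"chr{m.group(1).upper()}:g.{m.group(2)}del"
    return None
```

For example, `pseudo_vcf_to_g("17:50198002:CC:C")` yields a deletion at position 50198003 (the base after the anchor), while `pseudo_vcf_to_g("11:2587692del")` keeps the stated position because no anchor base is present.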

Peter-J-Freeman commented 9 months ago

NC_000023.11:r.650_831del

leicray commented 8 months ago

chr11:g,108121787G>A GRCh37

The anonymous submitter also tried GRCh38 and that failed too, of course.

This should be easy to trap and correct as the comma just needs to be replaced by a full stop.
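A trap for this kind of typo could look roughly like the sketch below. It is an assumption about how such a fix might be written, not the fix that was applied; `fix_type_comma` is a hypothetical name, and the set of reference types is taken from the "g,c,n,r, or p" list quoted later in this thread.

```python
import re

# A reference-type letter followed by a comma where a full stop is expected
TYPE_COMMA = re.compile(r":([gcnrp]),")


def fix_type_comma(description):
    """Replace 'g,' with 'g.' (and likewise for c, n, r, p) after the colon.

    Returns the (possibly corrected) description and a warning string,
    or None as the warning when nothing was changed.
    """
    fixed, count = TYPE_COMMA.subn(r":\1.", description, count=1)
    if count == 0:
        return description, None
    return fixed, f"{description} updated to {fixed}: comma replaced by full stop"
```

Returning a warning alongside the corrected text keeps the user informed that the submitted description was not valid as typed.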

Peter-J-Freeman commented 8 months ago

Will get this one done asap. Easy one hopefully

leicray commented 8 months ago

An anonymous user has tried to validate LRG_199p1:p.? and it has failed, generating an error message.

If I rewrite the variant description as LRG_199p1:p.Met1Ala I receive the expected warnings:

- LRG_199p1:p.Met1Ala automapped to equivalent RefSeq record NP_003997.1:p.Met1Ala

- Protein level variant descriptions are not fully supported due to redundancy in the genetic code

- NP_003997.1:p.Met1Ala is HGVS compliant and contains a valid reference amino acid description

Ought to be easy to trap.

ifokkema commented 8 months ago

If I rewrite the variant description as LRG_199p1:p.Met1Ala I receive the expected warnings:

I might be wrong, but are you suggesting that is valid syntax? Because a change to the first codon leads to an unpredictable result. The docs say:

Do not use descriptions like "p.Met1Thr", this is for sure not the consequence of the effect on protein translation.

(source)

leicray commented 8 months ago

You are quite correct. I simply wanted to generate a variant description that would not cause the validator to fall over. I have no idea what comes next after Met1 in the DMD protein sequence, so pushed on with that.

Of course, there ought to be an additional warning that p.Met1Ala is not valid and ought to be written as p.(Met1?). Even that might be wrong.

Peter-J-Freeman commented 8 months ago

This should be triggering the warning and I wonder if it is trying to and failing. Will look into it

ifokkema commented 8 months ago

You are quite correct. I simply wanted to generate a variant description that would not cause the validator to fall over. I have no idea what comes next after Met1 in the DMD protein sequence, so pushed on with that.

Ah, OK, you were just testing the reference sequence :sweat_smile: Never mind me!

Peter-J-Freeman commented 8 months ago

I'm still worried that the Met1 warning wasn't generated. So 2 fixes here. A chance to increase code coverage :P

Peter-J-Freeman commented 8 months ago

@leicray @ifokkema. OK, here I put a spanner in the works. p.Met1Ala could actually be correct, whereas p.(Met1Ala) would be p.(Met1?)

ifokkema commented 8 months ago

Hmm... I don't think that has ever been observed in humans... ClinVar reports this variant, but ClinVar always lies when it comes to protein descriptions :roll_eyes: Are you thinking of ever providing full protein description validation? If not, I would personally ignore the near-zero chance of any substitution in the Met1 codon. While translation has been proven to sometimes start at non-ATG start codons, we're actually talking about the situation where a canonical transcript by default started with ATG but now also tolerates a non-ATG start induced by a variant. CTG being the most common non-ATG start codon, in theory, a p.Met1Leu could occur. Googling around allowed me to find one paper mentioning this, but at the same time, the variant also lowered translation considerably, so even then, p.Met1Leu wouldn't actually be the correct description.

leicray commented 8 months ago

An anonymous user has been trying to validate NP_000059.3:p.(=) and NP_000059.3:p.= resulting in the on-screen error messages Unable to validate the submitted variant NP_000059.3:p.(=) against the GRCh38 assembly. and Unable to validate the submitted variant NP_000059.3:p.= against the GRCh38 assembly. In addition, they also result in ERROR messages to the admins.

Minimally, clearer on-screen error messages are needed.

Peter-J-Freeman commented 8 months ago

Hmm... I don't think that has ever been observed in humans... ClinVar reports this variant, but ClinVar always lies when it comes to protein descriptions 🙄 Are you thinking of ever providing full protein description validation? If not, I would personally ignore the near-zero chance of any substitution in the Met1 codon. While translation has been proven to sometimes start at non-ATG start codons, we're actually talking about the situation where a canonical transcript by default started with ATG but now also tolerates a non-ATG start induced by a variant. CTG being the most common non-ATG start codon, in theory, a p.Met1Leu could occur. Googling around allowed me to find one paper mentioning this, but at the same time, the variant also lowered translation considerably, so even then, p.Met1Leu wouldn't actually be the correct description.

Not disagreeing, just want to hit a consensus. Met is def not the only human init amino acid. I agree with what you are saying, I just wanted to be semantic over the fact that no parentheses would usually be treated as an observation. It would be far easier to always assume Met1?

If there are no arguments against, then I will do this

leicray commented 8 months ago

I genuinely need some education on this matter. Is it the case that some proteins initiate with an amino acid other than Met? Alternatively, is an initiating Met sometimes specified by a codon other than ATG?

Peter-J-Freeman commented 8 months ago

Yes, there are quite a few human genes that initiate at an amino acid other than Met / ATG and a whole array of different initiation codons / amino acids.

Many but not all came from genes that transferred from Mitochondria into the Human Genome. These are the codons that are used that I am aware of

'ATT', 'ATC', 'ATA', 'ATG', 'GTG', 'ACG'

https://github.com/openvar/variantValidator/blob/master/VariantValidator/modules/utils.py#L469

ifokkema commented 8 months ago

Not disagreeing, just want to hit a consensus. Met is def not the only human init amino acid. I agree with what you are saying, I just wanted to be semantic over the fact that no parentheses would usually be treated as an observation. It would be far easier to always assume Met1?

This is my thought pattern;

I genuinely need some education on this matter. Is it the case that some proteins initiate with an amino acid other than Met?

Yes, the most common one is CTG (Leu). But although experiments have shown lots of transcripts have translation initiation at codons different from ATG, it's not always clear if these initiations actually lead to a functional product and whether, indeed, the product's protein sequence fully matches the RNA sequence.

Alternatively, is an initiating Met sometimes specified by a codon other than ATG?

Not to my knowledge, no.

Peter-J-Freeman commented 8 months ago

Yes, the most common one is CTG (Leu). But although experiments have shown lots of transcripts have translation initiation at codons different from ATG, it's not always clear if these initiations actually lead to a functional product and whether, indeed, the product's protein sequence fully matches the RNA sequence.

However, the protein reference sequences for these genes do not begin with Met

ifokkema commented 8 months ago

However, the protein reference sequences for these genes do not begin with Met

Actually, my point is that many do. I'm talking about non-annotated translation initiation sites (TIS) as alternative to the canonical ATG TIS present in the same gene. Ribosome profiling shows the translation initiation and we have analyzed the sequence at which these events occur and on what transcripts. Most non-canonical TIS were CTG and had an annotated ATG TIS in the same gene.

Peter-J-Freeman commented 8 months ago

Either, we pretend to be smarter than the user, and we make a "whitelist" of amino acids that Met1 can change into. We could allow p.Met1Leu or force it to p.[Met1Leu,0]. Other amino acids will be changed to p.(Met1?). We could choose to allow p.(Met1...) assumptions more broadly. I wouldn't personally force p.Met1Leu into p.(Met1?), since p.Met1Leu has been observed (albeit with a lower translation as well).

If we wanted to do this then the amino acids I know of would be

'ATT', 'ATC', 'ATA', 'ATG', 'GTG', 'ACG'
'Ile', 'Ile', 'Ile', 'Met', 'Val', 'Thr'

I do not have CTG in the list which is derived from RefSeq plus ACG which I found due to a processing error!

For clarity, these are the initiation codons I am aware of that are annotated as the initiation codon in the c. reference and the amino acid at position 1 of the corresponding p. reference sequence

Peter-J-Freeman commented 8 months ago

Actually, my point is that many do. I'm talking about non-annotated translation initiation sites (TIS) as alternative to the canonical ATG TIS present in the same gene. Ribosome profiling shows the translation initiation and we have analyzed the sequence at which these events occur and on what transcripts. Most non-canonical TIS were CTG and had an annotated ATG TIS in the same gene.

Ah, I see. Gonna pretend I don't know this ;P I think we stick to what's in the reference or we open a whole big can of worms!

Peter-J-Freeman commented 8 months ago

OK @leicray and @ifokkema . For approval

Variant = LRG_199p1:p.(Met1Ala)
Variant NP_003997.1:p.(Met1Ala) affects the initiation amino acid so is better described as NP_003997.1:p.(Met1?)
And we update the descriptions accordingly

Variant = LRG_199p1:p.Met1Ala
Variant NP_003997.1:p.Met1Ala affects the initiation amino acid so is better described as NP_003997.1:p.(Met1?)
And we update the descriptions accordingly and update to the () syntax.

leicray commented 8 months ago

This is entirely reasonable for all proteins that initiate translation with Met. We know that LRG_199p1 begins with Met.

However, there are some human proteins that do not initiate with Met, but with other amino acids. Sequence variants in the start codons of such proteins could result in the initiating amino acid being changed to another amino acid that is also capable of initiating translation, unless it is clearly known that that is not possible in general, or for particular proteins.

For example, c.3T>G would change the start codon from Ile to Met for a protein that normally initiates with Ile. If the change from Ile to Met still supports initiation of translation, the protein-level description would have to be p.(Ile1Met), rather than p.(Ile1?).

I presume that non-Met initiation codons must also lie within a Kozak consensus sequence. However, I do not know if that is actually the case.

Peter-J-Freeman commented 8 months ago

Yes,

I have made the code spot position 1, not just Met1, both during detection and conversion. So, interestingly, there is an additional test. In theory a variant described as p.(Met1Ala) for a reference that does not start with Met will still be updated to the correct amino acid.
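The behaviour described above, detecting any substitution at position 1 and rewriting it with the reference's own first amino acid, might be sketched like this. The names are hypothetical and the first amino acid is simply passed in as an argument; in the real tool it would presumably come from the protein reference sequence. This is an illustration of the idea, not the committed code.

```python
import re

# Three-letter substitution at protein position 1, with or without parentheses
POSITION_ONE_SUB = re.compile(r"^p\.\(?([A-Z][a-z]{2})1([A-Z][a-z]{2})\)?$")


def rewrite_initiation_variant(accession, p_description, ref_first_aa):
    """Rewrite a position-1 substitution as p.(Xaa1?).

    ref_first_aa is the three-letter code of the amino acid at position 1
    of the reference, which need not be Met.
    Returns (description, warning); warning is None when nothing changed.
    """
    if POSITION_ONE_SUB.match(p_description) is None:
        return f"{accession}:{p_description}", None
    corrected = f"{accession}:p.({ref_first_aa}1?)"
    warning = (
        f"Variant {accession}:{p_description} affects the initiation amino acid "
        f"so is better described as {corrected}"
    )
    return corrected, warning
```

Because the pattern keys on position 1 rather than on "Met", a description such as p.(Met1Ala) submitted against a reference that starts with Ile would come back as p.(Ile1?). Whether a case like p.(Ile1Met) should be left alone, as raised earlier in the thread, is not addressed by this sketch.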

leicray commented 7 months ago

A user has twice submitted the invalid variant description DPYD:C.1905+1G>A. The corrected description NM_000110.4:c.1905+1G>A validates correctly.

The invalid description generates the on-screen error message Unable to validate the submitted variant DPYD:C.1905+1G>A against the GRCh38 assembly. but also causes an ERROR message to be sent to sysadmins.

Peter-J-Freeman commented 7 months ago

Thanks.

This would usually be an error that we catch, but the combination of gene symbol and upper-case C. is going to be why it's being missed. Will try to resolve today.

ifokkema commented 7 months ago

In case it helps, in our logic, we split first on colon, if present. If the first part doesn't look like a reference sequence, we generate a separate error for that. Then, we process everything after the colon, and, in this case, we would generate the warning that the c. should be in lowercase. The suggestion of that correction is only shown when the first error is fixed, though, as suggestions are hidden if they themselves are not correct (and DPYD:c.1905+1G>A is not valid).

Screenshot_2024-02-19_11-35-41

Submitted without the reference sequence:

Screenshot_2024-02-19_11-37-51
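The split-on-colon logic described in this comment can be sketched along the following lines. The pattern for "looks like a reference sequence" and all names here are assumptions made for illustration; neither tool's real rules are reproduced, and identifier length is deliberately not checked.

```python
import re

# Loose shape check only: two-letter prefix, underscore, digits, version; or an LRG
REFERENCE_LIKE = re.compile(r"^(?:[A-Z]{2}_\d+\.\d+|LRG_\d+(?:[tp]\d+)?)$")
VALID_TYPES = set("gcnrp")


def check_description(description):
    """Return a list of messages; an empty list means no problem was spotted."""
    if ":" not in description:
        return ["No reference sequence found before a colon"]
    messages = []
    # Step 1: split on the first colon and look at the reference part
    reference, _, variant = description.partition(":")
    if not REFERENCE_LIKE.match(reference):
        messages.append(f"'{reference}' does not look like a reference sequence")
    # Step 2: process everything after the colon independently
    prefix = re.match(r"^([A-Za-z])\.", variant)
    if prefix:
        letter = prefix.group(1)
        if letter not in VALID_TYPES and letter.lower() in VALID_TYPES:
            messages.append(f"'{letter}.' should be written in lowercase as '{letter.lower()}.'")
    return messages
```

Checking the two halves separately is what lets `DPYD:C.1905+1G>A` produce both messages instead of failing on the first problem.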

leicray commented 7 months ago

This sort of relates to the previous "failed variant".

A user submitted the variant description NP_00455.4:p.P217_Q220delinsG for validation. The displayed error message for this variant is Unable to validate the submitted variant NP_00455.4:p.P217_Q220delinsG against the GRCh38 assembly. In addition, an ERROR message is sent to the sysadmins.

The sequence identifier is clearly invalid because the minimum length of the numeric part of the sequence identifier is 6 digits prior to the version. That ought to be trapped during initial parsing.

Peter-J-Freeman commented 7 months ago

The sequence identifier is clearly invalid because the minimum length of the numeric part of the sequence identifier is 6 digits prior to the version. That ought to be trapped during initial parsing.

No, this is not as simple a catch as it may sound. We have never looked at the length of the identifier, and they are not a standard length. I'll trap this one too.

In case it helps, in our logic, we split first on colon, if present. If the first part doesn't look like a reference sequence, we generate a separate error for that. Then, we process everything after the colon, and, in this case, we would generate the warning that the c. should be in lowercase.

Similar here, it's just that the downstream logic in this particular event may be out of order so will need a tweak. Will get these sorted ASAP as I want to release the next major tag

leicray commented 7 months ago

Identifiers might not have a defined length but, in practice, there is a minimum length.

Peter-J-Freeman commented 7 months ago

True, but ultimately, the message is "ID is incorrect / absent from our database" We need to keep it simple

Peter-J-Freeman commented 7 months ago

DPYD:C.1905+1G>A now will return

        "validation_warnings": [
            "Reference type incorrectly stated in the variant description DPYD:C.1905+1G>A Valid types are g,c,n,r, or p",
            "HGVS variant nomenclature does not allow the use of a gene symbol (DPYD) in place of a valid reference sequence: Re-submit DPYD:c.1905+1G>A and specify transcripts from the following: select_transcripts=NM_000110.3|NM_000110.4|NM_001160301.1"