openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0
67 stars 21 forks source link

Add error codes to all warnings #637

Open Peter-J-Freeman opened 1 month ago

Peter-J-Freeman commented 1 month ago

Is your feature request related to a problem? Please describe. We need consistent error codes for all errors and warnings so other tools can searh for them without error messages breaking flow.

Loop in @ifokkema

Peter-J-Freeman commented 1 month ago

We also need to document the codes and come up with a release strategy

Peter-J-Freeman commented 1 month ago

This arose from a recent push that broke LOVD. Sorry @ifokkema.

https://github.com/openvar/variantValidator/pull/635#issuecomment-2236172039

So I have updated the warning to include the error code "TranscriptVersionWarning: A more recent version of the selected reference sequence ENST00000382483.3 is available for genome " "build GRCh38 (ENST00000382483.4)"

However, we have lots of error messages still lacking codes. This will be a big effort that I think is vital though @John-F-Wagstaff

John-F-Wagstaff commented 1 month ago

Hello Pete This looks like a good idea in the long term, but also like a lot of work. Do you want me to add this to my current priorities and join in on this? or keep track of it for now and help, at request, for specific regions?

Code wise we probably want to create a organised pattern for our warning/error/info codes and probably want to work out and document the codes before we start to work on it. How are you thinking of structuring this?

The current suggestion seems to follow the pattern of camel case with an order of region-problem-severity: description. This would then make patterns like -

for example - hgvsDataAvailabilityError: This sequence NM_XXXXX.X could not be found in our sequence store but other people do things like catagory-severity:number description, so for example hgvsError:1 This sequence NM_XXXXX.X could not be found in our sequence store

Either way we need a constant pattern, since a lot of the time other consumers of these codes will only care about some specific part of the code, e.g. is this an error a significant warning or cautionary information, or is this a transcript error. If it is not consistent internally then our other tools, and the other users that consume these errors won't be able to actually make consistent use of them.

Any thoughts on the pattern?

Peter-J-Freeman commented 1 month ago

I like the camel case codes. We have a few in the test set alrady. Look in the test_warnings. Lets drag these into this issue and agree a pattern. We can work together on the coding. Yes, I think this woul be best to prevent us from going crazy.

What we can do is find the errors, and come up with the error code for each error together, then apply throughout in turn. I think the low hanging fruit is our own errors. Then we get more complex trying to resolve how to handle the HGVS.py errors

Peter-J-Freeman commented 1 month ago

The core search term to spot our warnings in the VV Repo will be .warnings.append( I would think

ifokkema commented 1 month ago

Thanks for this, guys! It will make our code a lot simpler and reduce the chances of breaking stuff in the LOVD/VV communication.

Code wise we probably want to create a organised pattern for our warning/error/info codes and probably want to work out and document the codes before we start to work on it.

Agreed; this was a lot of effort for me (I use codes rather than description parsing for internal communication between functions), but it really helped get everything consistent. I personally started with my test sets, assuming that most possible errors would be in there. I resorted my tests based on the type of error and then started adding the codes.

Either way we need a constant pattern, since a lot of the time other consumers of these codes will only care about some specific part of the code, e.g. is this an error a significant warning or cautionary information, or is this a transcript error.

Exactly; currently, I have a list of regexes that parse the descriptions from VV and categorize them so that I know what is what. I also rewrite some warnings and errors but still rely on consistent descriptions to recognize what we're receiving. Being able to check for some codes that I want to rewrite and knowing what the rest is based on the structure of the code will reduce the need for me to make updates whenever VV is updated. I may, in the future, even incorporate the VV error codes in the LOVD3 source code. Currently, I'm assigning my own codes based on the VV descriptions.

John-F-Wagstaff commented 1 month ago

We are adding code to better handle allele variant descriptions. This is done by separating the variant out into position edits, validating the parts and finally re-composing the variant(s) from the component validated position edits. This generates the possibility of new and "exiting" errors/warnings when data fails to match up, which will also need their own prefix.

So in the case of a Genomic/Alt/LRG/RSG mapping fails/does not exist, for some but not all of the position edits, or a, Genomic/LRG/RSG mapping where the target of the mapping is different for some of the position edits (this seems most likely as a non-error case for LRG/RSG mappings for genomic alleles), would there be a useful difference between an error/warning suffix like AlleleReasembalyDataMissmatchWarning and AlleleReasembalyDataMissingWarning, for end user purposes, or would just the more generic AlleleReasembalyWarning be sufficient on its own for both types of problem.

If you think that AlleleReasembalyLRGWarning/AlleleReasembalyGenomicWarning etc would be more useful then this is also a possibility.

ifokkema commented 3 weeks ago

Hi John, was this question (also) for me or for Pete? In case my input is requested, I'm not sure I fully follow the question. Do you have (theoretic if needed) examples?

John-F-Wagstaff commented 3 weeks ago

Hello Ivo

Yes it was mostly a question for you, Pete thinks the more generic warning for any issue putting an allele back together probably should be OK, but I was concerned that they might be too simple to actually be usefully informative that way, so I just wanted to check before putting them in. We don't want to have to change error prefixes after the fact. Both of these errors/warnings would be triggered by hgvs alleles (see Here for examples if you are unfamiliar) which need splitting up and analysing as parts, when the parts are re-composed into a singe allele again. So the question was, could you as an end user see some utility in having the separate error prefixes for the different types of problem our would just one (provisionally AlleleReasembalyWarning) be better?

If we did split this up I would currently split the warnings into 2 categories:

The first type of warning (provisionally AlleleReasembalyDataMissmatchWarning unless you or Pete had a better name) would be triggered if we had an allele description that was genomic type (but which could be main Genomic, Alt, LRG, or RSG) but for which there was no single transcript mapping that contained the locations of all of the variants described. An example of this would be an allele description that included changes to 2 neighbouring genes, all of which validate but not on the same transcript (which could then have differing LRG/RSG mappings too). There could be no single set of transcript mapping data for this allele, and any between genome build mappings (GRCh37<->GRCh38) will be using data from multiple transcript alignments. (I can also see some arguments for returning this when the input positions are partly intergenic but sometimes map to one transcript.)

The second type of warning (which might be better as an error, provisionally AlleleReasembalyDataMissingWarning) would be triggered if we had an allele describing a set of positions, but only some of them could be mapped to the current reference. I can see three types of data that might trigger this, without actual errors

  1. If we, as above, had a single input g type allele description, that affected two genes, then we could get two sets of transcript mappings, each only representing parts of the input allele description. These HGVS transcript results would need a warning to note that while they where valid internally they did not include all of the input's described positions.
  2. Alternatively allele descriptions contained within one of the longer splice variants of a gene would, produce mappings for other transcripts, but some might turn up without the full set of input position edits.
  3. Finally this might be triggered by data returned for alleles that partly overlap a gene in/with a RSG/LRG/Alt sequence. If there was a transcript mapping to both the main genomic reference and one of these we could get partial g type mappings back. For now if the input was a transcript, or a main chromosomal position, I am currently thinking that I would just remove the partial alt mapping or RSG/LRG mapping annotations from the returned data, treating it as a non error, since such returned data would be centred around the original genomic/transcript input. This might change in the future (and if so might need different error prefixes so you can tell which mappings are affected hence the suggestion that we might want AlleleReasembalyLRGWarning / AlleleReasembalyGenomicAltWarning etc if we did this to define the error out, but this would probably best be left to a later question if needed).

As an example of how you might use this, you might allow data to be used if it only has a AlleleReasembalyDataMissmatchWarning attached, since it is complete if not ideal in it's consistency, but exclude data with AlleleReasembalyDataMissingWarning as incomplete. Alternatively for users wanting data specific to individual genes data with AlleleReasembalyDataMissingWarning: attached might be fit for their purposes, if incomplete, but AlleleReasembalyDataMissmatchWarning would indicate data that might be affecting multiple genes and could need to be excluded. The question was whether you thought this would be useful in practice, or whether the detail was overkill? Sometimes it is hard to get perspective on this kind of thing when you are in the middle of writhing the code.

With thanks