openvar / variantValidator

Public repository for VariantValidator project
GNU Affero General Public License v3.0
67 stars 21 forks source link

Automatically predict timeouts and prevent them. #471

Open ifokkema opened 1 year ago

ifokkema commented 1 year ago

Is your feature request related to a problem? Please describe. VariantValidator, when having received too much work for one request, times out. This happens when, e.g.,

  1. A large variant is submitted (see #151);
  2. When a variant overlapping many transcripts is submitted, and the user doesn't restrict the output to specific transcripts;
  3. When too many options are enabled (e.g., liftover or protein change prediction).

The first point may not be recoverable, but options 2 and 3 are. While in option 3, additional options may simply be turned off, in the case of option 2, the user may not be aware of what transcripts overlap the given variant. The user may then not be able to simplify the request without some guidance.

Describe the solution you'd like It would be great if VV could predict that the given task will take too much time. If so, instead of timing out, VV could either:

Describe alternatives you've considered Internally, LOVD doesn't send variants that are too large to VV, because it expects a timeout. However, we can't always predict this. For instance, LOVD never knows how many transcripts VV knows that overlap a certain position.

Additional context Example: https://rest.variantvalidator.org/LOVD/lovd/hg19/NC_000010.10%3Ag.11330462A%3EG/refseq/all/tx/primary?content-type=application%2Fjson Currently fails even with liftover turned off, succeeds when mapping is turned off or when transcripts are selected.

Related issues:

151 Large variants return internal server errors.

464 Variant NC_000010.10:g.11330462A>G throws an unknown error.

473 Normalized description does not validate, but unnormalized description does.

Peter-J-Freeman commented 1 year ago

@ifokkema . Thanks for the suggestions.

In the first instance, I want to tweak the code to reduce the likelihood of this happening.

For the second 2 points, we could perhaps perform an auto-limit. We could write a filter into VV that states

if select_transcripts == 'all':
    Fetch all transcripts
    Loop through and remove all but the latest version of each transcript
    Process on only the latest versions

@ifokkema @leicray @John-F-Wagstaff , thoughts on this please?

@ifokkema , Do we also roll this out to the VF / LOVD processing tool?

We can look at the other aspects of this issues in stages

leicray commented 1 year ago

Related issue:

473 Normalized description does not validate, but unnormalized description does

ifokkema commented 1 year ago

For the second 2 points, we could perhaps perform an auto-limit. We could write a filter into VV that states

if select_transcripts == 'all':
    Fetch all transcripts
    Loop through and remove all but the latest version of each transcript
    Process on only the latest versions

@ifokkema @leicray @John-F-Wagstaff , thoughts on this please?

That makes sense! Any idea if that would indeed help in my example as well? I can't currently retrieve the transcript list since currently, it doesn't handle that request.

@ifokkema , Do we also roll this out to the VF / LOVD processing tool?

Yes, please, that was actually my request. I don't submit genomic variants into the VV endpoint, because I have no way to limit the output and thereby speed up the call. And VV doesn't have the issue when submitting transcript-based variants into it, since it only maps to the genome and doesn't try mapping to other transcripts. So I actually only have the issue with the LOVD endpoint.

Peter-J-Freeman commented 1 year ago

Great stuff. I will try get this built ASAP.

As a temp work around, you can restrict select_transcripts to "select" which will at least filter down the result. If that is not the transcript of interest, you can use the gene2transcript endpoint to look at all available transcripts in the gene.

https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts/CELF2?content-type=application%2Fjson

So, @John-F-Wagstaff, @leicray . Are you happy with this. If select_transcripts is set to all, we do indeed return all transcripts but limited to the latest version. If users want an updated version, they will need re-validate and provide the transcrtipt ID.

This would reduce the number of server errors I believe

John-F-Wagstaff commented 1 year ago

This looks OK to me. With the dataset growing over time this always was going to be an issue we needed to handle, and the Ensembl data set is larger and as such will be more affected too.

We might want to run out a "latest release only" version of the VVTA at some point, at least for some of our endpoints, as this would be more efficient still. Unfortunately this runs into complexities as it only works for some use cases i.e when not using old transcript IDs. There would also be issues with lift-over for the limited dataset too, so your current solution is probably the simplest. We can save the more complex solution as a countermeasure if user numbers or other issues make it necessary.

Peter-J-Freeman commented 1 year ago

Good plan @John-F-Wagstaff. I think the quickest fix is to adjust VV and VF (Think it should be the same code by now anyway) but we can think about this as an additional layer in future.

ifokkema commented 1 year ago

As a temp work around, you can restrict select_transcripts to "select" which will at least filter down the result.

That makes sense, thanks! That will always select something, right? Or are there cases where that would cause there to be no transcripts overlapping a position?

If that is not the transcript of interest, you can use the gene2transcript endpoint to look at all available transcripts in the gene.

Would the latter also work on a genomic position as input? I may not even have the gene when I'm sending off the variant.

So, @John-F-Wagstaff, @leicray . Are you happy with this. If select_transcripts is set to all, we do indeed return all transcripts but limited to the latest version. If users want an updated version, they will need re-validate and provide the transcrtipt ID.

Sounds good!

Peter-J-Freeman commented 1 year ago

Would the latter also work on a genomic position as input? I may not even have the gene when I'm sending off the variant.

No but does work with a transcript ID

That will always select something, right? Or are there cases where that would cause there to be no transcripts overlapping a position?

select will return all transcripts that were considered canonical. These (at least pre MANE) were often the longest transcript isoforms, so a good chance a transcript will be returned. The transcript ID can then be piped into the gene to transcript.

I will try get the fix done next week though. Need some hands on coding time

Peter-J-Freeman commented 1 year ago

Making progress

I have added code to the relevant VV function and for variant NC_000010.10:g.11330462A>G

['NM_001025076.2', 'NM_001025077.3', 'NM_001083591.1', 'NM_001326317.1', 'NM_001326317.2', 'NM_001326318.1', 'NM_001326318.2', 'NM_001326319.1', 'NM_001326319.2', 'NM_001326320.1', 'NM_001326320.2', 'NM_001326321.1', 'NM_001326321.2', 'NM_001326323.1', 'NM_001326323.2', 'NM_001326324.1', 'NM_001326324.2', 'NM_001326325.1', 'NM_001326325.2', 'NM_001326326.1', 'NM_001326326.2', 'NM_001326327.1', 'NM_001326327.2', 'NM_001326328.1', 'NM_001326328.2', 'NM_001326329.1', 'NM_001326329.2', 'NM_001326330.1', 'NM_001326330.2', 'NM_001326331.1', 'NM_001326331.2', 'NM_001326332.1', 'NM_001326332.2', 'NM_001326333.1', 'NM_001326333.2', 'NM_001326334.1', 'NM_001326334.2', 'NM_001326335.1', 'NM_001326335.2', 'NM_001326336.1', 'NM_001326336.2', 'NM_001326337.1', 'NM_001326337.2', 'NM_001326338.1', 'NM_001326338.2', 'NM_001326339.1', 'NM_001326339.2', 'NM_001326340.1', 'NM_001326340.2', 'NM_001326341.1', 'NM_001326341.2', 'NM_001326342.1', 'NM_001326342.2', 'NM_001326343.1', 'NM_001326343.2', 'NM_001326344.1', 'NM_001326344.2', 'NM_001326345.1', 'NM_001326345.2', 'NM_001326346.1', 'NM_001326346.2', 'NM_001326347.1', 'NM_001326348.1', 'NM_001326348.2', 'NM_001326349.1', 'NM_001326349.2', 'NM_006561.3', 'NM_006561.4']

is being filtered to this

['NM_001025076.2', 'NM_001025077.3', 'NM_001083591.1', 'NM_001326317.2', 'NM_001326318.2', 'NM_001326319.2', 'NM_001326320.2', 'NM_001326321.2', 'NM_001326323.2', 'NM_001326324.2', 'NM_001326325.2', 'NM_001326326.2', 'NM_001326327.2', 'NM_001326328.2', 'NM_001326329.2', 'NM_001326330.2', 'NM_001326331.2', 'NM_001326332.2', 'NM_001326333.2', 'NM_001326334.2', 'NM_001326335.2', 'NM_001326336.2', 'NM_001326337.2', 'NM_001326338.2', 'NM_001326339.2', 'NM_001326340.2', 'NM_001326341.2', 'NM_001326342.2', 'NM_001326343.2', 'NM_001326344.2', 'NM_001326345.2', 'NM_001326346.2', 'NM_001326347.1', 'NM_001326348.2', 'NM_001326349.2', 'NM_006561.4']

for variant NC_000010.10:g.11330462A>G

Peter-J-Freeman commented 1 year ago

This code

import json
import VariantValidator
vval = VariantValidator.Validator()
variant = 'NC_000010.10:g.11330462A>G'
genome_build = 'GRCh38'
select_transcripts = 'all'
validate = vval.validate(variant, genome_build, select_transcripts)
validation = validate.format_as_dict(with_meta=True)
print(validation.keys())

returns data for the following

dict_keys(['flag', 'NM_001025076.2:c.830A>G', 'NM_001025077.3:c.902A>G', 'NM_001083591.1:c.830A>G', 'NM_001326317.2:c.830A>G', 'NM_001326318.2:c.830A>G', 'NM_001326319.2:c.830A>G', 'NM_001326320.2:c.830A>G', 'NM_001326321.2:c.830A>G', 'NM_001326323.2:c.830A>G', 'NM_001326324.2:c.830A>G', 'NM_001326325.2:c.995A>G', 'NM_001326326.2:c.938A>G', 'NM_001326327.2:c.938A>G', 'NM_001326328.2:c.830A>G', 'NM_001326329.2:c.830A>G', 'NM_001326330.2:c.830A>G', 'NM_001326331.2:c.902A>G', 'NM_001326332.2:c.902A>G', 'NM_001326333.2:c.218A>G', 'NM_001326334.2:c.830A>G', 'NM_001326335.2:c.902A>G', 'NM_001326336.2:c.902A>G', 'NM_001326337.2:c.902A>G', 'NM_001326338.2:c.569A>G', 'NM_001326339.2:c.569A>G', 'NM_001326340.2:c.923A>G', 'NM_001326341.2:c.923A>G', 'NM_001326342.2:c.923A>G', 'NM_001326343.2:c.923A>G', 'NM_001326344.2:c.830A>G', 'NM_001326345.2:c.830A>G', 'NM_001326346.2:c.218A>G', 'NM_001326347.1:c.854A>G', 'NM_001326348.2:c.830A>G', 'NM_001326349.2:c.830A>G', 'NM_006561.4:c.923A>G', 'metadata'])

I am checking that non of the transcripts are duplicated. Please someone also double check for me before I commit to changing the unit tests to accommodate this.

John-F-Wagstaff commented 1 year ago

I have checked each NM_[ACC_NO].[VER_NO]:c.[LOC]A>G and none of the accession numbers are duplicated, i.e. there are no variants with the same accession and different version numbers.

ifokkema commented 1 year ago

Would the latter also work on a genomic position as input? I may not even have the gene when I'm sending off the variant.

No but does work with a transcript ID

But that's exactly what I meant to get from that API, so I don't have that input.

That will always select something, right? Or are there cases where that would cause there to be no transcripts overlapping a position?

select will return all transcripts that were considered canonical. These (at least pre MANE) were often the longest transcript isoforms, so a good chance a transcript will be returned. The transcript ID can then be piped into the gene to transcript.

I guess that could do in the worst-case scenario.

returns data for the following

dict_keys(['flag', 'NM_001025076.2:c.830A>G', 'NM_001025077.3:c.902A>G', 'NM_001083591.1:c.830A>G', 'NM_001326317.2:c.830A>G', 'NM_001326318.2:c.830A>G', 'NM_001326319.2:c.830A>G', 'NM_001326320.2:c.830A>G', 'NM_001326321.2:c.830A>G', 'NM_001326323.2:c.830A>G', 'NM_001326324.2:c.830A>G', 'NM_001326325.2:c.995A>G', 'NM_001326326.2:c.938A>G', 'NM_001326327.2:c.938A>G', 'NM_001326328.2:c.830A>G', 'NM_001326329.2:c.830A>G', 'NM_001326330.2:c.830A>G', 'NM_001326331.2:c.902A>G', 'NM_001326332.2:c.902A>G', 'NM_001326333.2:c.218A>G', 'NM_001326334.2:c.830A>G', 'NM_001326335.2:c.902A>G', 'NM_001326336.2:c.902A>G', 'NM_001326337.2:c.902A>G', 'NM_001326338.2:c.569A>G', 'NM_001326339.2:c.569A>G', 'NM_001326340.2:c.923A>G', 'NM_001326341.2:c.923A>G', 'NM_001326342.2:c.923A>G', 'NM_001326343.2:c.923A>G', 'NM_001326344.2:c.830A>G', 'NM_001326345.2:c.830A>G', 'NM_001326346.2:c.218A>G', 'NM_001326347.1:c.854A>G', 'NM_001326348.2:c.830A>G', 'NM_001326349.2:c.830A>G', 'NM_006561.4:c.923A>G', 'metadata'])

I am checking that non of the transcripts are duplicated. Please someone also double check for me before I commit to changing the unit tests to accommodate this.

Nice! I don't see any duplications either.

Peter-J-Freeman commented 1 year ago

Ace. I have managed to come up with a solution that disables this during testing. Just running tests now, will add one for this issue then push up

Peter-J-Freeman commented 1 year ago

OK, this is how testing works

import json
import VariantValidator
vval = VariantValidator.Validator()
variant = 'NC_000010.10:g.11330462A>G'
genome_build = 'GRCh38'
select_transcripts = 'all'
vval.testing = True
validate = vval.validate(variant, genome_build, select_transcripts)
validation = validate.format_as_dict(with_meta=True)
print(validation.keys())

Basically it switches off the filtering and returns

dict_keys(['flag', 'NM_001025076.2:c.830A>G', 'NM_001025077.3:c.902A>G', 'NM_001083591.1:c.830A>G', 'NM_001326317.1:c.830A>G', 'NM_001326317.2:c.830A>G', 'NM_001326318.1:c.830A>G', 'NM_001326318.2:c.830A>G', 'NM_001326319.1:c.830A>G', 'NM_001326319.2:c.830A>G', 'NM_001326320.1:c.830A>G', 'NM_001326320.2:c.830A>G', 'NM_001326321.1:c.830A>G', 'NM_001326321.2:c.830A>G', 'NM_001326323.1:c.830A>G', 'NM_001326323.2:c.830A>G', 'NM_001326324.1:c.830A>G', 'NM_001326324.2:c.830A>G', 'NM_001326325.1:c.995A>G', 'NM_001326325.2:c.995A>G', 'NM_001326326.1:c.938A>G', 'NM_001326326.2:c.938A>G', 'NM_001326327.1:c.938A>G', 'NM_001326327.2:c.938A>G', 'NM_001326328.1:c.830A>G', 'NM_001326328.2:c.830A>G', 'NM_001326329.1:c.830A>G', 'NM_001326329.2:c.830A>G', 'NM_001326330.1:c.830A>G', 'NM_001326330.2:c.830A>G', 'NM_001326331.1:c.902A>G', 'NM_001326331.2:c.902A>G', 'NM_001326332.1:c.902A>G', 'NM_001326332.2:c.902A>G', 'NM_001326333.1:c.218A>G', 'NM_001326333.2:c.218A>G', 'NM_001326334.1:c.830A>G', 'NM_001326334.2:c.830A>G', 'NM_001326335.1:c.902A>G', 'NM_001326335.2:c.902A>G', 'NM_001326336.1:c.902A>G', 'NM_001326336.2:c.902A>G', 'NM_001326337.1:c.902A>G', 'NM_001326337.2:c.902A>G', 'NM_001326338.1:c.569A>G', 'NM_001326338.2:c.569A>G', 'NM_001326339.1:c.569A>G', 'NM_001326339.2:c.569A>G', 'NM_001326340.1:c.923A>G', 'NM_001326340.2:c.923A>G', 'NM_001326341.1:c.923A>G', 'NM_001326341.2:c.923A>G', 'NM_001326342.1:c.923A>G', 'NM_001326342.2:c.923A>G', 'NM_001326343.1:c.923A>G', 'NM_001326343.2:c.923A>G', 'NM_001326344.1:c.830A>G', 'NM_001326344.2:c.830A>G', 'NM_001326345.1:c.830A>G', 'NM_001326345.2:c.830A>G', 'NM_001326346.1:c.218A>G', 'NM_001326346.2:c.218A>G', 'NM_001326347.1:c.854A>G', 'NM_001326348.1:c.830A>G', 'NM_001326348.2:c.830A>G', 'NM_001326349.1:c.830A>G', 'NM_001326349.2:c.830A>G', 'NM_006561.3:c.923A>G', 'NM_006561.4:c.923A>G', 'metadata'])

For this variant, the number of transcripts in normal mode is 38, in test mode it is 70. This will certainly cut validation times.

Peter-J-Freeman commented 1 year ago

@ifokkema . Is this better now we only filter for the latest version of the transcript?

ifokkema commented 1 year ago

The test from the report currently runs at 41 seconds, so no failure, but might cause a timeout internally in LOVD. With lift-over off, it takes 20 seconds. With lift-over on but transcripts set to select, it runs at 1.6 seconds. So that's very much acceptable. I should find a way to implement this "select" option when LOVD understands this is a big variant.

ifokkema commented 1 year ago

Note that, it would still be cool if VV realized that changing the transcripts to select would "save" the request. Otherwise, I'll implement it into my LOVD library such that at least the various LOVD services can benefit from this.

Peter-J-Freeman commented 1 year ago

@ifokkema Can you clarify the request. Not quite sure I follow. Are you saying that if the variant is big (lets say >100k) then auto-select the select option and warn? That could work. Would be overridden if specific transcripts are sent or mane / mane select is requested

ifokkema commented 1 year ago

That indeed, or, when VV realizes there are a lot of transcripts overlapping (even when "all" is passed instead of "raw"), in combination with the size of the variant. Basically, any logic where VV realizes "this will take long and likely time out, but we can reduce transcripts and fix this". I agree that specific transcripts, "select", "mane", or "mane_select" should override this feature. So:

I assume that the time VV takes to process the variant is somewhat related to size of variant times the number of transcripts overlapping, since the process needs to be repeated for every transcript, and the process takes longer for larger variants. So perhaps we can simply say that if the size of the variant times the number of transcripts matching is above 5M bases, VV decides to intervene? We should check if the processing time indeed relates to this simple math, and, if so, what the processing time is for 5M bases.

Peter-J-Freeman commented 1 year ago

Thanks for the workflow. Can take a look at this. Will implement something in the API first.

I think that the most simple solution would be something like

Sections marked (*) can be tweaked based on testing. I will write this as a function in the VV engine that can be called by interfaces. Hope to have time over the next few weeks