Open ifokkema opened 1 year ago
@ifokkema . Thanks for the suggestions.
In the first instance, I want to tweak the code to reduce the likelihood of this happening.
For the second 2 points, we could perhaps perform an auto-limit. We could write a filter into VV that states
if select_transcripts == 'all':
Fetch all transcripts
Loop through and remove all but the latest version of each transcript
Process on only the latest versions
@ifokkema @leicray @John-F-Wagstaff , thoughts on this please?
@ifokkema , Do we also roll this out to the VF / LOVD processing tool?
We can look at the other aspects of this issues in stages
Related issue:
For the second 2 points, we could perhaps perform an auto-limit. We could write a filter into VV that states
if select_transcripts == 'all': Fetch all transcripts Loop through and remove all but the latest version of each transcript Process on only the latest versions
@ifokkema @leicray @John-F-Wagstaff , thoughts on this please?
That makes sense! Any idea if that would indeed help in my example as well? I can't currently retrieve the transcript list since currently, it doesn't handle that request.
@ifokkema , Do we also roll this out to the VF / LOVD processing tool?
Yes, please, that was actually my request. I don't submit genomic variants into the VV endpoint, because I have no way to limit the output and thereby speed up the call. And VV doesn't have the issue when submitting transcript-based variants into it, since it only maps to the genome and doesn't try mapping to other transcripts. So I actually only have the issue with the LOVD endpoint.
Great stuff. I will try get this built ASAP.
As a temp work around, you can restrict select_transcripts to "select" which will at least filter down the result. If that is not the transcript of interest, you can use the gene2transcript endpoint to look at all available transcripts in the gene.
So, @John-F-Wagstaff, @leicray . Are you happy with this. If select_transcripts is set to all, we do indeed return all transcripts but limited to the latest version. If users want an updated version, they will need re-validate and provide the transcrtipt ID.
This would reduce the number of server errors I believe
This looks OK to me. With the dataset growing over time this always was going to be an issue we needed to handle, and the Ensembl data set is larger and as such will be more affected too.
We might want to run out a "latest release only" version of the VVTA at some point, at least for some of our endpoints, as this would be more efficient still. Unfortunately this runs into complexities as it only works for some use cases i.e when not using old transcript IDs. There would also be issues with lift-over for the limited dataset too, so your current solution is probably the simplest. We can save the more complex solution as a countermeasure if user numbers or other issues make it necessary.
Good plan @John-F-Wagstaff. I think the quickest fix is to adjust VV and VF (Think it should be the same code by now anyway) but we can think about this as an additional layer in future.
As a temp work around, you can restrict select_transcripts to "select" which will at least filter down the result.
That makes sense, thanks! That will always select something, right? Or are there cases where that would cause there to be no transcripts overlapping a position?
If that is not the transcript of interest, you can use the gene2transcript endpoint to look at all available transcripts in the gene.
Would the latter also work on a genomic position as input? I may not even have the gene when I'm sending off the variant.
So, @John-F-Wagstaff, @leicray . Are you happy with this. If select_transcripts is set to all, we do indeed return all transcripts but limited to the latest version. If users want an updated version, they will need re-validate and provide the transcrtipt ID.
Sounds good!
Would the latter also work on a genomic position as input? I may not even have the gene when I'm sending off the variant.
No but does work with a transcript ID
That will always select something, right? Or are there cases where that would cause there to be no transcripts overlapping a position?
select will return all transcripts that were considered canonical. These (at least pre MANE) were often the longest transcript isoforms, so a good chance a transcript will be returned. The transcript ID can then be piped into the gene to transcript.
I will try get the fix done next week though. Need some hands on coding time
Making progress
I have added code to the relevant VV function and for variant NC_000010.10:g.11330462A>G
['NM_001025076.2', 'NM_001025077.3', 'NM_001083591.1', 'NM_001326317.1', 'NM_001326317.2', 'NM_001326318.1', 'NM_001326318.2', 'NM_001326319.1', 'NM_001326319.2', 'NM_001326320.1', 'NM_001326320.2', 'NM_001326321.1', 'NM_001326321.2', 'NM_001326323.1', 'NM_001326323.2', 'NM_001326324.1', 'NM_001326324.2', 'NM_001326325.1', 'NM_001326325.2', 'NM_001326326.1', 'NM_001326326.2', 'NM_001326327.1', 'NM_001326327.2', 'NM_001326328.1', 'NM_001326328.2', 'NM_001326329.1', 'NM_001326329.2', 'NM_001326330.1', 'NM_001326330.2', 'NM_001326331.1', 'NM_001326331.2', 'NM_001326332.1', 'NM_001326332.2', 'NM_001326333.1', 'NM_001326333.2', 'NM_001326334.1', 'NM_001326334.2', 'NM_001326335.1', 'NM_001326335.2', 'NM_001326336.1', 'NM_001326336.2', 'NM_001326337.1', 'NM_001326337.2', 'NM_001326338.1', 'NM_001326338.2', 'NM_001326339.1', 'NM_001326339.2', 'NM_001326340.1', 'NM_001326340.2', 'NM_001326341.1', 'NM_001326341.2', 'NM_001326342.1', 'NM_001326342.2', 'NM_001326343.1', 'NM_001326343.2', 'NM_001326344.1', 'NM_001326344.2', 'NM_001326345.1', 'NM_001326345.2', 'NM_001326346.1', 'NM_001326346.2', 'NM_001326347.1', 'NM_001326348.1', 'NM_001326348.2', 'NM_001326349.1', 'NM_001326349.2', 'NM_006561.3', 'NM_006561.4']
is being filtered to this
['NM_001025076.2', 'NM_001025077.3', 'NM_001083591.1', 'NM_001326317.2', 'NM_001326318.2', 'NM_001326319.2', 'NM_001326320.2', 'NM_001326321.2', 'NM_001326323.2', 'NM_001326324.2', 'NM_001326325.2', 'NM_001326326.2', 'NM_001326327.2', 'NM_001326328.2', 'NM_001326329.2', 'NM_001326330.2', 'NM_001326331.2', 'NM_001326332.2', 'NM_001326333.2', 'NM_001326334.2', 'NM_001326335.2', 'NM_001326336.2', 'NM_001326337.2', 'NM_001326338.2', 'NM_001326339.2', 'NM_001326340.2', 'NM_001326341.2', 'NM_001326342.2', 'NM_001326343.2', 'NM_001326344.2', 'NM_001326345.2', 'NM_001326346.2', 'NM_001326347.1', 'NM_001326348.2', 'NM_001326349.2', 'NM_006561.4']
for variant NC_000010.10:g.11330462A>G
This code
import json
import VariantValidator
vval = VariantValidator.Validator()
variant = 'NC_000010.10:g.11330462A>G'
genome_build = 'GRCh38'
select_transcripts = 'all'
validate = vval.validate(variant, genome_build, select_transcripts)
validation = validate.format_as_dict(with_meta=True)
print(validation.keys())
returns data for the following
dict_keys(['flag', 'NM_001025076.2:c.830A>G', 'NM_001025077.3:c.902A>G', 'NM_001083591.1:c.830A>G', 'NM_001326317.2:c.830A>G', 'NM_001326318.2:c.830A>G', 'NM_001326319.2:c.830A>G', 'NM_001326320.2:c.830A>G', 'NM_001326321.2:c.830A>G', 'NM_001326323.2:c.830A>G', 'NM_001326324.2:c.830A>G', 'NM_001326325.2:c.995A>G', 'NM_001326326.2:c.938A>G', 'NM_001326327.2:c.938A>G', 'NM_001326328.2:c.830A>G', 'NM_001326329.2:c.830A>G', 'NM_001326330.2:c.830A>G', 'NM_001326331.2:c.902A>G', 'NM_001326332.2:c.902A>G', 'NM_001326333.2:c.218A>G', 'NM_001326334.2:c.830A>G', 'NM_001326335.2:c.902A>G', 'NM_001326336.2:c.902A>G', 'NM_001326337.2:c.902A>G', 'NM_001326338.2:c.569A>G', 'NM_001326339.2:c.569A>G', 'NM_001326340.2:c.923A>G', 'NM_001326341.2:c.923A>G', 'NM_001326342.2:c.923A>G', 'NM_001326343.2:c.923A>G', 'NM_001326344.2:c.830A>G', 'NM_001326345.2:c.830A>G', 'NM_001326346.2:c.218A>G', 'NM_001326347.1:c.854A>G', 'NM_001326348.2:c.830A>G', 'NM_001326349.2:c.830A>G', 'NM_006561.4:c.923A>G', 'metadata'])
I am checking that non of the transcripts are duplicated. Please someone also double check for me before I commit to changing the unit tests to accommodate this.
I have checked each NM_[ACC_NO].[VER_NO]:c.[LOC]A>G and none of the accession numbers are duplicated, i.e. there are no variants with the same accession and different version numbers.
Would the latter also work on a genomic position as input? I may not even have the gene when I'm sending off the variant.
No but does work with a transcript ID
But that's exactly what I meant to get from that API, so I don't have that input.
That will always select something, right? Or are there cases where that would cause there to be no transcripts overlapping a position?
select will return all transcripts that were considered canonical. These (at least pre MANE) were often the longest transcript isoforms, so a good chance a transcript will be returned. The transcript ID can then be piped into the gene to transcript.
I guess that could do in the worst-case scenario.
returns data for the following
dict_keys(['flag', 'NM_001025076.2:c.830A>G', 'NM_001025077.3:c.902A>G', 'NM_001083591.1:c.830A>G', 'NM_001326317.2:c.830A>G', 'NM_001326318.2:c.830A>G', 'NM_001326319.2:c.830A>G', 'NM_001326320.2:c.830A>G', 'NM_001326321.2:c.830A>G', 'NM_001326323.2:c.830A>G', 'NM_001326324.2:c.830A>G', 'NM_001326325.2:c.995A>G', 'NM_001326326.2:c.938A>G', 'NM_001326327.2:c.938A>G', 'NM_001326328.2:c.830A>G', 'NM_001326329.2:c.830A>G', 'NM_001326330.2:c.830A>G', 'NM_001326331.2:c.902A>G', 'NM_001326332.2:c.902A>G', 'NM_001326333.2:c.218A>G', 'NM_001326334.2:c.830A>G', 'NM_001326335.2:c.902A>G', 'NM_001326336.2:c.902A>G', 'NM_001326337.2:c.902A>G', 'NM_001326338.2:c.569A>G', 'NM_001326339.2:c.569A>G', 'NM_001326340.2:c.923A>G', 'NM_001326341.2:c.923A>G', 'NM_001326342.2:c.923A>G', 'NM_001326343.2:c.923A>G', 'NM_001326344.2:c.830A>G', 'NM_001326345.2:c.830A>G', 'NM_001326346.2:c.218A>G', 'NM_001326347.1:c.854A>G', 'NM_001326348.2:c.830A>G', 'NM_001326349.2:c.830A>G', 'NM_006561.4:c.923A>G', 'metadata'])
I am checking that non of the transcripts are duplicated. Please someone also double check for me before I commit to changing the unit tests to accommodate this.
Nice! I don't see any duplications either.
Ace. I have managed to come up with a solution that disables this during testing. Just running tests now, will add one for this issue then push up
OK, this is how testing works
import json
import VariantValidator
vval = VariantValidator.Validator()
variant = 'NC_000010.10:g.11330462A>G'
genome_build = 'GRCh38'
select_transcripts = 'all'
vval.testing = True
validate = vval.validate(variant, genome_build, select_transcripts)
validation = validate.format_as_dict(with_meta=True)
print(validation.keys())
Basically it switches off the filtering and returns
dict_keys(['flag', 'NM_001025076.2:c.830A>G', 'NM_001025077.3:c.902A>G', 'NM_001083591.1:c.830A>G', 'NM_001326317.1:c.830A>G', 'NM_001326317.2:c.830A>G', 'NM_001326318.1:c.830A>G', 'NM_001326318.2:c.830A>G', 'NM_001326319.1:c.830A>G', 'NM_001326319.2:c.830A>G', 'NM_001326320.1:c.830A>G', 'NM_001326320.2:c.830A>G', 'NM_001326321.1:c.830A>G', 'NM_001326321.2:c.830A>G', 'NM_001326323.1:c.830A>G', 'NM_001326323.2:c.830A>G', 'NM_001326324.1:c.830A>G', 'NM_001326324.2:c.830A>G', 'NM_001326325.1:c.995A>G', 'NM_001326325.2:c.995A>G', 'NM_001326326.1:c.938A>G', 'NM_001326326.2:c.938A>G', 'NM_001326327.1:c.938A>G', 'NM_001326327.2:c.938A>G', 'NM_001326328.1:c.830A>G', 'NM_001326328.2:c.830A>G', 'NM_001326329.1:c.830A>G', 'NM_001326329.2:c.830A>G', 'NM_001326330.1:c.830A>G', 'NM_001326330.2:c.830A>G', 'NM_001326331.1:c.902A>G', 'NM_001326331.2:c.902A>G', 'NM_001326332.1:c.902A>G', 'NM_001326332.2:c.902A>G', 'NM_001326333.1:c.218A>G', 'NM_001326333.2:c.218A>G', 'NM_001326334.1:c.830A>G', 'NM_001326334.2:c.830A>G', 'NM_001326335.1:c.902A>G', 'NM_001326335.2:c.902A>G', 'NM_001326336.1:c.902A>G', 'NM_001326336.2:c.902A>G', 'NM_001326337.1:c.902A>G', 'NM_001326337.2:c.902A>G', 'NM_001326338.1:c.569A>G', 'NM_001326338.2:c.569A>G', 'NM_001326339.1:c.569A>G', 'NM_001326339.2:c.569A>G', 'NM_001326340.1:c.923A>G', 'NM_001326340.2:c.923A>G', 'NM_001326341.1:c.923A>G', 'NM_001326341.2:c.923A>G', 'NM_001326342.1:c.923A>G', 'NM_001326342.2:c.923A>G', 'NM_001326343.1:c.923A>G', 'NM_001326343.2:c.923A>G', 'NM_001326344.1:c.830A>G', 'NM_001326344.2:c.830A>G', 'NM_001326345.1:c.830A>G', 'NM_001326345.2:c.830A>G', 'NM_001326346.1:c.218A>G', 'NM_001326346.2:c.218A>G', 'NM_001326347.1:c.854A>G', 'NM_001326348.1:c.830A>G', 'NM_001326348.2:c.830A>G', 'NM_001326349.1:c.830A>G', 'NM_001326349.2:c.830A>G', 'NM_006561.3:c.923A>G', 'NM_006561.4:c.923A>G', 'metadata'])
For this variant, the number of transcripts in normal mode is 38, in test mode it is 70. This will certainly cut validation times.
@ifokkema . Is this better now we only filter for the latest version of the transcript?
The test from the report currently runs at 41 seconds, so no failure, but might cause a timeout internally in LOVD. With lift-over off, it takes 20 seconds. With lift-over on but transcripts set to select
, it runs at 1.6 seconds. So that's very much acceptable. I should find a way to implement this "select" option when LOVD understands this is a big variant.
Note that, it would still be cool if VV realized that changing the transcripts to select
would "save" the request. Otherwise, I'll implement it into my LOVD library such that at least the various LOVD services can benefit from this.
@ifokkema Can you clarify the request. Not quite sure I follow. Are you saying that if the variant is big (lets say >100k) then auto-select the select option and warn? That could work. Would be overridden if specific transcripts are sent or mane / mane select is requested
That indeed, or, when VV realizes there are a lot of transcripts overlapping (even when "all" is passed instead of "raw"), in combination with the size of the variant. Basically, any logic where VV realizes "this will take long and likely time out, but we can reduce transcripts and fix this". I agree that specific transcripts, "select", "mane", or "mane_select" should override this feature. So:
I assume that the time VV takes to process the variant is somewhat related to size of variant times the number of transcripts overlapping, since the process needs to be repeated for every transcript, and the process takes longer for larger variants. So perhaps we can simply say that if the size of the variant times the number of transcripts matching is above 5M bases, VV decides to intervene? We should check if the processing time indeed relates to this simple math, and, if so, what the processing time is for 5M bases.
Thanks for the workflow. Can take a look at this. Will implement something in the API first.
I think that the most simple solution would be something like
Sections marked (*) can be tweaked based on testing. I will write this as a function in the VV engine that can be called by interfaces. Hope to have time over the next few weeks
Is your feature request related to a problem? Please describe. VariantValidator, when having received too much work for one request, times out. This happens when, e.g.,
The first point may not be recoverable, but options 2 and 3 are. While in option 3, additional options may simply be turned off, in the case of option 2, the user may not be aware of what transcripts overlap the given variant. The user may then not be able to simplify the request without some guidance.
Describe the solution you'd like It would be great if VV could predict that the given task will take too much time. If so, instead of timing out, VV could either:
mane_select
transcripts instead ofall
or turning off liftover) and mention this;Describe alternatives you've considered Internally, LOVD doesn't send variants that are too large to VV, because it expects a timeout. However, we can't always predict this. For instance, LOVD never knows how many transcripts VV knows that overlap a certain position.
Additional context Example: https://rest.variantvalidator.org/LOVD/lovd/hg19/NC_000010.10%3Ag.11330462A%3EG/refseq/all/tx/primary?content-type=application%2Fjson Currently fails even with liftover turned off, succeeds when mapping is turned off or when transcripts are selected.
Related issues:
151 Large variants return internal server errors.
464 Variant NC_000010.10:g.11330462A>G throws an unknown error.
473 Normalized description does not validate, but unnormalized description does.