Incidentally, per the difference in API response format ({ data: [] } vs []), there is an additional tabulator option that may be of use: http://tabulator.info/docs/4.6/data#ajax-response
The reason our API includes a key called data is that sometimes an API needs to return something other than data. For example, some API design guides call for another top-level key, errors, to provide helpful messages when a query goes wrong. (Errors aren't data, so returning data under a nested key leaves room in the design for other features.)
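To make the envelope concrete, here is a minimal sketch of a handler that returns it. Flask is assumed purely for illustration, and the endpoint name and row fields are made up; this is not the project's actual code:

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/range")  # hypothetical endpoint name
def range_endpoint():
    # Placeholder rows; a real handler would run the region query here.
    points = [{"variant": "chr1:12345", "pip": 0.82}]
    # Nesting results under "data" leaves room for sibling top-level keys
    # (e.g. "errors") without changing the shape of successful responses.
    return jsonify({"data": points})
```

On the client side, Tabulator's ajaxResponse callback (linked above) can unwrap the envelope and hand the bare array to the table.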
I incorporated changes to use this callback and removed the redundant API endpoint in 7bc3e74.
Updates in 042f39d: I added the piponly option to the range API (?piponly=True will return only points with non-missing PIP values), along with a comment noting its potential future use: if large regions sending too much data becomes a problem, this filter can serve as a first attempted fix.
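For reference, the filtering logic is conceptually along these lines. This is a hedged sketch, not the actual implementation; the function name and row format are assumptions:

```python
def apply_piponly(rows, piponly):
    """When the piponly flag is set (?piponly=True), drop points whose
    PIP value is missing; otherwise return the rows unchanged."""
    if piponly:
        return [row for row in rows if row.get("pip") is not None]
    return rows

# Usage sketch, assuming Flask-style query args:
#     apply_piponly(parsed_rows, request.args.get("piponly") == "True")
```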
Overall it looks good for now and is OK to merge. Thanks for helping with the questions, and I'm glad the tabulator option worked for you.
It's hard to comment on the strategy for limiting query size/time, because the following variables are hard for me to assess on my laptop:
- How much data (unfiltered) is involved in a typical real query for this use case? If the tabix query + parsing is excessive, we'll still have performance issues. In that case, a long-term solution might prefer to do the filtering in, e.g., a database (reducing the amount of data that Python needs to act on).
This depends on the size of the region, but for a large (1 Mbp) dense region this can run into hundreds of thousands, and possibly millions, of data points. A database is probably a good option anyway once I have some time to work on this; a rough sketch of what that might look like is below.
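Sketch of pushing the filter into a database so that Python never parses the excluded rows (SQLite purely for illustration; the table and column names are assumptions):

```python
import sqlite3

def query_region(db_path, chrom, start, end, piponly=False):
    # Filtering in SQL reduces the amount of data Python needs to act on.
    sql = ("SELECT chrom, pos, variant, pip FROM pip_points "
           "WHERE chrom = ? AND pos BETWEEN ? AND ?")
    if piponly:
        sql += " AND pip IS NOT NULL"
    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql, (chrom, start, end)).fetchall()
```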
- What fraction of data has pip=0? (i.e., what % of the data is eliminated by using this filter?)
The DAPG database contains 21 million data points; the full GTEx v8 database contains somewhere in the range of 10 million points per tissue across 49 tissues, with some missing data, so a little under 500 million data points in total. Filtering by PIP should remove roughly 95% of the points, give or take.
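As a back-of-envelope check of those figures (all numbers are the rough approximations quoted above):

```python
tissues = 49
points_per_tissue = 10_000_000              # ~10M per tissue, some missing
total_points = tissues * points_per_tissue  # 490M: "a little under 500 million"
removed = 0.95                              # rough share removed by the PIP filter
remaining = total_points * (1 - removed)    # ~24.5M points survive
print(f"{total_points:,} total, ~{remaining:,.0f} after filtering")
```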
- Does the PIP have a specific meaning that would lend itself to a more meaningful significance threshold than "present or absent"? (this could tie into the "top hits" concept of the original feature proposal)
Since PIP is a Bayesian posterior probability, whether a given PIP value can be meaningfully interpreted depends on how much you believe the generating model. Setting any kind of threshold for "significance" may be an impossible task (though some variants are obviously more important than others).
That said, if a variant was picked up by DAP-G to be included, then it means there's at least some evidence for the variant being important (or at least in LD with something that's important), and that there's a chance it'll be interesting to a researcher out there somewhere. This is the reason I went with the "present in the PIP database" criterion for the filter.
Thanks for the thoughtful and informative answers; very helpful!
Based on these comments, it sounds like this will help with payload size. It also sounds like an improved backend could yield further gains, by eliminating the roughly 95% of parsing time currently spent on rows that are ultimately filtered out.