mozilla / gud

Mozilla Growth & Usage Dashboard, pronounced "Good"
https://gud.telemetry.mozilla.org

redesign the GUD url scheme #72

Closed: hamilton closed this issue 4 years ago

hamilton commented 4 years ago

GUD needs a shorter, easier-to-use URL scheme. This could open the door for much more expressive querying and utility on both the client and server sides, including arbitrary comparisons between sets of query params. I'm sure this is reinventing someone's wheel, but if done right we might be able to avoid a more involved server component for GUD for another few years.

A few improvements:

1. Remove empty kv pairs such as `country=[]`. If a key is absent we can infer that its default value applies.
2. Consider a short value for each key that maps each value to a single alphanumeric character, for instance `US => u`. This is more meaningful when we can reduce something like All Firefox Desktop Activity to `x`. These alphanumeric shorts are unique to the dimension or metric. `a-zA-Z0-9` contains 62 values, more than enough to represent almost all of these dimensions going forward. In the case of usage criteria, a dimension which could grow beyond 62 values, we could easily use two alphanumerics, yielding 3,844 values, or three, yielding 238,328, just to be safe. In any case the reduction will still be pretty considerable, and dropping a delimiter like a comma here keeps the length short.
3. Dimension names can be short and dependent on the usage criterion specifically, reducing even further. If we follow (2) above, then we can delimit dimension listings with something like `-`, leaving something like the full country specification as `Cugb0`, where `C` means country and the remaining alphanumerics represent individual countries.
4. We can leave startDate and potentially other view filters as-is, since they are not specific to the data itself. For dates, we could easily have `sd414`, representing the number of days since Jan 01 2015 or something like that. We can also change something like `mode=explore` to just be a hash-route.
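As a rough sketch of (2)-(4), a minimal encoder/decoder might look like the following. The short-code table, the `C` dimension tag, and the Jan 01 2015 epoch are illustrative assumptions here, not the real GUD mappings:

```python
from datetime import date, timedelta

EPOCH = date(2015, 1, 1)  # hypothetical epoch, per "days since Jan 01 2015"

# One alphanumeric short per value, unique within the country dimension.
COUNTRY_SHORTS = {"US": "u", "GB": "g", "DE": "d"}
COUNTRY_LONGS = {v: k for k, v in COUNTRY_SHORTS.items()}

def encode_date(d):
    """Encode a date as days since the epoch, e.g. for sd/ed params."""
    return str((d - EPOCH).days)

def decode_date(n):
    return EPOCH + timedelta(days=int(n))

def encode_countries(countries):
    """'C' tags the dimension; an empty list is omitted entirely."""
    return "C" + "".join(COUNTRY_SHORTS[c] for c in countries) if countries else ""

def decode_countries(token):
    assert token.startswith("C")
    return [COUNTRY_LONGS[ch] for ch in token[1:]]
```

A round trip like `decode_countries(encode_countries(["DE", "GB", "US"]))` returns the original list, and every non-default dimension costs one tag character plus one character per value.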

examples of compression

1

?startDate=2017-06-17&endDate=2020-04-04&mode=explore&usage=Any Firefox Desktop Activity&attributed=[]&metric=all&os=[]&language=[]&country=[]&channel=[] (153 chars)

#explore/?sd905&ed1032&v=e&q=Ufda (33 chars, ~21% the size of the original)

2

/?startDate=2017-06-17&endDate=2020-04-04&mode=explore&usage=Any Firefox Desktop Activity&attributed=["TRUE"]&metric=all&os=["Windows_NT"]&language=[]&country=["DE"%2C"GB"%2C"US"]&channel=[] (190 chars)

#/explore/?sd905&ed1032&q=Ufda-AL-Ow-CdgU (41 chars, ~21% of the original)
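For what it's worth, assembling and parsing a compact route like the ones above is only a few lines; the token order and names here just mirror example 1 and are assumptions, not a spec:

```python
def build_route(mode, tokens):
    """Join pre-encoded tokens into a hash route, with the mode as the hash path."""
    return "#" + mode + "/?" + "&".join(tokens)

def parse_route(route):
    """Split a hash route back into its mode and token list."""
    mode, _, query = route.lstrip("#").partition("/?")
    return mode, query.split("&")
```

`build_route("explore", ["sd905", "ed1032", "v=e", "q=Ufda"])` yields the 33-character string from example 1, and `parse_route` inverts it.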

hamilton commented 4 years ago

cc @jklukas – any input on this from a design point of view? From the querying end, this entails a library that creates the compression / decompression function, so it will be completely invisible to the other bits of data engineering, mostly just looking for feedback.

This may be totally unnecessary for the foreseeable future, but it would be worth talking about practically if being able to pass multiple queries like this to the server (and thus allow multiple queries to hit the tables to allow arbitrary comparisons across sets of dimensions) is something we can enable cheaply. In other words, let's say I can send three queries at once to the server. Would all three of those queries run at the same time, and would they return roughly around the same time, or would they have to be enqueued?

jklukas commented 4 years ago

> cc @jklukas – any input on this from a design point of view? From the querying end, this entails a library that creates the compression / decompression function, so it will be completely invisible to the other bits of data engineering, mostly just looking for feedback.

> This may be totally unnecessary for the foreseeable future, but it would be worth talking about practically if being able to pass multiple queries like this to the server (and thus allow multiple queries to hit the tables to allow arbitrary comparisons across sets of dimensions) is something we can enable cheaply.

It seems desirable to avoid this additional complexity until we're forced into it. What you've laid out seems reasonable. I don't know if we ever expect use cases where folks would want to produce URLs programmatically; it seems like that would become significantly more difficult with this kind of shortening.

> In other words, let's say I can send three queries at once to the server. Would all three of those queries run at the same time, and would they return roughly around the same time, or would they have to be enqueued?

We could send all three queries at once to BQ and they would likely all execute in parallel; there's a possibility of queuing, but we're allowed 100 concurrent queries per project, so it's probably not relevant. I'd expect we'd mostly just move a bit further down the latency tail, since we'll be limited by whichever of the queries happens to take the longest to return.
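That wall-clock behavior is easy to sketch with simulated latencies; the sleeps below stand in for real BigQuery calls:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_query(latency_s):
    """Stand-in for a BigQuery call with the given simulated latency."""
    time.sleep(latency_s)
    return latency_s

latencies = [0.05, 0.1, 0.2]
start = time.monotonic()
# Issue all queries at once; results come back in submission order.
with ThreadPoolExecutor(max_workers=len(latencies)) as pool:
    results = list(pool.map(run_query, latencies))
elapsed = time.monotonic() - start
# elapsed tracks the slowest query (~0.2s), not the sum (~0.35s).
```

So sending N param sets at once costs roughly max(latencies) rather than sum(latencies), as long as we stay under the concurrency quota.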

hamilton commented 4 years ago

Thanks for the feedback. Thinking further, you're right that it's probably not worth the savings here. The maximum query string length in a URL is 1024 chars. If we are in arbitrary-comparison territory (which we could be in the next few months), that can support probably ~3-4 full param sets. wontfix + YAGNI for now in the interest of time.