Open · rhelmer opened this issue 3 years ago
At the very least I think we should have some kind of safe estimate in `sendPing`: if the current payload + overhead is going to go over 1MB, throw an exception. Chunking automatically sounds nice in theory, but it could get unreasonably expensive if we're not thoughtful about it.
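A minimal sketch of what such a pre-send check might look like; `sendPing`, `PingTooLargeError`, and the overhead margin are illustrative assumptions, not the actual `rally.js` API:

```js
// Illustrative pre-send size check; the overhead margin is a guess, and the
// estimate is of the uncompressed JSON payload.
const MAX_PING_BYTES = 1024 * 1024;  // ingestion limit for custom pings
const OVERHEAD_BYTES = 16 * 1024;    // rough allowance for envelope/metadata

class PingTooLargeError extends Error {}

function estimatePayloadBytes(payload) {
  // UTF-8 byte length of the serialized payload.
  return new TextEncoder().encode(JSON.stringify(payload)).length;
}

async function sendPing(payloadType, payload) {
  const size = estimatePayloadBytes(payload);
  if (size + OVERHEAD_BYTES > MAX_PING_BYTES) {
    throw new PingTooLargeError(
      `"${payloadType}" payload is ~${size} bytes; the limit is ${MAX_PING_BYTES} including overhead`
    );
  }
  // ... hand the payload off to the existing submission path ...
}
```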
@knowtheory how terrible would it be if we simply threw an exception if the ping was too large, and required that the study deal with it, as a starting point? I suspect that automatically splitting pings into multiple batches is going to be a can of worms we'll want to handle separately.
> @knowtheory how terrible would it be if we simply threw an exception if the ping was too large, and required that the study deal with it, as a starting point? I suspect that automatically splitting pings into multiple batches is going to be a can of worms we'll want to handle separately.
Note that automatically splitting pings will not work with Glean either. I think 1MB is quite a lot of stuff to pack in a ping anyway, especially if compressed!
> At the very least I think we should have some kind of safe estimate in `sendPing`: if the current payload + overhead is going to go over 1MB, throw an exception. Chunking automatically sounds nice in theory, but it could get unreasonably expensive if we're not thoughtful about it.
Telemetry already does this and records how frequently this happens.
> > At the very least I think we should have some kind of safe estimate in `sendPing`: if the current payload + overhead is going to go over 1MB, throw an exception. Chunking automatically sounds nice in theory, but it could get unreasonably expensive if we're not thoughtful about it.
>
> Telemetry already does this and records how frequently this happens.
That's useful, thanks... this won't throw an exception that the study can catch though, right? It'll just be silently discarded (silent from the study's PoV, I mean; I see it's logged in Firefox).
> > @knowtheory how terrible would it be if we simply threw an exception if the ping was too large, and required that the study deal with it, as a starting point? I suspect that automatically splitting pings into multiple batches is going to be a can of worms we'll want to handle separately.
>
> Note that automatically splitting pings will not work with Glean either. I think 1MB is quite a lot of stuff to pack in a ping anyway, especially if compressed!
Yes, I agree. The issue we're worried about here is that the https://github.com/bchivers-stanford/local_news_demand study wants to collect page contents, which could potentially be quite large.
The main concern I have is whether the study can easily tell that it's trying to send something that won't work, and have a way to do something different.
However, given that it's recorded in Firefox per https://github.com/mozilla-rally/rally-core-addon/issues/637#issuecomment-852275308, I suppose we can just measure how often this happens and figure out if/when we need to deal with it?
By design, I don't think we should be able to tell 100% which study is causing these overflows, though, if we depend on Firefox telemetry for this. It also doesn't really cover the issue for Chrome (but again, that's something we should probably discuss in the context of Glean).
I'm more worried about being able to tell this is happening and which study is causing it, and being able to let that study know somehow so it can handle it, vs. trying to do some kind of auto-splitting, to be clear :)
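To make the "let the study handle it" part concrete, here is a rough study-side sketch. It assumes `rally.js` threw something like the hypothetical `PingTooLargeError` from the earlier sketch; the truncate-and-retry fallback is just one illustrative option, not a recommendation:

```js
// Hypothetical study-side handling; `rally.sendPing` and `PingTooLargeError`
// refer to the sketch above, not the real rally.js API.
const MAX_TEXT_CHARS = 500_000; // illustrative cap on a single page's text

async function reportPage(rally, pageRecord) {
  try {
    await rally.sendPing("page-content", pageRecord);
  } catch (err) {
    if (err instanceof PingTooLargeError) {
      // Fall back: truncate the page text (rough character-based cap) and
      // flag the record so the analysis knows it was cut short.
      await rally.sendPing("page-content", {
        ...pageRecord,
        text: pageRecord.text.slice(0, MAX_TEXT_CHARS),
        textTruncated: true,
      });
    } else {
      throw err;
    }
  }
}
```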
> That's useful, thanks... this won't throw an exception that the study can catch though, right? It'll just be silently discarded (silent from the study's PoV, I mean; I see it's logged in Firefox).
Right, it will be discarded, with an error message logged and metrics recorded for the event.
> Yes, I agree. The issue we're worried about here is that the https://github.com/bchivers-stanford/local_news_demand study wants to collect page contents, which could potentially be quite large.
My 0.02€: do you have any idea what the 90th percentile of the sizes is? What's the structure of the data they are trying to collect? 1 MB of compressed text is quite a lot. Do you already know they're going to pierce through that limit?
> The main concern I have is whether the study can easily tell that it's trying to send something that won't work, and have a way to do something different.
That specific study could create a memory distribution metric in Glean and send it over once per day/per startup or so in another ping.
> However, given that it's recorded in Firefox per #637 (comment), I suppose we can just measure how often this happens and figure out if/when we need to deal with it?
Yes.
> By design, I don't think we should be able to tell 100% which study is causing these overflows, though, if we depend on Firefox telemetry for this. It also doesn't really cover the issue for Chrome (but again, that's something we should probably discuss in the context of Glean).
Note that a 1 MB upload is quite a bit already: what's the frequency of the upload? Are you sure you're hitting this consistently? What's the 90th percentile of the payload size, or the average size?
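One way to start answering those questions is sketched below: keep the raw byte sizes on the study side and compute percentiles from them. This stands in for the memory distribution suggestion above without assuming a specific Glean API; all names are illustrative.

```js
// Track serialized page-text sizes so questions like "what's the 90th
// percentile?" can be answered before deciding anything about the limit.
const observedSizes = [];

function recordPageTextSize(pageText) {
  const bytes = new TextEncoder().encode(pageText).length;
  observedSizes.push(bytes);
}

function percentile(values, p) {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const index = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[index];
}

// e.g. percentile(observedSizes, 90) -> approximate 90th-percentile size in bytes
```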
I feel this issue is a blocker for launching the Stanford study, because their design will hit this upper limit quickly and the data will be rejected. LMK if I am not understanding the scope here.
> I feel this issue is a blocker for launching the Stanford study, because their design will hit this upper limit quickly and the data will be rejected.
Would it be possible to take a look at the design to better understand the requirement? @magorlick
I don't seem to fully understand the breadth of the requirements here: the compressed text content of a web page doesn't really seem like something that would hit the 1 MB limit.
We have no sense of the distribution of page text sizes because we haven't collected any data like this before. I think there were concerns around pages with infinite scroll depths or other cases. We want to set them up for success at launch by 1) warning them this could be an issue and 2) articulating how to check for it proactively.
> We have no sense of the distribution of page text sizes because we haven't collected any data like this before.
I understand, but until we have some measurements about this, it's unlikely the limit will be bumped, as that's a platform-wide change.
> I think there were concerns around pages with infinite scroll depths or other cases.
Thanks for clarifying. It might be wise to break down the potential edge cases in the study itself and provide proper measurements for them (e.g. send an occasional ping with summary statistics about what gets encountered).
Once we have some statistics about the problem, we might be able to formulate a different solution or limit on the ingestion side.
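Building on the size-tracking sketch above, an occasional summary ping could look roughly like the following; the ping name, field names, and daily alarm are assumptions, not an agreed schema:

```js
// Hypothetical daily summary ping built from `observedSizes` above.
// `rally.sendPing` and the ping/field names are illustrative only.
browser.alarms.create("page-size-summary", { periodInMinutes: 24 * 60 });

browser.alarms.onAlarm.addListener(async (alarm) => {
  if (alarm.name !== "page-size-summary" || observedSizes.length === 0) {
    return;
  }
  const summary = {
    pageCount: observedSizes.length,
    maxBytes: Math.max(...observedSizes),
    p90Bytes: percentile(observedSizes, 90),
    meanBytes: Math.round(
      observedSizes.reduce((a, b) => a + b, 0) / observedSizes.length
    ),
  };
  await rally.sendPing("page-size-summary", summary);
  observedSizes.length = 0; // reset for the next reporting period
});
```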
We have error tracking in our legacy telemetry that will help us measure this to understand the scope of the issue. Thus, we are unblocked from launching the study as is.
The Telemetry ingestion servers require custom pings to be under 1MB. For legacy pings, `rally.js` should provide some way to detect (and possibly chunk into multiple pings) if the data goes over this limit.
/cc @Dexterp37 @hamilton @bchivers-stanford
P.S. I am assuming that glean.js might want to handle this itself, so I'm just focusing on the legacy pings in this issue; happy to change this.
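If the "possibly chunk into multiple pings" option were ever pursued for legacy pings, a naive batching approach could look like the sketch below. Note that the discussion above leans toward detect-and-throw instead, and splitting doesn't fit Glean; names and margins here are illustrative.

```js
// Naive batching sketch: group records so each batch serializes under the
// limit. Illustrative only; it re-serializes on every step, and a single
// record larger than the limit still has to be rejected outright.
function chunkRecords(records, maxBytes = 1024 * 1024 - 16 * 1024) {
  const byteLength = (value) =>
    new TextEncoder().encode(JSON.stringify(value)).length;
  const batches = [];
  let current = [];

  for (const record of records) {
    if (byteLength(record) > maxBytes) {
      throw new Error("single record exceeds the per-ping limit");
    }
    if (current.length > 0 && byteLength([...current, record]) > maxBytes) {
      batches.push(current);
      current = [];
    }
    current.push(record);
  }
  if (current.length > 0) {
    batches.push(current);
  }
  return batches;
}

// Each batch would then be submitted as its own ping, e.g.:
//   for (const batch of chunkRecords(pending)) {
//     await rally.sendPing("page-content-batch", { records: batch });
//   }
```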