mozilla-rally / rally-core-addon

Mozilla Public License 2.0
9 stars 13 forks source link

handle pings that go over 1MB limit #637

Open rhelmer opened 3 years ago

rhelmer commented 3 years ago

The Telemetry ingestion servers require custom pings to be under 1MB, for legacy pings rally.js should provide some way to detect (and possibly chunk into multiple pings) if the data goes over this limit.

/cc @Dexterp37 @hamilton @bchivers-stanford

P.S. I am assuming that glean.js might want to handle this itself, so just focusing on the legacy pings in this issue, happy to change this.

rhelmer commented 3 years ago

At the very least I think we should have some kind of safe estimate in sendPing if the current payload + overhead is going to go over 1MB, and throw an exception... chunking automatically sounds nice in theory but it could get unreasonably expensive if we're not thoughtful about it.

@knowtheory how terrible would it be if we simply threw an exception if the ping was too large, and required that the study deal with it, as a starting point? I suspect that automatically splitting pings into multiple batches is going to be a can of worms we'll want to handle separately.

Dexterp37 commented 3 years ago

@knowtheory how terrible would it be if we simply threw an exception if the ping was too large, and required that the study deal with it, as a starting point? I suspect that automatically splitting pings into multiple batches is going to be a can of worms we'll want to handle separately.

Note that automatically splitting pings will not work with Glean, too. I think 1MB is quite a lot of stuff to pack in a ping anyway, especially if compressed!

Dexterp37 commented 3 years ago

At the very least I think we should have some kind of safe estimate in sendPing if the current payload + overhead is going to go over 1MB, and throw an exception... chunking automatically sounds nice in theory but it could get unreasonably expensive if we're not thoughtful about it.

Telemetry already does this and records how frequently this happens.

rhelmer commented 3 years ago

At the very least I think we should have some kind of safe estimate in sendPing if the current payload + overhead is going to go over 1MB, and throw an exception... chunking automatically sounds nice in theory but it could get unreasonably expensive if we're not thoughtful about it.

Telemetry already does this and records how frequently this happens.

That's useful, thanks... this won't throw an exception that the study can catch though right, it'll just be silently discarded (silent from the study PoV I mean, I see it's logged in Firefox)

rhelmer commented 3 years ago

@knowtheory how terrible would it be if we simply threw an exception if the ping was too large, and required that the study deal with it, as a starting point? I suspect that automatically splitting pings into multiple batches is going to be a can of worms we'll want to handle separately.

Note that automatically splitting pings will not work with Glean, too. I think 1MB is quite a lot of stuff to pack in a ping anyway, especially if compressed!

Yes I agree, the issue we're worried about here is that the https://github.com/bchivers-stanford/local_news_demand study wants to collect page contents, which could potentially be quite large.

The main concern I have is whether the study can easily tell that it's trying to send something that won't work, and have a way to do something different.

However given that it's recorded in Firefox per https://github.com/mozilla-rally/rally-core-addon/issues/637#issuecomment-852275308 I suppose we can just measure how often this happens and figure out if/when we need to deal with it?

By design, I don't think we should be able to tell 100% which study is causing these overflows though, if we depend on Firefox telemetry for this. It also doesn't really cover the issue for Chrome (but again that's something we should probably discuss in the context of glean)

rhelmer commented 3 years ago

I'm more worried about being able to tell this is happening and which study is causing it, and being able to let that study know somehow so it can handle it, vs. trying to do some kind of auto-splitting, to be clear :)

Dexterp37 commented 3 years ago

That's useful, thanks... this won't throw an exception that the study can catch though right, it'll just be silently discarded (silent from the study PoV I mean, I see it's logged in Firefox)

Right, it will be discarded, with an error message being log + metrics being recorded for the event.

Yes I agree, the issue we're worried about here is that the https://github.com/bchivers-stanford/local_news_demand study wants to collect page contents, which could potentially be quite large.

My 0.02€: do you have any idea of what's the 90 percentile of the sizes? What's the structure of the data they are trying to collect? 1 mb of compressed TEXT is quite a lot. Do you know already they're going to pierce through that limit?

The main concern I have is whether the study can easily tell that it's trying to send something that won't work, and have a way to do something different.

That specific study could create a Memory Distribution metric in glean and send it over once per day/per startup or so in another ping.

However given that it's recorded in Firefox per #637 (comment) I suppose we can just measure how often this happens and figure out if/when we need to deal with it?

Yes.

By design, I don't think we should be able to tell 100% which study is causing these overflows though, if we depend on Firefox telemetry for this. It also doesn't really cover the issue for Chrome (but again that's something we should probably discuss in the context of glean)

Note that 1 MB upload is quite a bit already: what's the frequency of the upload? Are you sure you're hitting this consistently? What's the 90p of the payload size or what's the average size?

magorlick commented 3 years ago

I feel this issue is a blocker for launching the Stanford study because their design will hit this upper limit quickly and reject the data. LMK if I am not understanding the scope here.

Dexterp37 commented 3 years ago

I feel this issue is a blocker for launching the Stanford study because their design will hit this upper limit quickly and reject the data.

Would it be possible to take a look at the design to better understand the requirement? @magorlick

I don't seem to fully understand the breath of the requirements here: the compressed text content of a web page doesn't really seem to be the thing that hits the limit of 1 MB..

magorlick commented 3 years ago

We have no sense of the distribution of page text sizes because we haven't collected any data like this before. I think there were concerns around pages with infinite scroll depths or other cases. We want to set them up for success at launch by 1) warning them this could be an issue and 2) articulating how to check for it proactively.

Dexterp37 commented 3 years ago

We have no sense of the distribution of page text sizes because we haven't collected any data like this before.

I understand, but until we have some measurements about this, it's unlikely the limit would be bumped as that's a platform wide change.

I think there were concerns around pages with infinite scroll depths or other cases.

Thanks for clarifying. It might be wise breaking down the potential edge cases in the study itself and provide proper measurements about it (e.g. send an occasional ping with summary statistics about what gets encountered).

After we'll have some statistics about the problem we might be able to formulate a different solution or limit on the ingestion side.

magorlick commented 3 years ago

We have error tracking in our legacy telemetry that will help us measure this to understand the scope of the issue. Thus, we are unblocked from launching the study as is.