mozilla / testpilot

Test Pilot is a platform for performing controlled tests of new product concepts in Firefox
https://testpilot.firefox.com/
251 stars 123 forks source link

Build a wrapper for submitExternalPing() #479

Closed clouserw closed 8 years ago

clouserw commented 8 years ago

The way the metrics plan is developing is to use Telemetry to record our measurement data. This means using Telemetry's submitExternalPing(). This issue is to add a function to TestPilot which allows test pilot clients to easily record data with us. (basically a light wrapper around submitExternalPing())

The "type" there should be 'testpilot' and we do want to enable the clientID and environment. The documentation shows what a ping should look like. I think we'll also want TestPilot to add the user agent and a version along with whatever the client sends.

This issue also includes writing documentation for using the new function. I'd suggest putting it at the top of https://wiki.mozilla.org/Test_Pilot/Metrics

mostlygeek commented 8 years ago

There are several ways to get data in. @whd would be the authoritative source for recommendations on how best to send metrics.

  1. via telemetry, as part of the regular FF telemetry upload
  2. http => testpilot => statsd (datadog)
  3. http => testpilot => heka => ElasticSearch => Kibana
  4. http => testpilot => heka => (pipeline?) => ??

Figuring out which of these paths data will flow will help getting the backing service setup.

lmorchard commented 8 years ago

This issue is just about using Telemetry on the client side from experiments. More complete details about our collection strategy are over here in README-METRICS.md.

lmorchard commented 8 years ago

Digging into this and thinking out loud, I think we're missing some detail in how this should work for individual experiments. In fact, the README-METRICS says this:

This document is about metrics about Test Pilot, not metrics about tests within Test Pilot. If you're looking for more information about measuring tests themselves please read the wiki.

There's the main testpilot summary ping, so I don't think we want individual experiments to also each send their own testpilot ping via submitExternalPing. Those pings would collide with the main daily summary.

From reading about Telemetry so far, I think experiments should contribute data to the main summary ping over the course of 24 hours. Something like this (with totally made-up measurements):

  {"tests":
    {"universal_search":
      {"last_enabled": 1457462200,
       "last_disabled": 1457461100,
       "features": {},
       "usage_count": 10,
       "suggestion_service_response_times": [150, 185, 200, 125],
       "cache_misses": 6
      }
  }

From there, I'm wondering what individual experiments will be measuring? That affects what goes into either per-experiment pings or main ping data contributions.

The general "Usage Rate per User" item from Core & Secondary Metrics seems like a candidate for the above - e.g. for a screenshotting test, the key usage event might be fired when a screenshot is taken. That could be captured in a usage_count property. An experiment could send a message to the Test Pilot add-on with every significant usage event, and that usage_count would be incremented.

Other Core & Secondary Metrics look like things we can partially answer with enable / disable timestamps. We might also want to include feedback events in the ping - related to #430. Other questions in there require server-side log events.

Beyond core & secondary metrics, what about custom per-experiment metrics? For example, in Universal Search, I think it was mentioned a few times that we might like to measure how often each position in the search results is clicked - i.e. do users just hit the first item, or do they scroll down and hit one farther down? Should something like that be included in the Telemetry ping?

clouserw commented 8 years ago

I'm expecting another "type" of ping added to the list of pings for test pilot. Then, tests within test pilot would use that type.

I was expecting tests to submit their own pings via this wrapper around submitExternalPing(). I'm not sure what colliding with the main daily summary means...there aren't collisions, everything will be recorded.

Some tests may want to get statistics more often than every 24 hours. Additionally, I think we need to give the telemetry folks a heads up if we want to modify the JSON schema for the test pilot ping (which, if we add new fields for each test and remove fields for all the old tests, seems like it would happen quite often).

For counting screenshots, I would encourage the add-ons to collect that number on the client and submit it periodically like you described, but I think as their own ping. The same goes for the custom per-experiment metrics like universal search's position tracking -- I would collect that data on the client and then submit it as a bundle.

clouserw commented 8 years ago

/cc @rafrombrc for comments on the above

lmorchard commented 8 years ago

I'm not sure what colliding with the main daily summary means...there aren't collisions, everything will be recorded.

Maybe I'm not understanding Telemetry? I'm thinking the collision would be this:

Seems like each experiment needs a metrics ping schema, and each schema needs to be associated with a unique type (e.g. testpilot-universal-search or testpilot-activity-streams). But, I don't think we should submit different shapes & sizes of pings under the same testpilot type.

clouserw commented 8 years ago

The current pings in the telemetry docs have things like metadata: {...} and payload: {...} which I was assuming meant there was some flexibility to those sections. I was thinking it would be something like:

{
  type: "testpilot",
  test: "universal_search"
  version: 1,
  clientId: <UUID>,
  environment: { ... }
  payload: {
     ...
  }
}

And the payload would be whatever that particular test wanted to record. That may not be realistic if Telemetry is going to require strict structures before recording data.

Rob said he'd take a look at this conversation when he was out of his meeting.

rafrombrc commented 8 years ago

My understanding is that payload represents the part of the ping for which we need the JSON schema. We already know the schema of the wrapper, we just need to know the form of your portion of the ping, which is what payload should contain. Providing us with the schema will allow us to do some validation of the input data and to make sure we're parsing everything correctly.

For the "testpilot" only pings this should be pretty straightforward. If we use pings for the testpilot tests it's a bit more complex, though, since we don't know up front what all of the different schemas will be, and they might come and go so frequently as to make managing the schema changes very unwieldy. @mreid-moz knows more about this than me, but I think a good approach would probably be to have another ping type, "testpilottest" maybe, that represents data sent from one of the tests. This would have a "payload" that included a static set of metadata that described the test, and a nested "testdata" JSON object that included the dynamic data that was specific to the test in question. The we could validate the metadata, and for the test-specific data we'd just parse as best we can and pass the results back to you to see if it makes any sense.

clouserw commented 8 years ago

That does make sense. In that case, this issue is about submitting that "testpilottest" type and I need to update the README-METRICS.md to talk about that.

mreid-moz commented 8 years ago

I think it makes sense to separate the results of tests from the instrumentation of the Test Pilot platform itself, so I'm generally in favour of having two different document types here (one for testpilot and another for the results from tests on the platform).

Please consider how you would like to analyze/query the data first though, since if the results will always need to be cross-referenced with the platform data, it may make sense to roll them into a single document. I suspect that the results will be desirable to submit at a different cadence / frequency than the platform info (and maybe even different from one test to another), so again, separate documents will most likely be the best option.

In terms of the schema for the documents themselves, I'll echo Rob that the primary aim is to be able to validate incoming data. So we can start with a schema that contains known structure, and extend it over time as needed.

clouserw commented 8 years ago

I made a PR with the new ping type at https://github.com/mozilla/testpilot/pull/498 . @mreid-moz - would you review/approve?

clouserw commented 8 years ago

Oh, we almost collided there. I think my update addresses your comment, and the different types/cadences makes a lot of sense. r? @rafrombrc also ^

lmorchard commented 8 years ago

Thought feature flipping would be a dependency for this, but I think I can get the API for this working before then at least. Going to grab for 3/25 and see if I can't get it done really quick

ghost commented 8 years ago

@lmorchard it's a small part of the patch, but in https://bugzilla.mozilla.org/show_bug.cgi?id=1255182 bsmedberg is suggesting that each test have its own namespace (vs testpilottest). I kind of like the idea, but don't really have experience here either. if @mreid-moz or @rafrombrc have comments, I'd love to hear them, otherwise I'm leaning that way. As far as this patch, I think that just means instead of hardcoding testpilottest we accept a string from the client.

lmorchard commented 8 years ago

bsmedberg is suggesting that each test have its own namespace (vs testpilottest)

Yeah, that's basically what I was thinking we'd need to do: A separate ping type for the main Test Pilot add-on and a different one for each experiment. (e.g. testpilot-universal-search or testpilot-activity-streams)

By the sound of things in that bug though... maybe some low-volume experiments could share a more generic ping type so we don't have to define a new one every time? I think it depends on what we want to measure and how it's defined in the schema.

lmorchard commented 8 years ago

I think I have a first stab at this in #538, but it could probably use some more Telemetry-expert eyes to be sure I'm doing something close to the real thing

rafrombrc commented 8 years ago

I'm checking w ops to make sure that an ever-changing set of Test Pilot test ping types won't cause problems w ingestion. I fear it might, but I honestly don't know. Will report back when I have more info.

lmorchard commented 8 years ago

900dc17da36bed1a769f8028f6a89e18ed6068cf