mozilla / fxa-amplitude-send

Data pipeline scripts for importing Firefox Accounts event data to Amplitude.
Mozilla Public License 2.0

Work out how to get Sync events into Amplitude #24

Closed · philbooth closed this 6 years ago

philbooth commented 6 years ago

The taxonomy defines a bunch of events related to Sync activity. We need to work out how we're going to get those events from Sync into Amplitude.

philbooth commented 6 years ago

Brain dump from conversation with Thom on the Sync team:

Plus some other notes from #datapipeline:

philbooth commented 6 years ago

From IRC:

16:36:06 <pb> jbuck: hey, amplitude question for you...
16:36:22 <pb> jbuck: do you know whether the FXA_AMPLITUDE_HMAC_KEY we use in our amplitude script the same as or different to the fxa.metrics_uid_secret_key that the tokenserver uses to hash stuff with here:
16:36:29 <pb> https://github.com/mozilla-services/tokenserver/blob/master/tokenserver/views.py#L117
16:36:52 <jbuck> different, and I see why that could be a problem :)
16:36:59 <pb> jbuck: :)
16:37:06 <jbuck> this for sync sending amplitude metrics, right?
16:37:10 <pb> jbuck: yep
16:37:15 <jbuck> okay

So as it stands, the uids from those parquet files will not match up with the uids we're sending for FxA data at the moment.
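
For context, the hashing in question is a keyed HMAC over the uid. Here's a minimal sketch, assuming HMAC-SHA256 (the exact construction in tokenserver and in this repo may differ); the key strings are placeholders:

```python
import hashlib
import hmac

def hash_uid(uid: str, key: str) -> str:
    """HMAC the raw uid with a secret key and return the hex digest."""
    return hmac.new(key.encode("utf-8"), uid.encode("utf-8"), hashlib.sha256).hexdigest()

uid = "0123456789abcdef0123456789abcdef"  # example raw FxA uid

# The same uid hashed with two different keys produces unrelated
# digests, so datasets hashed with different keys can't be joined.
print(hash_uid(uid, "value of FXA_AMPLITUDE_HMAC_KEY"))
print(hash_uid(uid, "value of fxa.metrics_uid_secret_key"))
```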

Is it okay to break continuity with the Amplitude data we've collected so far, and change the hash key used in this repo so that we can have Sync event data that matches up with FxA event data? Seems like a small amount of pain to take now in exchange for the unsurpassed euphoria of joined-up metrics analysis we'd have in the future.

philbooth commented 6 years ago

@davismtl, as per the preceding comment, currently the uid we send to Amplitude is hashed with a different key to the uid from Sync telemetry. I assumed this meant we'd have to change the key currently used for Amplitude, which would cause discontinuity in our event data. Annoying but not the end of the world at this point, as there's only a month or whatever of data in there.

However, this morning @rfk mentioned that changing the hash key in the token server might not be a huge problem for the Sync folks, in which case we'd be able to preserve continuity in Amplitude (at the cost of continuity for Sync metrics). Tbh I'm ambivalent about which option is best. Should we have a 3-way conversation between me, you and @thomcc / @mhammond to agree on the least bad option for everyone?

philbooth commented 6 years ago

Apologies to @irrationalagent, I should have mentioned you in the preceding comment about Sync metrics uid too.

philbooth commented 6 years ago

More from IRC:

15:57:23 <tcsc> pb: i feel like rfk is in the best position to answer questions about the difficulty of changing things about the tokenserver
15:57:44 <tcsc> the client literally just passes hashed_fxa_uid from the tokenserver into the ping
15:59:00 <pb> tcsc: we did chat about it this morning and he seemed pretty relaxed about it as a change. the only visible effect should be in sync metrics?
15:59:47 <tcsc> the client literally just passes hashed_fxa_uid from the tokenserver into the ping?
16:00:35 <tcsc> ah, yeah, i think there are some charts on redash it might break
16:01:01 <tcsc> leifoines: do you use the uid in any redash queries?
16:02:35 <tcsc> i think most of the charts that use them don't work so it might be fine, but i'm also not sure if the validation charts use it, and how bothered we'd be if it had a big discontinuity
16:04:13 <tcsc> (my suspicion is we wouldn't care, but leifoines would know better than i would
16:05:56 <pb> tcsc: thanks for the info, i'll badger leif about it :)
16:16:36 <leifoines> i do use the uid field to count unique users. if we need to change something so theres a temporary gap in data, then i think it should be fine (sorry if im slow to respond im at onboarding)
16:18:19 <tcsc> hmmm
16:18:53 <leifoines> is there a github issue or bug that outlines what you are talking about
16:19:12 <tcsc> https://github.com/mozilla/fxa-amplitude-send/issues/24#issuecomment-342861090
16:19:18 <leifoines> thx
16:19:44 <tcsc> specifically there's a reason to change the key hashed_fxa_uid is hashed with
16:19:46 <tcsc> i gues
16:19:53 <tcsc> and that will change all the UIDs and device ids
16:23:04 <leifoines> so am i right in inferring that either the sync data or the fxa amplitude data will have a discontinuity, and we have to choose?
16:23:29 <leifoines> or do nothing, which doesnt really seem like a choice because we def want consistent uids across sync and fxa
16:25:00 <tcsc> that sounds about right
16:25:22 <leifoines> my first gut instinct is to do things so it preserves the data for amplitude, as we are currently tracking a weird decrease in registration rates that I think we'd like to keep an eye on without interruption
16:25:36 <leifoines> (ADavis can weigh in on that)
16:25:52 <tcsc> pb: is there no way to add the sync-style hashed uid to the amplitude data?
16:26:08 <tcsc> all i really know about amplitude is that it has lots of columns and adding new ones isn't too hard
16:26:37 <tcsc> but i could believe that backfilling that is hard or impossible
16:27:54 <tcsc> leifoines: the only real concern i have is that it will make determining things with the event telemetry data very hard in redash, and it will make determining introduction of validation errors over time hard
16:28:11 <tcsc> but both of those are graphs we actually don't have? and maybe the first one doesnt matter since the data will be in amplitude
16:28:51 <leifoines> we have a version of the introduction rate but im not sure its accurate at this point
16:28:56 <tcsc> right
16:29:10 <leifoines> ok i gtg, ill check back on this when i have a spare moment
16:29:16 <tcsc> i think we have two versions and neither agree
16:29:29 <tcsc> and tbh neither really make sense either in terms of numbers
16:35:56 <pb> tcsc, re: using an extra column, amplitude does special magic with the user_id property to link events together, so whichever we go with will be set as user_id. an extra column wouldn't help in that way
16:36:03 <tcsc> oh i see
16:36:07 <tcsc> that's a shame
16:36:25 <pb> tcsc, leifoines: i don't think this is a super-urgent decision btw, there's no need for us to rush into anything
16:36:49 <tcsc> i can't think of anything we could easily do on the client to support this
16:36:58 <tcsc> well, it probably doesnt matter anyway
16:37:21 <tcsc> changing the uid is probably fine, the cases it will break probably aren't the really important ones anyway
16:39:20 <tcsc> and yeah there's no rush, although if we were going to change it, doing it relatively sooner is better since it would mean that when we are doing the repair work (in Q1 AIUI we plan on it, since we're moving repair code to a system addon) we'd be able to look at introduction rates.
16:39:30 <tcsc> but there is plenty of time
16:39:39 <tcsc> e.g. months.
16:43:54 <pb> 👍
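
To expand on pb's point above about user_id: Amplitude stitches events into per-user streams based on the user_id (and device_id) fields of each submitted event, so an extra property is easy to add but won't link anything. A rough sketch of an event payload, assuming Amplitude's HTTP API format (the event type and field values are illustrative):

```python
import json

# Whichever hashed uid we pick has to go in user_id; Amplitude links
# events into per-user streams via user_id (and device_id), so a
# second uid tucked into event_properties would be easy to add but
# would not join anything.
event = {
    "user_id": "hashed-fxa-uid",          # the identity-bearing field
    "device_id": "hashed-device-id",
    "event_type": "sync - tab_sent",      # illustrative event name
    "time": 1518652800000,                # event time, ms since epoch
    "event_properties": {
        "alternate_hashed_uid": "sync-style-hashed-uid",  # inert for identity
    },
}
print(json.dumps(event, indent=2))
```
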
davismtl commented 6 years ago

What version of the uid is Marketing using in SalesForce?

If they are already aligned with FxA, I would propose that Sync aligns with FxA too.

philbooth commented 6 years ago

@davismtl, we don't have the data yet but I think it will be the raw, unhashed uid we get in the SalesForce data, which we could hash either way. Not entirely sure why I think that now though, so I may be wrong. 😕

davismtl commented 6 years ago

Please double check with Bniolet. I think that whatever choice we make will be based on making the smallest number of changes and breaking as few data timelines as possible.

mhammond commented 6 years ago

IIUC, the entire point of us getting that hashed ID from the token server was so that we'd use the same ID as other telemetry used by FxA. Has that other existing telemetry now gone away?

We do occasionally run queries that would break if the UID changes - some ad-hoc ones around validation, and Thom mentions some redash ones. While I don't think that would be the end of the world, changing Amplitude while it remains so new seems the most prudent approach if the cost isn't significantly different.

rfk commented 6 years ago

we don't have the data yet but I think it will be the raw, unhashed uid we get in the SalesForce data

I am 100% confident it will be the raw unhashed FxA uid.

rfk commented 6 years ago

IIUC, the entire point of us getting that hashed ID from the token server was so that we do use the same ID as other telemetry used by FxA.

I don't recall this specifically, but maybe we arranged it so that the uids reported by sync telemetry would match up with the hashed uids in the FxA redshift database? @jbuck can probably find out whether the hmac keys used in both places are the same.

My recollection of the main reason for using the hashed ID was simply so that we could use some stable identifier attached to the account, without being able to correlate the telemetry pings directly back to the user's account record in the db.

mhammond commented 6 years ago

My recollection of the main reason for using the hashed ID

https://bugzilla.mozilla.org/show_bug.cgi?id=1288445 has more context.

rfk commented 6 years ago

Where the bug says "It turns out that sql.tmo only has these hashed", I suspect it was referring to the redash metrics emitted by the sync server, rather than the fxa metrics. Either way, we should definitely check whether tokenserver is currently using the same secret key as the fxa redash metrics.

philbooth commented 6 years ago

Either way, we should definitely check whether tokenserver is currently using the same secret key as the fxa redash metrics.

I don't have access to the mozilla-services org any more but in the old local copy of puppet-config that still exists on my machine (is it okay that I still have that?), it's set in fxa_flow_csv.lua to a value from config, analysis_hmac_key.

@jbuck, do you know if the analysis_hmac_key variable in fxa.toml.erb from puppet-config has the same value as fxa.metrics_uid_secret_key in the token server?

philbooth commented 6 years ago

@irrationalagent pulled some uids out of Sync data and FxA data in redash, and there didn't appear to be any crossover. So we may need to break continuity in 2 out of 3 places.

@davismtl, this is the issue we were discussing during the OKR updates just now.

philbooth commented 6 years ago

Moving this out of the active column but hopefully I can wrap it up in/after Austin. Code is mostly done, here:

https://github.com/mozilla/fxa-amplitude-send/compare/phil/sync?expand=1

philbooth commented 6 years ago

Regarding the HMAC key for the token server, I'm going to start the pi-request process in a second to co-ordinate testing on stage.

In addition to that, we should add automated tests to this repo and the token server that use known values and keys and ensure the HMAC result is what we expect in both places.
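
As a sketch of what such a known-answer test could look like (assuming HMAC-SHA256 and obviously fake inputs; the pinned digest is a placeholder, to be filled in from whichever implementation we treat as canonical):

```python
import hashlib
import hmac
import unittest

def hash_uid(uid, key):
    # Assumed construction: HMAC-SHA256 over the raw uid (the real
    # implementations here and in tokenserver may differ in detail).
    return hmac.new(key.encode("utf-8"), uid.encode("utf-8"), hashlib.sha256).hexdigest()

# Known inputs shared with the tokenserver test suite.
KNOWN_UID = "0123456789abcdef0123456789abcdef"
KNOWN_KEY = "not-a-real-key"
# Placeholder: pin a literal hex digest here, recorded once from the
# canonical implementation, and pin the same literal in tokenserver.
EXPECTED_DIGEST = hash_uid(KNOWN_UID, KNOWN_KEY)

class HmacKnownAnswerTest(unittest.TestCase):
    def test_known_value(self):
        # If both repos pin the same digest for the same inputs, any
        # divergence in hashing (or key handling) fails the test.
        self.assertEqual(hash_uid(KNOWN_UID, KNOWN_KEY), EXPECTED_DIGEST)

if __name__ == "__main__":
    unittest.main()
```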

rfk commented 6 years ago

From mtg: we should consider putting these into a different project in Amplitude; it's now possible to do cross-project views of data.

philbooth commented 6 years ago

Just to wrap this up, these events can now be seen in the prod Amplitude project:

[Screenshot: a chart in Amplitude showing events from Sync telemetry data]

https://analytics.amplitude.com/org/31251/chart/o7js9en

davismtl commented 6 years ago

@philbooth how often do the values get refreshed? Daily?

davismtl commented 6 years ago

@philbooth when comparing to Re:Dash, the numbers really don't line up. Amplitude has about 1% of events.

https://sql.telemetry.mozilla.org/dashboard/sync-events_1

philbooth commented 6 years ago

how often do the values get refreshed? Daily?

Yep. Currently the cron job runs at 01:45 UTC each day.
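
For reference, that schedule as a crontab entry would look something like this (the invocation is hypothetical):

```
# run the Sync import daily at 01:45 UTC (script name is illustrative)
45 1 * * * node import-sync-events.js
```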

when comparing to Re:Dash, the numbers really don't line up. Amplitude has about 1% of events.

Okay, I'll dig into it and report back. Fortunately this is easier to debug than the FxA stuff, because I have access to the raw data. :)

philbooth commented 6 years ago

@davismtl, from that dashboard, I'm going to assume the chart you are referring to is this one:

https://sql.telemetry.mozilla.org/queries/50128/source#134679

That chart groups the counts by submission date rather than by the server timestamp associated with each event. The submission date only represents when the data landed in S3. Looking at the raw data, each submission date includes events that actually occurred over a range of dates, mostly on or near the submission date but with a long-ish tail going further back (this is why you can see some events for December and January in Amplitude, even though I only started submitting data yesterday).
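
A toy illustration of the difference between the two groupings, with simplified column names (a minimal pandas sketch, not the actual Redash query):

```python
import pandas as pd

# Toy stand-in for the Sync event data: submission_date is when the
# ping landed in S3, server_time is when the event actually occurred.
events = pd.DataFrame({
    "submission_date": ["2018-02-15", "2018-02-15", "2018-02-15"],
    "server_time": ["2018-02-15", "2018-02-14", "2018-01-30"],
})

# Redash-style count: every event attributed to its landing date.
print(events.groupby("submission_date").size())

# Amplitude-style count: events attributed to when they occurred,
# which spreads one submission date across a tail of earlier dates.
print(events.groupby("server_time").size())
```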

So doing a direct comparison between that chart and Amplitude is tricky, but we know there are the equivalent of 2 submission dates in Amplitude right now. From those 2 submission dates we have ~2,800 tab sent events and ~1,200 tab received events, so let's call the respective rates over the last 2 days ~1,400 per day (sent) and ~600 per day (received). That makes the counts in Amplitude about 50% of the counts in Redash - a significant discrepancy, but at least it's better than 1%.

I'm still digging in, just wanted to flag that part as I found it.

philbooth commented 6 years ago

@davismtl, I've updated my import script to report accurate counts of the events it sends for each submission date and, comparing those to what's in Redash, I think things look pretty good.

For instance, the last 4 days look like this:

| submission_date | tab_sent (redash) | tab_sent (amplitude) | tab_received (redash) | tab_received (amplitude) |
| --- | --- | --- | --- | --- |
| 2018-02-15 | 2749 | 2742 | 1340 | 1338 |
| 2018-02-16 | 2816 | 2782 | 1341 | 1334 |
| 2018-02-17 | 2074 | 2056 | 1241 | 1232 |
| 2018-02-18 | 2193 | 2186 | 946 | 943 |

The small discrepancies are due to the Amplitude import script skipping events that don't have a serverTime set. I'm not sure exactly what causes that to happen on the Sync side of things, but ditching those events seems like the right thing to do. What do you think?
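
The skipping logic amounts to a guard along these lines (a simplified sketch; serverTime is the field from the Sync ping, the rest is illustrative):

```python
def should_import(event: dict) -> bool:
    """Skip events that lack serverTime: without it there is no
    trustworthy timestamp to hand to Amplitude."""
    return event.get("serverTime") is not None

events = [
    {"event_type": "tab_sent", "serverTime": 1518652800.0},
    {"event_type": "tab_received"},  # e.g. from a pre-Firefox-55 client
]
print(sum(1 for e in events if should_import(e)))  # -> 1
```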

mhammond commented 6 years ago

serverTime was added in Firefox 55, although I doubt that non-desktop clients ever send it. Do you have access to a sample ping without that field in events?

philbooth commented 6 years ago

Ah, thanks @mhammond. I updated my script to report app_version, and the events that don't have serverTime are all from pre-55 clients.