mozilla/fxa-activity-metrics

A server for managing the Firefox Accounts metrics database and pipeline

Store country in our own data #108

Open · davismtl opened this issue 5 years ago

davismtl commented 5 years ago

We currently pass country to Amplitude but we don't store it in our own dataset for viewing in Re:Dash.

Let's keep our own hard copy of country for use in Re:Dash.

Story: As a Firefox leader, I would like to be able to track changes in trends in our most important countries. (e.g. YoY)

vladikoff commented 5 years ago

From mtg: properties that are sent to Amplitude do not match Re:Dash; we should keep them consistent.

@philbooth is it difficult to fix this for Re:Dash?

philbooth commented 5 years ago

It's not difficult, but it requires changes to the Heka filter, which means co-ordination with the data pipeline team. I know they're keen to push us away from Heka ASAP, so they may prefer us to spend time on that first (or maybe not, I'm only guessing at this point).

davismtl commented 5 years ago

There's no rush here. Not a priority.

irrationalagent commented 5 years ago

@jklukas kind of needs this for the growth dashboard that cmore's team is building, so we should probably do it.

jklukas commented 5 years ago

Indeed, the definition of the relationships KPI specifies “in U.S., Canada, Germany, France, UK”, so having country info will be necessary to meet that executive definition.

irrationalagent commented 5 years ago

CC @jmccrosky

philbooth commented 5 years ago

This will be much easier to change when we've moved to stackdriver logging. What timescale is it needed in?

jmccrosky commented 5 years ago

We want to use this for a KPI dashboard that the execs want by January 7th. If that's not possible though, it seems there may be other options.

philbooth commented 5 years ago

It definitely won't be ready by 7th January fwiw, whichever way we do it. Fixing it requires a deployment of the FxA auth and content servers and there's not one scheduled before then.

jklukas commented 5 years ago

After more discussion with @jmccrosky, it does look like filtering by country is a hard requirement for the KPI dashboard we need to provide in January. What would be the next scheduled deployment of the FxA auth and content servers? Would it be feasible to prep this change before then?

We can likely make do with a provisional graph for a few weeks if we're confident that Redshift will be getting the country info we need by ~end of January. Otherwise, we'll have to put effort into building a stopgap solution, likely a daily Airflow job that pulls JSON logs from Amplitude via their export API, which would be several days of engineering effort.
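
A minimal sketch of what that stopgap pull might look like, assuming Amplitude's documented Export API (the credentials are placeholders and error handling is omitted):

```python
import gzip
import io
import json
import zipfile

import requests

API_KEY = "AMPLITUDE_API_KEY"        # placeholder
SECRET_KEY = "AMPLITUDE_SECRET_KEY"  # placeholder


def pull_day(day):
    """Yield raw Amplitude events for one UTC day, e.g. day='20190102'."""
    resp = requests.get(
        "https://amplitude.com/api/2/export",
        params={"start": day + "T00", "end": day + "T23"},
        auth=(API_KEY, SECRET_KEY),
    )
    resp.raise_for_status()
    # The export is a zip archive of gzipped newline-delimited JSON files.
    with zipfile.ZipFile(io.BytesIO(resp.content)) as archive:
        for name in archive.namelist():
            with archive.open(name) as member:
                with gzip.open(member) as lines:
                    for line in lines:
                        yield json.loads(line)
```

An Airflow DAG would just wrap this in a daily task and write the fields we care about (event_type, event_time, country) somewhere queryable.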

Does the heka filter change seem smaller in scope than building a job to do nightly pulls from Amplitude? Will we only get country info for new messages once a heka filter change goes into place?

On a completely different note, would the raw FxA logs contain all the info we need and potentially be accessible from Databricks? If they're newline delimited JSON with a reasonable schema, that might be an avenue worth exploring.

philbooth commented 5 years ago

I've prepped the logging changes in https://github.com/mozilla/fxa-auth-server/pull/2851 and https://github.com/mozilla/fxa-content-server/pull/6851. These can be reviewed now and deployed in the next FxA train (I still need to find out exactly when that will be, but it will definitely be well before the end of January). This was the easy bit.

The change to the import scripts is ready and waiting in https://github.com/mozilla/fxa-activity-metrics/pull/123, but we can't deploy it until the CSVs include the extra columns otherwise the imports will fail. There's a question for @jbuck in that PR about how to go about this, which should also shed some more light on exactly when the data will be available in Redshift.

So, responding to the questions in your preceding comment, @jklukas:

What would be the next scheduled deployment of the FxA auth and content servers?

We're due to cut FxA train 128 in the week of the 7th so, assuming planetary alignment and no hiccups in QA, it should be live at some point during the week of the 14th.

Would it be feasible to prep this change before then?

Done.

Does the heka filter change seem smaller in scope than building a job to do nightly pulls from Amplitude?

Absolutely, yes.

Will we only get country info for new messages once a heka filter change goes into place?

Yep. The Heka change needs to go live, then we need to update our import scripts, and the following day it should show up in the data.

On a completely different note, would the raw FxA logs contain all the info we need and potentially be accessible from Databricks?

Yes, they would. Either right now, by filtering for our amplitudeEvent log lines, or after train 128 is deployed using activityEvents and/or flowEvents. Those events are all formatted as JSON blobs in the logs. I can't say anything about accessibility from Databricks though, I suspect @jbuck could say more about that part of the question.
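
For anyone trying the Databricks route, the filtering would look roughly like this sketch (I'm assuming a mozlog-style shape with Type and Fields keys here; verify against the actual log lines before relying on it):

```python
import json
import sys


def amplitude_events(lines):
    """Yield the payload of amplitudeEvent log lines, skipping everything else."""
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # not a JSON line
        if "amplitudeEvent" in record.get("Type", ""):
            # The event itself is assumed to live under Fields.
            yield record.get("Fields", record)


if __name__ == "__main__":
    for event in amplitude_events(sys.stdin):
        print(json.dumps(event))
```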

jklukas commented 5 years ago

I really appreciate the info and the help here, @philbooth.

Based on this, I'm guessing that a good way forward for the growth dashboard will be to rely on Redshift for the general case. If we decide we need to have data for the first few weeks of January, we can do some sort of one-off manipulation of the raw event logs to create a derived dataset in S3 that serves as a backfill for the events before January.

philbooth commented 5 years ago

Just noting in here that, as per https://github.com/mozilla/fxa-activity-metrics/pull/123#issuecomment-450522740, the change to the Heka filter should be unnecessary because we expect to roll out stackdriver logging in early Jan.

philbooth commented 5 years ago

Moved this to active and put mine and @jbuck's faces on it.

jklukas commented 5 years ago

Some updated info on needs for the growth dashboard: the expectation is that we have results available the week of Jan 11th in order to have guidance prepped for Chris Beard on the 18th, which will then be presented to the board a few days later.

@philbooth - Can you give me a better sense of what happens once stackdriver logging is rolled out? Is there significant effort after that point to get the data into Redshift or BigQuery? Will we get historical info with the stackdriver solution, or only new records?

If we don't feel confident we'll be able to query a production-quality dataset with country info by the week of Jan 11, then I think I'm going to start working on a solution for manually manipulating logs from Amplitude so that we have a stopgap until the new stackdriver-based solution is ready.

philbooth commented 5 years ago

results available the week of Jan 11th

I think we're definitely not in a position to make guarantees about that week from our side fwiw.

Can you give me a better sense of what happens once stackdriver logging is rolled out? Is there significant effort after that point to get the data into Redshift or BigQuery?

@jbuck, correct me if I'm wrong, but my understanding is that the first step will be to have stackdriver write to CSV files in S3 just like Heka does, so only a minor tweak to our import scripts will be needed to get the data into Redshift.
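
To give a sense of the scale of that tweak, it's roughly a one-time column addition plus the same COPY-from-S3 load we do today (all names below are placeholders, not the real table, bucket, or role):

```python
import psycopg2

# One-time migration, run once before the first import that includes country.
MIGRATION = "ALTER TABLE flow_events ADD COLUMN country VARCHAR(64)"


def import_day(conn, day):
    """Load one day's CSV from S3 into Redshift; the CSV now carries country."""
    copy_sql = (
        "COPY flow_events "
        "FROM 's3://fxa-log-bucket/flow-events-" + day + ".csv' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-import' "
        "CSV"
    )
    with conn.cursor() as cur:
        cur.execute(copy_sql)
    conn.commit()
```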

Will we get historical info with the stackdriver solution, or only new records?

Only new records will have country data. That column will be null for older data.

If we don't feel confident we'll be able to query a production-quality dataset with country info by the week of Jan 11, then I think I'm going to start working on a solution for manually manipulating logs from Amplitude so that we have a stopgap until the new stackdriver-based solution is ready.

This seems like a prudent step to take, sorry that it means more work on your side though.

jklukas commented 5 years ago

@philbooth - That clarity is helpful. Since the "stopgap" solution will also fulfill our needs for backfilling data from before country is available in the pipeline, I'll start digging into the Amplitude export option today.

jklukas commented 5 years ago

Is this change still pending? We have a working manual process for extracting data from Amplitude for the time being, but we're still interested in switching over to Redshift as the source for keeping the growth dashboard up to date.

philbooth commented 5 years ago

@jklukas yep, @jbuck has been working through some issues with the migration to stackdriver logging that we need to get resolved first. As soon as stackdriver is handling all of our logs without problems, we can change the Redshift import to pull from there and pick up the new properties.

Sorry it's taken longer than originally advertised; it's absolutely still happening though!

irrationalagent commented 5 years ago

This would be useful to have done before trailhead. Never mind, see below.

jklukas commented 5 years ago

Note that the KPI dashboard use case is now resolved via the Stackdriver pipeline into BigQuery. We now get the amplitude events (including country) from the fxa prod project in GCP.

That's likely tangential to @irrationalagent's needs, but wanted to make sure it's noted here.

irrationalagent commented 5 years ago

@jklukas are we getting all the user and event properties there as well, e.g. service? If so then I'll retract my comment above.

jklukas commented 5 years ago

are we getting all the user and event properties there as well, e.g. service? If so then I'll retract my comment above

Yes, the "content" table in BigQuery captures event properties and user properties. It has field jsonPayload.event_properties.service, for example.