mozilla / fxa-activity-metrics

A server for managing the Firefox Accounts metrics database and pipeline

Debigulate the data #104

Closed philbooth closed 6 years ago

philbooth commented 6 years ago

There's a significant proportion of redundant data in our current dataset. We could:

- Delete all the flow.continued... events after using them to set the continued_from column in flow_metadata.
- Delete all the flow.experiment... events after using them to populate the flow_experiments table.
- Delete all the flow.begin and flow.completed events after using them to populate flow_metadata (be careful with this one, it might break some queries).
- Delete the strict_multi_device_users import job, which takes aaages and sometimes fails to finish at all.
- Send fewer performance events, just keep the ones we're actually using.
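
For illustration only, the "populate, then delete" pattern in the list above could be sketched roughly as follows. This is a minimal sketch using Python's built-in sqlite3 and an in-memory database, not the project's actual import code. The `flow_events` table, its columns, the `begin_time` column, the `fold_and_delete` helper and the sample event names other than `flow.begin` are all assumptions made up for the example.

```python
import sqlite3

# Hypothetical schema: only the flow_metadata table name comes from the issue.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE flow_events (flow_id TEXT, type TEXT, timestamp INTEGER);
    CREATE TABLE flow_metadata (flow_id TEXT PRIMARY KEY, begin_time INTEGER);
""")
conn.executemany(
    "INSERT INTO flow_events VALUES (?, ?, ?)",
    [
        ("a", "flow.begin", 100),
        ("a", "flow.signup.view", 101),       # illustrative event name
        ("b", "flow.begin", 200),
        ("b", "flow.performance.load", 205),  # illustrative event name
    ],
)


def fold_and_delete(conn, event_type):
    """Copy one event type into flow_metadata, then delete the raw events.

    Both statements run in a single transaction, so the events are only
    removed if the metadata rows were written successfully.
    """
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO flow_metadata (flow_id, begin_time) "
            "SELECT flow_id, MIN(timestamp) FROM flow_events "
            "WHERE type = ? GROUP BY flow_id",
            (event_type,),
        )
        cur = conn.execute(
            "DELETE FROM flow_events WHERE type = ?", (event_type,)
        )
    return cur.rowcount


deleted = fold_and_delete(conn, "flow.begin")
remaining = conn.execute("SELECT COUNT(*) FROM flow_events").fetchone()[0]
metadata_rows = conn.execute(
    "SELECT flow_id, begin_time FROM flow_metadata ORDER BY flow_id"
).fetchall()
```

Doing the copy and the delete in one transaction is the main design point: if the copy fails, the redundant events are still there to retry from.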

I estimate we could save about 5 gigabytes from our ~7 gigabyte per-day dataset size.

/cc @jbuck

shane-tomlinson commented 6 years ago

> I estimate we could save about 5 gigabytes from our ~7 gigabyte per-day dataset size.

Crikey. I knew we were operating at some scale, didn't realize the logs alone were this much.

philbooth commented 6 years ago

> Delete all the flow.begin and flow.completed events after using them to populate flow_metadata (be careful with this one, it might break some queries).

I just checked and this one will break a number of queries. They could all be fixed to use flow_metadata columns instead, but that's probably not a fair burden to put on people.

Although, if a query breaks and nobody notices it's broken, does it matter?

Not sure, I'll send an email to canvass opinion.

irrationalagent commented 6 years ago

I'm going to weigh in on this soon. I know I've used flow.completed a fair bit, just need to see where.

irrationalagent commented 6 years ago

So there are a few charts on this dashboard that will break because of flow.completed use, as well as queries that aren't on there but are forks of those. Given the huge benefit in terms of space here though, I would be OK with the change. I can just patch those queries to join on flow_metadata. I haven't used flow.begin much, I usually use .view as top-of-funnel.
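
As a rough idea of what patching a query to join on flow_metadata might look like, here is a small sketch using Python's built-in sqlite3. It is not taken from the real dashboard queries. The `flow_events` table, the `completed` column on `flow_metadata`, the `completion_counts` helper and the `flow.signup.view` event name are hypothetical stand-ins.

```python
import sqlite3

# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE flow_events (flow_id TEXT, type TEXT);
    CREATE TABLE flow_metadata (flow_id TEXT PRIMARY KEY, completed INTEGER);
    INSERT INTO flow_events VALUES
        ('a', 'flow.signup.view'),
        ('b', 'flow.signup.view'),
        ('c', 'flow.signup.view');
    INSERT INTO flow_metadata VALUES ('a', 1), ('b', 0), ('c', 1);
""")


def completion_counts(conn, view_type):
    """Return (flows that viewed, flows that also completed).

    Completion is read from flow_metadata through a join, so the query
    no longer needs any flow.completed rows in the events table.
    """
    return conn.execute(
        "SELECT COUNT(DISTINCT e.flow_id), "
        "       COUNT(DISTINCT CASE WHEN m.completed = 1 THEN e.flow_id END) "
        "FROM flow_events AS e "
        "LEFT JOIN flow_metadata AS m ON m.flow_id = e.flow_id "
        "WHERE e.type = ?",
        (view_type,),
    ).fetchone()


viewed, completed = completion_counts(conn, "flow.signup.view")
```

The LEFT JOIN keeps flows that have a view event but no metadata row in the top-of-funnel count, which seems the safer default for a funnel chart.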

philbooth commented 6 years ago

@irrationalagent, don't worry about flow.completed, I'm going to keep it. The goal of this issue wasn't to cause upheaval, it was to get whichever speedups could be had without breaking stuff. :smile:

flow.begin is toast though!