mozilla / fxa-activity-metrics

A server for managing the Firefox Accounts metrics database and pipeline

Debigulate the data #104

Closed philbooth closed 6 years ago

philbooth commented 6 years ago

There's a significant proportion of redundant data in our current dataset. We could:

I estimate we could save about 5 gigabytes from our ~7 gigabyte per-day dataset size.

/cc @jbuck

shane-tomlinson commented 6 years ago

> I estimate we could save about 5 gigabytes from our ~7 gigabyte per-day dataset size.

Crikey. I knew we were operating at some scale, didn't realize the logs alone were this much.

philbooth commented 6 years ago
> • Delete all the flow.begin and flow.completed events after using them to populate flow_metadata (be careful with this one, it might break some queries).

I just checked and this one will break a number of queries. They could all be fixed to use flow_metadata columns instead, but that's probably not a fair burden to put on people.

Although, if a query breaks and nobody notices it's broken, does it matter?

Not sure, I'll send an email to canvass opinion.

irrationalagent commented 6 years ago

I'm going to weigh in on this soon. I know I've used flow.completed a fair bit, just need to see where.

irrationalagent commented 6 years ago

So there are a few charts on this dashboard that will break because of flow.completed use, as well as queries that aren't on there but are forks of those. Given the huge benefit in terms of space here, though, I would be OK with the change. I can just patch those queries to join on flow_metadata. I haven't used flow.begin much; I usually use .view as top-of-funnel.
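
For illustration, patching a query to join on flow_metadata might look roughly like the sketch below. It uses an in-memory SQLite database rather than the real metrics database, and everything about the schema is an assumption made for the example: the `flow_events` table, its `type` column, the `completed` column on `flow_metadata`, and the `flow.signup.view` event name are hypothetical stand-ins, not the project's actual definitions.

```python
import sqlite3

# Hypothetical, heavily simplified schema (the real tables are assumed to differ).
SCHEMA = """
CREATE TABLE flow_events (flow_id TEXT, type TEXT);
CREATE TABLE flow_metadata (flow_id TEXT PRIMARY KEY, completed INTEGER);
"""

# Old style: counts completed flows by looking for flow.completed event rows.
OLD_QUERY = """
SELECT COUNT(DISTINCT flow_id)
FROM flow_events
WHERE type = 'flow.completed'
"""

# Patched style: starts from the .view event and reads completion from
# flow_metadata, so it does not depend on flow.begin or flow.completed rows.
NEW_QUERY = """
SELECT COUNT(DISTINCT e.flow_id)
FROM flow_events AS e
JOIN flow_metadata AS m ON m.flow_id = e.flow_id
WHERE e.type = 'flow.signup.view' AND m.completed = 1
"""


def build_demo_db():
    """Create an in-memory database with three made-up flows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    events = [
        ("a", "flow.begin"), ("a", "flow.signup.view"), ("a", "flow.completed"),
        ("b", "flow.begin"), ("b", "flow.signup.view"),
        ("c", "flow.begin"), ("c", "flow.signup.view"), ("c", "flow.completed"),
    ]
    conn.executemany("INSERT INTO flow_events VALUES (?, ?)", events)
    # flow_metadata is populated before any redundant events are deleted.
    metadata = [("a", 1), ("b", 0), ("c", 1)]
    conn.executemany("INSERT INTO flow_metadata VALUES (?, ?)", metadata)
    return conn


def count(conn, query):
    """Run a single-value query and return that value."""
    return conn.execute(query).fetchone()[0]


def delete_redundant_events(conn):
    """Drop the event rows whose information now lives in flow_metadata."""
    conn.execute(
        "DELETE FROM flow_events WHERE type IN ('flow.begin', 'flow.completed')"
    )
```

In this toy setup both queries agree while the events are still present, and only the patched query keeps returning the same count once `delete_redundant_events` has run.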

philbooth commented 6 years ago

@irrationalagent, don't worry about flow.completed, I'm going to keep it. The goal of this issue wasn't to cause upheaval, it was to get whichever speedups could be had without breaking stuff. :smile:

flow.begin is toast though!