meltano / squared

Where the Meltano team runs Meltano! Get it???

Superset Dashboard of Pipeline Runs, Unique Pipelines, and Projects with progressive filters applied #406

Closed pnadolny13 closed 2 years ago

pnadolny13 commented 2 years ago

Pipeline Runs:

Unique Pipelines and Projects

pnadolny13 commented 2 years ago

@tayloramurphy can you say a little more about what you'd like to see here? Just so we're on the same page. I put the issue comment in during our 1-1 and it's not 100% clear now what we want to see. Are we hoping to make this a "pipelines dashboard"?

What filters were you thinking of? I'll explore and add what I think is useful as well, then we can review the draft together, but I want to make sure your top requests are in there.

tayloramurphy commented 2 years ago

@pnadolny13 there are a couple of ways to do this. One of the main goals should be a clear, single dashboard that shows "this is our current best definition of each metric": Active Projects, Pipeline Runs, and Unique Pipelines.

But I also want to see filtered views of each of those. They could be separate charts, separate series on the same chart, or selectable filters across all charts.

| | Projects | Pipeline Runs | Unique Pipelines |
| --- | --- | --- | --- |
| Raw counts, no filters | | | |
| Only from Active Projects | | | |
| Only from Projects with >= 1 Pipeline Run | | | |
| Only from Whale projects | | | |

In the above example, the ✅ is the best definition for each metric, but it's also part of the data filtering journey.

Does that make sense?

pnadolny13 commented 2 years ago

@tayloramurphy yep that makes sense. I have a draft going here.

We have a sort of mixing of aggregations here: "active" uses a 28-day window, while aggregates in Superset are rolled up at the month level. So I wanted to explain how I'm thinking about it to make sure we're on the same page.

I'm using the 28-day window to flag a project as active at the day level. I'm then applying that active attribute to the CLI executions fact table based on the date each execution ran, so we can filter for executions that happened while a project was considered active. This means that if I do a distinct count of project IDs in a month where their active flag was true, I end up with a slightly different number than the daily PxP active projects number on a particular day that month. One is saying "there are x active projects today" or "x on 8/19/22"; the other is saying "there were x active projects that executed plugins in August" or "active projects contributed 90% of pipeline runs in August". Does that distinction make sense?
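To make the distinction concrete, here's a toy sketch of the two aggregations. The data, the `active_on` definition (a project counts as active once it is at least 28 days old), and all names are illustrative assumptions, not the real PxP models:

```python
from datetime import date, timedelta

# Assumed toy definition: a project is flagged "active" on a given day
# once it is at least 28 days old; each execution inherits the flag for
# the day it ran.
WINDOW = timedelta(days=28)

executions = {
    # project_id: (first_seen, [execution dates in August 2022])
    "p1": (date(2022, 6, 1), [date(2022, 8, 5), date(2022, 8, 19)]),
    "p2": (date(2022, 8, 1), [date(2022, 8, 10), date(2022, 8, 30)]),
}

def active_on(first_seen, day):
    return day - first_seen >= WINDOW

# Point-in-time: how many projects are active on 8/19/22?
active_today = sum(
    active_on(first_seen, date(2022, 8, 19))
    for first_seen, _ in executions.values()
)

# Monthly rollup: distinct projects with at least one execution that
# happened while the project was flagged active.
active_in_month = len({
    pid
    for pid, (first_seen, days) in executions.items()
    if any(active_on(first_seen, d) for d in days)
})

print(active_today, active_in_month)  # 1 2
```

Here `p2` is not active on 8/19 (only 18 days old), but its 8/30 run happens after its 28-day mark, so the monthly rollup counts it: the two numbers legitimately differ.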

Previously we were reporting total active project counts on a specific day of the 28-day rolling window aggregate. I think it's better to stop doing that and start treating active status as an attribute we use for filtering, but aggregate to whatever grain we want. We might eventually switch to a 14-day or 60-day window, which wouldn't cause any problems. Or we might want to aggregate pipeline runs at the week level vs. the month level: "how many active projects ran a pipeline in the first week of August vs. across the whole month?"

All of that was to preface a question I had about "Only from Projects with >= 1 Pipeline Run". At what aggregation level should I be summing pipeline runs? I would assume the calendar month level. I don't know a way to aggregate that on the fly in Superset, so I'd need to pre-aggregate it in dbt and have an attribute we filter by, like a `monthly_pipeline_runs` integer or a `monthly_pipeline_runs_greater_than_1` (or something) boolean.

But it gets a little tricky, because then you can't filter for active projects in the dashboard for that chart: not all of those pre-aggregated pipeline runs are necessarily active. For example, if 7 days before the end of the month I create a project and run 1k pipelines, I'm not considered active for any day that month, so I had 1k pipeline runs but 0 active pipeline runs. If we want to distinguish between the two, that leads me to want `monthly_pipeline_runs` and `monthly_pipeline_runs_active`, etc. etc., which isn't the best. Do you follow that?

If we want this to be a funnel, then only having the `monthly_pipeline_runs_active` integer works for me. It's a subset of active projects, and it can be used to filter > 1; for whales it can be a further subset of that where we filter > 10k (or whatever we decide the number is).

pnadolny13 commented 2 years ago

The above comment is similar to https://github.com/meltano/internal-data/issues/24 as well. How should we evaluate their archetype segment? It makes me think these should also be attributes that get aggregated daily with a sliding window. If I create a project on the first of the month, 7 days later I'm an active guppy, then a marlin, then on the last day of the month I'm a whale. If we want a count of pipelines from whales, would this project only contribute 1 day's worth of pipelines because it reached whale status on the last day of the month? Or are we projecting the max segment status (aka whale) across the whole month for that project once it reaches it?

tayloramurphy commented 2 years ago

@pnadolny13 I think that all makes sense.

To your last point, the general question we're trying to answer with the > N pipeline runs is the bucketing of a project into the archetypes we've discussed. So it's less the specific filter of > 1; the real question is the 1-50, 50-500, 500-10k, etc. buckets. As for how we do it, I think mapping our potential billing model would make sense.

Assuming we charge on a per pipeline run basis then we should probably just look at it monthly. So the "1-50 pipeline runs" bucket would mean "in X month, no matter when I started, if I ran between 1 and 50 pipelines I'm in this bucket".
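A minimal sketch of that monthly bucketing. The bucket edges come from the comment above; the function name and the behavior at exact boundaries are assumptions:

```python
# Assumed bucket edges from the discussion: 1-50, 50-500, 500-10k, 10k+.
# Boundary handling (lower bound inclusive, upper exclusive) is a guess.
BUCKETS = [(1, 50), (50, 500), (500, 10_000)]

def monthly_bucket(pipeline_runs: int) -> str:
    """Bucket a project by its total pipeline runs in a calendar month,
    regardless of when in the month it started."""
    if pipeline_runs >= 10_000:
        return "10k+"
    for lo, hi in BUCKETS:
        if lo <= pipeline_runs < hi:
            return f"{lo}-{hi}"
    return "0"

print(monthly_bucket(25))      # 1-50
print(monthly_bucket(20_000))  # 10k+
```

Once a month closes, each project gets exactly one bucket from its monthly total, which is what makes the "no matter when I started" framing work.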

> If we want a count of pipelines from whales then would this project only contribute 1 day worth of pipelines because they reached whale status on the last day of the month? Or are we projecting the max segment status (aka whale) across the whole month for that project once they reach it.

We'd project the whale status across the entire month. On a daily basis we could report the number of active projects in a given bucket, but after a month is finished we should be able to settle the numbers for the previous month and report at the end of the month the overall distribution. (That said, if we have a 7 day window, it would take until the 6th of the month for us to "close out" the numbers from the previous month. So possibly an argument against that, but I'm okay with it for now).

pnadolny13 commented 2 years ago

> To your last point, the general question we're trying to answer with the > N pipeline runs is the bucketing of a project into the archetypes we've discussed. So it's less the specific filter of > 1, but the real question is 1-50, 50-500, 500-10k, etc buckets. As for how we do it, I think mapping our potential billing model would make sense.

@tayloramurphy sounds good - I'm planning to close out this issue because https://github.com/meltano/internal-data/issues/24 needs to be completed before the rest of these filters can be added. I created https://github.com/meltano/squared/issues/420 to add those filters once they're available.

> We'd project the whale status across the entire month. On a daily basis we could report the number of active projects in a given bucket...

We can do this, but I feel like month-grain aggregations for archetype status and daily aggregations for active status are at odds. I don't think we'd get anything useful out of looking at archetype attribution until the month is over, because someone who has been a whale every month for a year might not be a whale until the last week of the month, so looking before the month is closed would probably be misleading.

> That said, if we have a 7 day window, it would take until the 6th of the month for us to "close out" the numbers from the previous month. So possibly an argument against that, but I'm okay with it for now

Not in the way I'm implementing it. Active status is attributed daily and assigned to individual executions, so there's no lag in attribution.

This is where I think we might be on different pages, and why I wanted to explain my implementation in detail to make sure it matches expectations. The way I'm treating active users right now is that everything they do in their first 7 days is not counted when we filter for active projects; they were not active during that period, so those pipeline runs aren't included in any active pipeline run metrics. Once they reach their 7-day mark they become active and their pipeline runs start contributing to the metrics. I'm not using logic where, after the 7-day mark, they become active and all pipeline executions before then get unlocked to contribute to the active pipeline run metrics. So there's no need to wait 7 days to close out the month. Do you agree with that?
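The "no lag" point can be checked with a tiny example, under the stated assumption (a run counts toward active metrics only if the project had already passed its 7-day mark on the day of the run; earlier runs are never retroactively unlocked):

```python
from datetime import date, timedelta

# Assumption from the comment above: a run is "active" only if the
# project was past its 7-day mark on the run date.
WINDOW = timedelta(days=7)

first_seen = date(2022, 8, 1)
runs = [date(2022, 8, 3), date(2022, 8, 10)]

total_runs = len(runs)
active_runs = sum(d - first_seen >= WINDOW for d in runs)

# Attribution is final on the day each run happens, so August can be
# closed out on 8/31 with no waiting period into September.
print(total_runs, active_runs)  # 2 1
```

The 8/3 run stays inactive forever, and the 8/10 run is active the moment it happens, so nothing about the attribution changes after month end.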

If we consider < 7-day activity to be exploratory and excluded, then I think it's best to include those pipeline runs in total numbers but not in active numbers; they were inactive and still exploring when they ran those pipelines.