tnightengale / dbt-activity-schema

A dbt-Core package for generating models from an activity stream.
GNU General Public License v3.0

Refactor Join Requirements for Aggregate All Ever #36

Open bcodell opened 1 year ago

bcodell commented 1 year ago

Description

The aggregate_all_ever relationship can cause an exploding join when it self-joins an activity with high cardinality. For example, on a customer stream with a placed_order activity, a dataset whose primary activity is placed_order and whose appended activity is aggregate_all_ever_placed_order would trigger this issue.
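For illustration, a minimal sketch of the problematic self-join (table and column names here are hypothetical, not the package's generated SQL): a customer with n placed_order rows produces n * n intermediate rows before aggregation.

```sql
-- hypothetical activity stream table; every placed_order row joins to
-- every other placed_order row for the same customer
select
    p.activity_id,
    count(a.activity_id) as total_placed_order_ever
from activity_stream as p
left join activity_stream as a
    on a.customer = p.customer
    and a.activity = 'placed_order'
where p.activity = 'placed_order'
group by 1
```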

Proposal

Add conditional logic in the dataset macro:
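A hedged sketch of what that conditional might look like (macro variables and names are illustrative, not the package's actual internals): when the relationship is aggregate_all_ever, aggregate the appended activity directly by customer instead of self-joining back to the primary activity's CTE.

```sql
{% if appended_activity.relationship == 'aggregate_all_ever' %}
    -- no self-join: aggregate the appended activity on its own,
    -- keyed by customer rather than by the primary activity_id
    select
        customer,
        {{ appended_activity.aggregation }} as {{ appended_activity.alias }}
    from {{ activity_stream }}
    where activity = '{{ appended_activity.name }}'
    group by 1
{% else %}
    -- existing join-based logic for all other relationships
    {{ existing_appended_activity_join(appended_activity) }}
{% endif %}
```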

bcodell commented 1 year ago

@tnightengale ran into this in my own version of the project, and the proposed fix worked great - a dataset with >30MM rows that wasn't materializing in 20+ minutes now runs successfully in 30 seconds. This issue definitely creates a scaling problem for users, so it'd be good to resolve quickly. I can take a stab at a fix, but let me know if you have any questions.

tnightengale commented 1 year ago

100% let's fix this asap 👍

bcodell commented 1 year ago

Sounds good! I'm on vacation the rest of this week but can tackle it next week. If you want to get this fixed sooner, though, have at it! For reference, my approach in my own project was to add conditional logic based on the relationship of the appended activity. For everything but aggregate_all, use the existing join logic. Otherwise, don't join back to the primary CTE - just select the appended activity and aggregate it on its own customer column. Then add the same conditional logic to the final join statement, so that those CTEs join on customer instead of activity_id.
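The final-join side of that change might look like this (a sketch with hypothetical CTE names; the aggregate_all_ever CTE now carries one row per customer, so the join stays one-to-one per primary row):

```sql
select
    primary_activity.*,
    aggregate_all_ever_placed_order.total_placed_order_ever
from primary_activity
left join aggregate_all_ever_placed_order
    -- aggregate_all_ever CTEs join on customer...
    on aggregate_all_ever_placed_order.customer = primary_activity.customer
-- ...whereas other relationships keep the existing join, e.g.:
-- on appended_activity.activity_id = primary_activity.activity_id
```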