tnightengale / dbt-activity-schema

A dbt-Core package for generating models from an activity stream.
GNU General Public License v3.0
38 stars 5 forks source link

Implement Relationships: First In Between, Last In Between #8

Closed tnightengale closed 1 year ago

tnightengale commented 1 year ago

Create the following relationship macros, following the example of the existing relationship macros:

Checklist:

tnightengale commented 1 year ago

CC: @bcodell

tnightengale commented 1 year ago

@bcodell Let's discuss here if you encounter limitations with the dataset.sql macro when implementing these 👍

bcodell commented 1 year ago

@tnightengale does the join logic for first_after actually have the logic for first_in_between? Or am I interpreting it wrong? I would've expected first_after to not have an upper bound on the timestamp join criteria.

bcodell commented 1 year ago

I think I found a bug (I opened a PR but it has a failed test - see here for CI logs or pull down the branch and run dbt build locally to reproduce). Specifically, I believe there's a bug in the logic for how the aggregation_func parameter works when aggregating feature_json columns in most of the defined relationships (first_after, last_before/after, first/last_in_between). Namely, if multiple appended activity records get joined to the primary activity, and the desired value of feature_json doesn't correspond with the min/max aggregation function used by the defined relationship, then the wrong value will be returned in the dataset.

The failed test I linked to mimics that outcome - the expected output for last_in_between_visit_page_feature_json for entity_uuid 1 for dataset__last_in_between_2 should be "[{""visited page"": 1}]", but instead it's "[{""visited page"": 2}]", because that's the max value of all the visit page activity instances that get appended to the primary activity.

For first- and last-value relationships, a workaround is to use the hack described in this article - it's more performant than window functions + filtering.

One way to solve this could be by creating an aggregation_function abstraction that can be assigned to relationship classes. An abstraction like that might be necessary for the aggregate all/before/after/in between relationships anyway. For what it's worth, that's roughly what I did in my own Activity Schema project here.

tnightengale commented 1 year ago

@bcodell Great catch here! I think the timestamp concat hack is a good solution for now. I've pushed up the change to #12. Let's reassess if another abstraction is needed as we add the aggregate between relationships 👍