tnightengale / dbt-activity-schema

A dbt-Core package for generating models from an activity stream.
GNU General Public License v3.0

TRY_CAST SQL compilation error #26

Open wylbee opened 1 year ago

wylbee commented 1 year ago

Hello!

Appreciate the awesome work on this package and am excited to get into using it.

I'm getting a SQL compilation error when attempting to use the package on Snowflake. dbt version is 1.3; dbt-activity-schema version is 0.3.2.

My stream is generated using the narrator activity schema dbt package, and the macro I'm using is:

        {{
            dbt_activity_schema.dataset(
                ref("account_stream"),
                dbt_activity_schema.activity(
                    dbt_activity_schema.all_ever(), "originates_loan"
                ),
                [
                    dbt_activity_schema.activity(
                        dbt_activity_schema.last_before(), "updates_autopay"
                    )
                ],
            )
        }}

Variables from the dbt_project.yml are:

  dbt_activity_schema:
      included_columns:
        - activity_id
        - customer
        - ts
        - activity
        - anonymous_customer_id
        - feature_json 
#        - feature_json_str
        - link
        - revenue_impact
        - activity_occurrence
        - activity_repeated_at

Error is:

Function TRY_CAST cannot be used with arguments of types TIMESTAMP_TZ(9) and VARCHAR(16777216)

Am I missing something that could get this working as expected? From my own troubleshooting, I see that:

If I'm not missing something and this is a DB-specific issue, I'd be interested in understanding how I can help make the package Snowflake-compatible.

tnightengale commented 1 year ago

@brown5628 Thanks for the issue!

This stems from a fix in 0.3.2: we realized that the min/max aggregations were not working properly on any column except the ts column and feature_json.

The _min_or_max.sql aggregation macro uses a trick: it prepends the ts column to each column before taking a min/max, then removes the prepended timestamp. This "give me the value in col_x where ts is min/max" logic allows the aggregation pattern to remain the same as for other aggregations like sum and count.

Otherwise, a lot of complexity would have to be introduced to handle pure aggregations separately from "aggregations" that actually require complex window functions and nested CTEs.
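
In rough pseudo-SQL (made-up table and column names, not the macro's actual compiled output), the pattern looks something like:

-- Illustrative sketch only: prepend a string rendering of ts to the value, take the
-- min/max of the combined string so ordering is driven by ts, then strip the prepended
-- timestamp back off. Assumes cast(ts as varchar) renders at a fixed width.
select
    customer,
    substring(
        min(cast(ts as varchar) || coalesce(col_x, '')),
        length(cast(min(ts) as varchar)) + 1
    ) as col_x_at_min_ts
from activity_stream
group by customer

The Snowflake error above is presumably coming from a cast step that compiles to TRY_CAST on that adapter: Snowflake's TRY_CAST only accepts string inputs, which would explain why it chokes when pointed at the TIMESTAMP_TZ ts column.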

All that being said, the _min_or_max.sql macro uses the dbt cross-database macros to perform the ts prepend, aggregate, and concat. However, our CI pipeline uses DuckDB; other adapters have not been officially tested.

Would you be able to look at that macro and investigate whether it can be tweaked to solve the issue for the Snowflake adapter? Either via some correction, or a dispatch to a Snowflake-specific implementation? E.g.:


{% macro snowflake__min_or_max() %}
    ...
{% endmacro %}
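
The usual dbt dispatch pattern should work for that. Roughly (names and signatures here are illustrative; the package's macro is _min_or_max, so the dispatched names may differ):

{% macro min_or_max(min_or_max, qualified_col) %}
    {# Dispatch to an adapter-specific implementation, falling back to default__. #}
    {{ return(adapter.dispatch("min_or_max", "dbt_activity_schema")(min_or_max, qualified_col)) }}
{% endmacro %}

{% macro default__min_or_max(min_or_max, qualified_col) %}
    {# Existing cross-database prepend/aggregate/strip implementation. #}
    ...
{% endmacro %}

{% macro snowflake__min_or_max(min_or_max, qualified_col) %}
    {# Snowflake-specific implementation, e.g. using min_by / max_by. #}
    ...
{% endmacro %}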
wylbee commented 1 year ago

@tnightengale Took a spin at this and am noting the following:

Given all of that, I'm leaning towards a Snowflake-specific implementation rather than a correction to the existing code. From a maintainability perspective, do you have a preference between adapting the prepend approach and using Snowflake's min_by and max_by? Given the choice of a specific implementation over a correction, I'm leaning towards min_by and max_by to leverage the DB-specific functions. Pseudo-code would be something like:

{% macro _min_or_max(min_or_max, qualified_col) %}

{% set aggregation = "min_by" if min_or_max == "min" else "max_by" %}
{% set qualified_ts_col = "{}.{}".format(
    dbt_activity_schema.appended(), dbt_activity_schema.columns().ts
) %}

{# Apply min or max by to the selected column. #}
{% set aggregated_col %}
    {{ aggregation }}({{ qualified_col }}, {{ qualified_ts_col }})
{% endset %}

{# Return output. #}
{% do return(aggregated_col) %}

{% endmacro %}
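
For what it's worth, Snowflake's min_by(col, ts) and max_by(col, ts) return the value of col on the row where ts is smallest or largest, which is exactly what the prepend trick emulates. The rendered output of the macro above would just be something like (alias and column name made up):

max_by(appended.some_col, appended.ts)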

Based on your thoughts, I'm happy to take a first cut at the code, but I'm unfamiliar with the process of contributing to a project like this. Would the best next step be to fork and tee up a pull request with the changes?

bcodell commented 1 year ago

Hey @brown5628 - I've been collaborating on this project with Teghan as well. Thanks for working on this! Teghan may have some opinions around how to best organize the code, but the Snowflake-specific implementation you proposed using built-in min_by and max_by functions makes sense to me. And regarding contributions, a fork + PR works.

Nice job spotting the edge case with casting feature_json to variant as well! Out of curiosity, what use cases do you have in mind for appending the full feature_json blob to a dataset, instead of parsing out individual features?

For reference, this question is more project management-y in nature - I'm wondering if specifying feature_json as an appended column should default to appending all parsed features for the appended activity, rather than appending the feature_json column itself.

wylbee commented 1 year ago

Thanks @bcodell.

Think your intuition on use is spot on: thus far, 100% of the time I parse out individual features, and I can't think of a pattern where I would want to retain the feature_json blob.
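
For context, "parsing out individual features" for me just means pulling keys straight off the blob downstream, e.g. (the feature name is made up, and this assumes feature_json is a variant column):

-- Hypothetical example: extract a single feature instead of carrying the whole blob.
select
    activity_id,
    customer,
    ts,
    feature_json:autopay_enabled::boolean as autopay_enabled
from {{ ref("account_stream") }}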

bcodell commented 1 year ago

FYI @brown5628 - while you work on the PR for this issue, I believe a workaround is now in place (from #34) that allows users to specify json columns by their key names in the included_columns argument (i.e. instead of including the entire feature_json column, which was causing this bug). I have a PR (#35) up for a new version release; while waiting for that to be merged, you can install the package pinned to this branch.
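
Roughly, the idea would look something like this in dbt_project.yml (the key names here are placeholders, and the exact syntax may differ, so check #34 for specifics):

vars:
  dbt_activity_schema:
    included_columns:
      - activity_id
      - customer
      - ts
      # individual feature_json keys (placeholders), instead of feature_json itself
      - autopay_enabled
      - loan_amount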

InbarShirizly commented 11 months ago

Hey @brown5628, @bcodell, @tnightengale, I just tried using this package and I'm hitting a similar issue with Snowflake. I can't run a single dataset model, since I get: Function TRY_CAST cannot be used with arguments of types TIMESTAMP_TZ(9) and VARCHAR(16777216)

I can see that there is an open PR #32 that might help here.

Is it planned to be merged soon?

thanks

bcodell commented 11 months ago

Hey @InbarShirizly, thanks for the ping - based on my previous comment, I think there's a workaround: explicitly declare the columns you want in the included_columns argument. But for the sake of transparency, there hasn't been much maintenance on this project in the last several months, and I ended up spinning off and building my own dbt Activity Schema management package (here), so I'm no longer familiar enough with that open PR to be comfortable merging it.

wylbee commented 11 months ago

@InbarShirizly The read @bcodell shared matches my own. The PR got stuck on how the contribution was structured, but the code itself works and produces the expected results per the documented testing approach. I don't expect it to be merged, since this project hasn't seen activity in months.

If you're interested, I'd recommend either trying the included_columns approach or loading the fork by adding this to your packages.yml file:

  - git: "https://github.com/brown5628/dbt-activity-schema-narrator"

I've been using the forked version in my Snowflake environment since April with no issues.