Open espenhoh opened 2 months ago
Hi @espenhoh
I understand the issue; we have seen these collisions in the past.
Yes, we decided not to replace the ORA_HASH
function because doing so would break existing snapshots. Users who are starting fresh snapshots should include the following macro in their dbt project. It uses a much stronger hash function, SHA256:
{% macro oracle__snapshot_hash_arguments(args) -%}
STANDARD_HASH({%- for arg in args -%}
coalesce(cast({{ arg }} as varchar(4000) ), '')
{% if not loop.last %} || '|' || {% endif %}
{%- endfor -%}, 'SHA256')
{%- endmacro %}
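For clarity, here is roughly what the macro above renders to for a hypothetical two-column snapshot (the column names id and name are illustrative only):

-- Illustrative rendered SQL for args = ['id', 'name']
STANDARD_HASH(
    coalesce(cast(id as varchar(4000)), '')
    || '|' ||
    coalesce(cast(name as varchar(4000)), ''),
  'SHA256')

Each argument is cast to varchar and NULLs are coalesced to empty strings, so the concatenated string is deterministic for a given row.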
For existing snapshots there is a migration option. dbt snapshot uses a MERGE INTO statement:
MERGE INTO target t USING staging s ON (s.dbt_scd_id = t.dbt_scd_id)
So, before switching to the new hash function, you need to UPDATE the dbt_scd_id column
in the snapshot table using the STANDARD_HASH function:
UPDATE <SNAPSHOT_TABLE>
SET dbt_scd_id = STANDARD_HASH(<args>, 'SHA256')
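As a concrete sketch of that migration step (the table name my_snapshot and the columns id and updated_at are hypothetical; substitute the actual columns your snapshot hashes, concatenated exactly as the macro renders them):

-- Illustrative only: names below are placeholders.
-- The expression must match the macro's output exactly,
-- or subsequent MERGEs will not find existing rows.
UPDATE my_snapshot
SET dbt_scd_id = STANDARD_HASH(
      coalesce(cast(id as varchar(4000)), '')
      || '|' ||
      coalesce(cast(updated_at as varchar(4000)), ''),
    'SHA256');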
Is there an existing issue for this?
Current Behavior
snapshot is implemented with the ora_hash() function. This produces many collisions, and hence duplicate ids, in large data sets, causing errors in the merge statement at volumes of millions of rows. dbt snapshot will fail once a hash collision appears, and the snapshot table becomes impossible to update.
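To see why collisions are essentially guaranteed at this scale: ora_hash() by default maps into a 32-bit bucket space, so the birthday bound applies. A short sketch (the 2**32 space size reflects ORA_HASH's default maximum bucket; the formula is the standard birthday approximation):

```python
import math

def collision_probability(n, space=2**32):
    """Birthday approximation: P(at least one collision) among n
    uniformly hashed items in a space of the given size."""
    return 1 - math.exp(-n * (n - 1) / (2 * space))

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9} rows: P(collision) ~ {collision_probability(n):.4f}")
```

Already at a few hundred thousand rows a collision is more likely than not, and at a million rows it is a near certainty, which matches the behavior described here.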
Expected Behavior
No collisions.
Steps To Reproduce
Run the dbt snapshot command on data with a few million updates. The first dbt snapshot command succeeds because there are only inserts, but duplicates in dbt_scd_id exist, causing trouble on subsequent runs.
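The duplicates can be confirmed directly in the snapshot table with a query like the following (my_snapshot is a placeholder for the actual snapshot table name):

-- Find dbt_scd_id values that occur more than once (illustrative)
SELECT dbt_scd_id, COUNT(*) AS cnt
FROM my_snapshot
GROUP BY dbt_scd_id
HAVING COUNT(*) > 1;

Any rows returned indicate hash collisions that will break the MERGE on subsequent runs.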
Relevant log output using the --debug flag
Environment
What Oracle database version are you using dbt with?
19c
Additional Context
No response