oracle / dbt-oracle

dbt (data build tool) adapter for Oracle Autonomous Database
https://docs.getdbt.com/docs/core/connect-data-platform/oracle-setup
Apache License 2.0
49 stars 17 forks source link

[Bug] hash_collissions in dbt snapshot #154

Open espenhoh opened 2 months ago

espenhoh commented 2 months ago

Is there an existing issue for this?

Current Behavior

snapshot is implemented with the ora_hash() function. This gives a lot of collissions in a large data set and duplicate ids, causing errors in the merge statement for data volumes of millions of rows. dbt snapshot will fail once a hash collission appears, and the snapshot tables becomes impossible to update.

Expected Behavior

No collissions,

Steps To Reproduce

run dbt snapshot command on data with a few million updates. First dbt snapshot command succeeds as there are only inserts, but duplicates in dbt_scd_id exists causing trouble on subsequest runs.

Relevant log output using --debug flag enabled

06:08:52    Database Error in snapshot table_name (snapshots\table_name.sql)
  ORA-30926: unable to get a stable set of rows in the source tables
  Help: https://docs.oracle.com/error-help/db/ora-30926/
  compiled Code at target\run\dbt_project\snapshots\table_name.sql

Environment

- OS: Windows
- Python: 3.11
- dbt: 1.8.1

What Oracle database version are you using dbt with?

19c

Additional Context

No response

aosingh commented 2 months ago

Hi @espenhoh

I understand the issue because we have seen these in the past.

52

102

And yes, we decided not to update ORA_HASH function because it would break existing snapshots, Users who are starting fresh snapshots should include the following macro in their dbt project. This uses a much stronger hash function - SHA256

{% macro oracle__snapshot_hash_arguments(args) -%}
    STANDARD_HASH({%- for arg in args -%}
        coalesce(cast({{ arg }} as varchar(4000) ), '')
        {% if not loop.last %} || '|' || {% endif %}
    {%- endfor -%}, 'SHA256')
{%- endmacro %}

For existing snapshots, to migrate, there is an option.

dbt snapshot uses MERGE INTO statement.

MERGE INTO target t USING staging s ON (s.dbt_scd_id = d.dbt_scd_id)

So, before using the new hash function you need to UPDATE the column dbt_scd_id in the snapshot table using the STANDARD_HASH function.

UPDATE <SNAPSHOT_TABLE> 
SET dbt_scd_id = STANDARD_HASH(<args>, 'SHA256')