snowplow / dbt-snowplow-unified

A fully incremental model that transforms raw web & mobile event data generated by the Snowplow JavaScript & mobile trackers into a series of derived tables of varying levels of aggregation.

snowplow_unified_sessions fails to run on Snowflake due to duplicates #90

Open jacobus-herman opened 2 weeks ago

jacobus-herman commented 2 weeks ago

Describe the bug

We receive an error when running the snowplow_unified_sessions model in production on Snowflake. The error is as follows:

Database Error in model snowplow_unified_sessions (models/sessions/snowplow_unified_sessions.sql)
100090 (42P18): Duplicate row detected during DML action
Row Values: INTERNAL ERROR
compiled code at /app/target/run/snowplow_unified/models/sessions/snowplow_unified_sessions.sql

This error appears to be linked to the newly added mobile_language_name column, together with the fact that deduplication is only performed for Postgres and Spark, but not for Snowflake.
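For context, the usual fix on Snowflake is to deduplicate per session key before the merge (e.g. with QUALIFY ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) = 1). A minimal Python sketch of that keep-one-row-per-key logic, with illustrative column names rather than the package's actual schema:

```python
from collections import OrderedDict

def dedupe_latest(rows, key, ts):
    """Keep one row per key, preferring the latest timestamp.

    Mirrors the SQL pattern
        QUALIFY ROW_NUMBER() OVER (PARTITION BY key ORDER BY ts DESC) = 1
    which Snowflake supports natively.
    """
    best = OrderedDict()
    for row in rows:
        k = row[key]
        if k not in best or row[ts] > best[k][ts]:
            best[k] = row
    return list(best.values())

# Hypothetical session rows with a duplicate session_identifier,
# differing only in mobile_language_name (the suspected trigger):
rows = [
    {"session_identifier": "s1", "load_tstamp": 1, "mobile_language_name": "en"},
    {"session_identifier": "s1", "load_tstamp": 2, "mobile_language_name": "en-US"},
    {"session_identifier": "s2", "load_tstamp": 1, "mobile_language_name": "de"},
]
deduped = dedupe_latest(rows, "session_identifier", "load_tstamp")
print([r["session_identifier"] for r in deduped])  # → ['s1', 's2']
```

With a guard like this applied in the *_this_run model for Snowflake as well, the downstream merge would only ever see one source row per session.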

Steps to reproduce

Set the dbt_project.yml to the following:

vars:
  snowplow_unified:
    snowplow__events_table: ...
    snowplow__atomic_schema: ...
    snowplow__enable_iab: true
    snowplow__enable_ua: true
    snowplow__enable_yauaa: true
    snowplow__enable_mobile_context: true
    snowplow__enable_application_context: true
    snowplow__start_date: '2024-06-01'
    snowplow__backfill_limit_days: 30
    snowplow__session_timestamp: load_tstamp
    snowplow__session_stitching: false

Execute dbt run --select package:snowplow_unified.

Expected results

Expected no errors from the snowplow_unified_sessions model.

Actual results

Receiving an error:

Database Error in model snowplow_unified_sessions (models/sessions/snowplow_unified_sessions.sql)
100090 (42P18): Duplicate row detected during DML action
Row Values: INTERNAL ERROR
compiled code at /app/target/run/snowplow_unified/models/sessions/snowplow_unified_sessions.sql

Screenshots and log output

System information

The contents of your packages.yml file:

packages:
  - package: dbt-labs/dbt_utils
    version: 1.3.0
  - package: dbt-labs/codegen
    version: 0.12.1
  - package: dbt-labs/dbt_project_evaluator
    version: 0.14.3
  - package: dbt-labs/audit_helper
    version: 0.12.0
  - package: calogica/dbt_expectations
    version: 0.10.4
  - package: dbt-labs/dbt_external_tables
    version: 0.9.0
  - package: Snowflake-Labs/dbt_constraints
    version: [">=1.0.0", "<1.1.0"]
  - package: snowplow/snowplow_unified
    version: 0.5.0
  - package: elementary-data/elementary
    version: 0.16.1

Which database are you using dbt with? Snowflake

The output of dbt --version:

Core:
  - installed: 1.8.8
  - latest:    1.8.8 - Up to date!

Plugins:
  - snowflake: 1.8.4 - Up to date!

The operating system you're using: macOS 15.0.1 (24A348) OR Debian Bookworm (python:3.9-slim Docker image)

The output of python --version: Python 3.9.19

Additional context

Possible problem with line of code here: https://github.com/snowplow/dbt-snowplow-unified/blob/1223a5f9310ff5c215970309c4cedd766eaa1130/models/sessions/scratch/snowplow_unified_sessions_this_run.sql#L465 and here: https://github.com/snowplow/dbt-snowplow-unified/blob/1223a5f9310ff5c215970309c4cedd766eaa1130/models/sessions/scratch/snowplow_unified_sessions_this_run.sql#L485

Are you interested in contributing towards the fix?

Unfortunately, I have no time to do it directly.

jacobus-herman commented 2 weeks ago

We also found a duplicate in the seed file here. This is probably not intended.

dkbrkjni commented 3 days ago

I think this is the same issue we see in Databricks after updating to version 0.5.0.

This is the log from a failed run in Databricks:

Database Error in model snowplow_unified_sessions (models/sessions/snowplow_unified_sessions.sql)
[DELTA_MULTIPLE_SOURCE_ROW_MATCHING_TARGET_ROW_IN_MERGE] Cannot perform Merge as multiple source rows matched and attempted to modify the same target row in the Delta table in possibly conflicting ways. By SQL semantics of Merge, when multiple source rows match on the same target row, the result may be ambiguous as it is unclear which source row should be used to update or delete the matching target row. You can preprocess the source table to eliminate the possibility of multiple matches.
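This matches the Snowflake symptom: Delta's MERGE refuses to proceed when more than one source row matches the same target key, and the remedy it suggests ("preprocess the source table") is exactly the deduplication step that appears to be missing. A toy Python sketch of that merge semantics, with illustrative names rather than the package's actual code:

```python
from collections import Counter

def merge_upsert(target, source, key):
    """Toy model of Delta MERGE semantics: refuse to merge when multiple
    source rows match the same target key, as Databricks does above."""
    dup_keys = [k for k, n in Counter(r[key] for r in source).items() if n > 1]
    if dup_keys:
        raise ValueError(f"multiple source rows matched target keys: {dup_keys}")
    merged = {r[key]: r for r in target}
    merged.update({r[key]: r for r in source})  # unambiguous upsert
    return list(merged.values())

target = [{"session_identifier": "s1", "sessions": 1}]
# Duplicate session rows in the source trigger the ambiguity:
source = [
    {"session_identifier": "s1", "sessions": 2},
    {"session_identifier": "s1", "sessions": 3},
]
try:
    merge_upsert(target, source, "session_identifier")
except ValueError as e:
    print(e)  # multiple source rows matched target keys: ['s1']
```

Deduplicating the source first (as in the Snowflake case above) makes every key match at most once, and the merge succeeds.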