bastienboutonnet commented 2 years ago

As part of sodadata/soda-sql#160 we will want to ingest the select statement that dbt produces when store_failures is enabled.

This select statement, just like the compiled SQL, can be found in the manifest.json and retrievable by using the test unique id.

The manifest.json info for the test when store_failures is enabled looks something like this and the name of the relation where the failures was stored is contained in the relation_name field which we can use to do select * from <relation_name>. That being said I only had a very quick look so do check.

{
   "test.intelligence_dwh.unique_base_soda_sql_events_trace_id.eb689ebf18":{
      "raw_sql":"{{ test_unique(**_dbt_generic_test_kwargs) }}",
      "test_metadata":{
         "name":"unique",
         "kwargs":{
            "column_name":"trace_id",
            "model":"{{ get_where_subquery(ref('base_soda_sql_events')) }}"
         },
         "namespace":null
      },
      "compiled":true,
      "resource_type":"test",
      "depends_on":{
         "macros":[
            "macro.dbt.test_unique",
            "macro.dbt.get_where_subquery",
            "macro.dbt.should_store_failures",
            "macro.dbt.statement"
         ],
         "nodes":[
            "model.intelligence_dwh.base_soda_sql_events"
         ]
      },
      "config":{
         "enabled":true,
         "alias":null,
         "schema":"dbt_test__audit",
         "database":null,
         "tags":[

         ],
         "meta":{

         },
         "materialized":"test",
         "severity":"ERROR",
         "store_failures":true,
         "where":null,
         "limit":null,
         "fail_calc":"count(*)",
         "warn_if":"!= 0",
         "error_if":"!= 0"
      },
      "database":"intelligence_dwh",
      "schema":"dbt_test__audit",
      "fqn":[
         "intelligence_dwh",
         "soda_sql_telemetry",
         "unique_base_soda_sql_events_trace_id"
      ],
      "unique_id":"test.intelligence_dwh.unique_base_soda_sql_events_trace_id.eb689ebf18",
      "package_name":"intelligence_dwh",
      "root_path":"/Users/bastienboutonnet/repos/intelligence_analytics/intelligence_dwh",
      "path":"unique_base_soda_sql_events_trace_id.sql",
      "original_file_path":"models/soda_sql_telemetry/schema.yaml",
      "name":"unique_base_soda_sql_events_trace_id",
      "alias":"unique_base_soda_sql_events_trace_id",
      "checksum":{
         "name":"none",
         "checksum":""
      },
      "tags":[

      ],
      "refs":[
         [
            "base_soda_sql_events"
         ]
      ],
      "sources":[

      ],
      "description":"",
      "columns":{

      },
      "meta":{

      },
      "docs":{
         "show":true
      },
      "patch_path":null,
      "compiled_path":"target/compiled/intelligence_dwh/models/soda_sql_telemetry/schema.yaml/unique_base_soda_sql_events_trace_id.sql",
      "build_path":"target/run/intelligence_dwh/models/soda_sql_telemetry/schema.yaml/unique_base_soda_sql_events_trace_id.sql",
      "deferred":false,
      "unrendered_config":{
         "store_failures":true
      },
      "created_at":1638877370.9161742,
      "compiled_sql":"\n    \n    \n\nselect\n    trace_id as unique_field,\n    count(*) as n_records\n\nfrom intelligence_dwh.soda_sql_telemetry.base_soda_sql_events\nwhere trace_id is not null\ngroup by trace_id\nhaving count(*) > 1\n\n\n",
      "extra_ctes_injected":true,
      "extra_ctes":[

      ],
      "relation_name":"intelligence_dwh.dbt_test__audit.unique_base_soda_sql_events_trace_id",
      "column_name":"trace_id",
      "file_key_name":"models.base_soda_sql_events"
   }
}

The ingestion flow for this statement should leverage the noFileMessage property from the sodaSqlScanFile mechanism (cf https://github.com/sodadata/soda/issues/3265:

bastienboutonnet commented 2 years ago

@JCZuurmond is that something you might want to look into once the PR you currently have is merged?

JCZuurmond commented 2 years ago

Sure, I do not understand the issue fully yet.

The why

General questions: the goal is to send rows for which the test failed to the Soda cloud? And, we do this automagically if the store-failures flag is true?

If so, I think we should not do this automagically, because the failed records are stored in the warehouse in which the data already resides, which is (legally) different from sending it to a new system (Soda cloud). I think we should enable this with a flag from the soda side, like the samples, and make it independent of dbt's store-failures.

The how

want to ingest the select statement that dbt produces when store_failures is enabled

The select statement mentioned here is the compiled_sql field, right?

The test results are stored in the dbt_test__audit schema/database when the store-failures is enabled. I do not know if this name can be customized, what is the case anyway: dbt prefixes the schema with the target name - by default and this logic can be customized. Given that the schema is dynamically changed, the target name is not present in the manifest (I think) and that this logic can be customized, I think we can not use the dbt_test__audit tables which already contain the failed rows. This means that we have to re-execute the sql to retrieve the failed rows.

The approach I see that could work is:

Get the compiled sql from the manifest
Run the sql against the warehouse
Fetch the output
Send the output to the Soda cloud

Some remarks here are:

As noted before, this means that that query is executed twice: once by dbt and once by soda.
Because there could be a (big) delay between when the test a ran by dbt and ingested to soda, the data might have changed in between, thus maybe resulting different rows
The compiled SQL is not exactly the same as the samples from Soda: only the columns relevant to the test are selected and depending on the test it might not look like the original table schema at all.
it requires a working warehouse connection, where the ingest now only requires the manifest.

Altogether, I am ok with these remarks, but we should be clear about them. A nicer solution would be to find a way to replace the dbt test with a (dbt) soda test command completely.

bastienboutonnet commented 2 years ago

@JCZuurmond at this stage we simply want to ingest the select statement and show it to users. We're not going to be retrieving these failed rows (at least for now).

If we want to do a more tight integration similar to https://github.com/dmateusp/dbt-opentelemetry and what not then we will adjust this but this is out of scope for this MVP.

The relation_name field of the manifest entry contains the table in which the failed rows would be stored from what I could gather:

"relation_name":"intelligence_dwh.dbt_test__audit.unique_base_soda_sql_events_trace_id",

so we could generate f"select * from {manifest['relation_name']} or something similar.

Does this make sense?

JCZuurmond commented 2 years ago

Ah, nice! Well spotted. We could do that, yes. Still note my remark about that these tables are different from the Soda samples, they contain the output of the SQL used for testing, which does not include the whole row for which the test failed.

Where do we add this info? To the Test or TestResult object?

bastienboutonnet commented 2 years ago

@JCZuurmond by the info you mean the select statement right?

We have a "Failed Rows" section in which we can generate a message. Check the Slite page link I shared with you. Scroll down quite a bit and there is a section called "Failed rows handling" where @Antoninj explains the flow. It relies on another API dedicated to failed rows

It is similar to sending failed rows from a soda SQL test so I think there should be code that you can hook into.

@vijaykiran should know.

If it's still not clear feel free to ping him.

JCZuurmond commented 2 years ago

@JCZuurmond by the info you mean the select statement right?

yes

bastienboutonnet commented 2 years ago

Ok, and what do you think about the proposal?

JCZuurmond commented 2 years ago

I do not see the benefit of showing this 'select' statement atm.

How I see it: we can benefit from the stored failures by not having to re-compute for which rows a test fails. We get those rows from the stored failures instead.

However, if we only show the SQL to select the failed rows, why not simply show the SQL of the test that filtered those failed rows in the first place?

The downside of the failed rows is that inconsistencies might occur when data changes after execution of the test. If you execute the SQL that is the test, you known that you always look at the latest data.

bastienboutonnet commented 2 years ago

I see your points @JCZuurmond here are some counter points that I think should help us make a final call:

we can benefit from the stored failures by not having to re-compute for which rows a test fails. We get those rows from the stored failures instead.

Right now, this mimics the flow that a dbt user would be familiar with. That is:

run tests
get a message that some tests failed and that they can consult the failures at select * from failures

Doing this allows us to get on full parity with the dbt functionality without having to build any extra logic which would actually call this select statement and push the failing rows into our UI. This mechanism could be something we look into building as a follow up but is not, in my opinion, necessary for us to ship a first version of this integration.

However, if we only show the SQL to select the failed rows, why not simply show the SQL of the test that filtered those failed rows in the first place?

It will also be shown in the failed_row when users do not have the store_failures turned on. I think when people have the failed rows, it would be a shame not to go all the way as dbt does and surface the select statement where those are stored.

I hope this makes the motivations a bit clearer.

@vijaykiran you were going to work on adding the failed rows change today, correct? If so, after those points are raised how do you feel about it?

JCZuurmond commented 2 years ago

Ok, if we want to show the SQL that filters the rows of a certain test, I think it makes sense to go this route:

Show SELECT * FROM <stored failures>.
Fall back if 1. is not present: show compiled SQL of test.

bastienboutonnet commented 2 years ago

Awesome! And down the line we might consider pumping the failed rows from that select into the UI but I would see it as a separate exercice.

sodadata / soda-sql

Ingest the `select` statement that dbt provides when `store_failures` is enabled. #165

The why

The how