Open bastienboutonnet opened 2 years ago
@JCZuurmond is that something you might want to look into once the PR you currently have is merged?
Sure, I do not understand the issue fully yet.
General questions: the goal is to send rows for which the test failed to the Soda cloud? And, we do this automagically if the store-failures
flag is true
?
If so, I think we should not do this automagically, because the failed records are stored in the warehouse in which the data already resides, which is (legally) different from sending it to a new system (Soda cloud). I think we should enable this with a flag from the soda side, like the samples
, and make it independent of dbt's store-failures
.
want to ingest the select statement that dbt produces when store_failures is enabled
The select
statement mentioned here is the compiled_sql
field, right?
The test results are stored in the dbt_test__audit
schema/database when the store-failures
is enabled. I do not know if this name can be customized, what is the case anyway: dbt prefixes the schema with the target name - by default and this logic can be customized. Given that the schema is dynamically changed, the target name is not present in the manifest (I think) and that this logic can be customized, I think we can not use the dbt_test__audit
tables which already contain the failed rows. This means that we have to re-execute the sql to retrieve the failed rows.
The approach I see that could work is:
Some remarks here are:
samples
from Soda: only the columns relevant to the test are selected and depending on the test it might not look like the original table schema at all.ingest
now only requires the manifest.Altogether, I am ok with these remarks, but we should be clear about them. A nicer solution would be to find a way to replace the dbt test
with a (dbt) soda test
command completely.
@JCZuurmond at this stage we simply want to ingest the select statement and show it to users. We're not going to be retrieving these failed rows (at least for now).
If we want to do a more tight integration similar to https://github.com/dmateusp/dbt-opentelemetry and what not then we will adjust this but this is out of scope for this MVP.
The relation_name
field of the manifest entry contains the table in which the failed rows would be stored from what I could gather:
"relation_name":"intelligence_dwh.dbt_test__audit.unique_base_soda_sql_events_trace_id",
so we could generate f"select * from {manifest['relation_name']}
or something similar.
Does this make sense?
Ah, nice! Well spotted. We could do that, yes. Still note my remark about that these tables are different from the Soda samples, they contain the output of the SQL used for testing, which does not include the whole row for which the test failed.
Where do we add this info? To the Test
or TestResult
object?
@JCZuurmond by the info you mean the select statement right?
We have a "Failed Rows" section in which we can generate a message. Check the Slite page link I shared with you. Scroll down quite a bit and there is a section called "Failed rows handling" where @Antoninj explains the flow. It relies on another API dedicated to failed rows
It is similar to sending failed rows from a soda SQL test so I think there should be code that you can hook into.
@vijaykiran should know.
If it's still not clear feel free to ping him.
@JCZuurmond by the info you mean the select statement right?
yes
Ok, and what do you think about the proposal?
I do not see the benefit of showing this 'select' statement atm.
How I see it: we can benefit from the stored failures by not having to re-compute for which rows a test fails. We get those rows from the stored failures instead.
However, if we only show the SQL to select the failed rows, why not simply show the SQL of the test that filtered those failed rows in the first place?
The downside of the failed rows is that inconsistencies might occur when data changes after execution of the test. If you execute the SQL that is the test, you known that you always look at the latest data.
I see your points @JCZuurmond here are some counter points that I think should help us make a final call:
we can benefit from the stored failures by not having to re-compute for which rows a test fails. We get those rows from the stored failures instead.
Right now, this mimics the flow that a dbt user would be familiar with. That is:
select * from failures
Doing this allows us to get on full parity with the dbt functionality without having to build any extra logic which would actually call this select
statement and push the failing rows into our UI. This mechanism could be something we look into building as a follow up but is not, in my opinion, necessary for us to ship a first version of this integration.
However, if we only show the SQL to select the failed rows, why not simply show the SQL of the test that filtered those failed rows in the first place?
It will also be shown in the failed_row
when users do not have the store_failures
turned on. I think when people have the failed rows, it would be a shame not to go all the way as dbt does and surface the select
statement where those are stored.
I hope this makes the motivations a bit clearer.
@vijaykiran you were going to work on adding the failed rows change today, correct? If so, after those points are raised how do you feel about it?
Ok, if we want to show the SQL that filters the rows of a certain test, I think it makes sense to go this route:
SELECT * FROM <stored failures>
.1.
is not present: show compiled SQL of test.Awesome! And down the line we might consider pumping the failed rows from that select into the UI but I would see it as a separate exercice.
As part of sodadata/soda-sql#160 we will want to ingest the
select
statement that dbt produces whenstore_failures
is enabled.This select statement, just like the compiled SQL, can be found in the
manifest.json
and retrievable by using the test unique id.The
manifest.json
info for the test whenstore_failures
is enabled looks something like this and the name of the relation where the failures was stored is contained in therelation_name
field which we can use to doselect * from <relation_name>
. That being said I only had a very quick look so do check.The ingestion flow for this statement should leverage the
noFileMessage
property from thesodaSqlScanFile
mechanism (cf https://github.com/sodadata/soda/issues/3265: