Closed: bastienboutonnet closed this issue 2 years ago.
@bastienboutonnet : thanks for doing this write up!
My plan is as follows:

- Add an `ingest` command to the CLI.
- Parse the `manifest.json` and `run_results.json` to get the required test and model info from the dbt artifacts.
- Map the parsed dbt tests and models to a Soda test result:
  - `test_run_at` <- `TimingInfo`
  - `test_result` <- `TestStatus`
  - `database` and `schema` <- `HasRelationMetadata`
  - `table` <- `ParsedNodeMandatory.alias`
  - `compiled_sql` <- `RemoteCompileResultMixin`
  - `severity` <- `TestConfig.Severity`
- Tests and models are matched as follows:
  - `run_results` to `manifest` using the `unique_id`
  - `depends_on` in the `manifest` using the `unique_id`
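The matching step above can be sketched with plain dictionaries. The artifact fragments below are hypothetical and heavily trimmed; real `manifest.json` and `run_results.json` files carry many more fields, and key names vary across dbt versions:

```python
# Trimmed, hypothetical fragments of the two dbt artifacts.
manifest = {
    "nodes": {
        "test.jaffle_shop.not_null_orders_id": {
            "resource_type": "test",
            "database": "analytics",
            "schema": "dbt_prod",
            "depends_on": {"nodes": ["model.jaffle_shop.orders"]},
        },
        "model.jaffle_shop.orders": {
            "resource_type": "model",
            "alias": "orders",
        },
    }
}

run_results = {
    "results": [
        {"unique_id": "test.jaffle_shop.not_null_orders_id", "status": "pass"},
    ]
}


def map_test_results(manifest: dict, run_results: dict) -> list[dict]:
    """Join each entry in run_results to its test node and parent model.

    The join key is the unique_id: run_results -> manifest node, then the
    test node's depends_on -> the model node it tests.
    """
    soda_results = []
    for result in run_results["results"]:
        test_node = manifest["nodes"][result["unique_id"]]
        # A test depends on the model(s) it checks; take the first parent here.
        model_id = test_node["depends_on"]["nodes"][0]
        model_node = manifest["nodes"][model_id]
        soda_results.append(
            {
                "test_result": result["status"],
                "database": test_node["database"],
                "schema": test_node["schema"],
                "table": model_node["alias"],
            }
        )
    return soda_results
```

In the real implementation these shapes would come from dbt's own parsing classes (`TimingInfo`, `TestStatus`, `HasRelationMetadata`, and so on) rather than raw dictionaries.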
Hey @JCZuurmond, I think your plan sounds really good to me!

Where I see there will be a little back and forth might be around the mapping work, for two reasons:

- `ParsedNodeMandatory.alias`: it sounds strange to me that it's an alias, but it may just be strange naming. As far as I know, the `manifest.json` has fields for `database`, `schema`, and another one for the actual table name, but since you're using dbt internals to parse it, it may end up there. I haven't gone deep into their objects, but if you checked then it's fine.
- The `TestResult` that is currently implemented in Soda needs a scan time. We're discussing with @dirkgroenen whether we will open a new set of APIs on the backend that relax that contract, but there are quite a few things that the backend needs to figure out, as it forces the backend to come up with a new way to manage the alert notification lifecycle. For now we'll stick with the current `TestResults` contract so that you don't have to worry about this for the time being.

Quick update on one of the bullet points I made earlier:

> The `TestResult` that is currently implemented in Soda needs a scan time. We're discussing with @dirkgroenen whether we will open a new set of APIs on the backend that relax that contract, but there are quite a few things that the backend needs to figure out, as it forces the backend to come up with a new way to manage the alert notification lifecycle.
The backend won't commit, at this point, to have those new API endpoints ready for this version, so we'll have to make do with the current limitations. This means that when we build the `TestResult` object, we will need to come up with a `scan_time` to satisfy the requirements of the backend API contract.

Is there some kind of time available in the manifest that we can use? We could use the time of the run from the `run_results.json`.

I believe yes.

Note: we do not use the time mentioned in the `run_results`. The time mentioned there is per test, whereas we want one per table. The current implementation inserts the time at which the dbt `ingest` is executed.
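That caveat in a minimal sketch (the function name is hypothetical):

```python
from datetime import datetime, timezone


def ingest_scan_time() -> datetime:
    """Pick one scan time for the whole ingest.

    run_results.json records a timing per test, but the backend contract
    wants a single scan time, so we stamp the moment ingest runs instead
    of reusing the per-test timings.
    """
    return datetime.now(tz=timezone.utc)


scan_time = ingest_scan_time()
```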
I think for now, that will be a caveat we have to live with...
We'll have to make sure to document it quite well. I'll add that to my list when I talk with @janet-can for documenting this feature.
I think in a second iteration, @vijaykiran, we might want to have a tighter integration indeed.
I'm copying one of @JCZuurmond's messages from the community Slack here for posterity:

> An alternative, interesting approach would be to replace the dbt test functionality with the Soda test functionality (similar to this repo: https://github.com/dmateusp/dbt-opentelemetry). The way of working I am envisioning is that a user installs dbt-soda and all tests will be executed through the Soda engine instead of dbt. The benefit of this tight integration is that it resolves the issue with executing SQL twice and the delays between executing SQL and uploading the results to the Soda cloud (see the issue for a more detailed explanation). The big downside is that it requires some monkey patching of the dbt test functionality, which might be hard to maintain, depending on how stable dbt is.
>
> An alternative to the alternative is to implement the issue Bastien created, and that we find a way to add a post hook to the dbt test that uploads the results to the Soda cloud. I created an issue for this: https://github.com/dbt-labs/dbt-core/issues/4333. It will not solve the issue of running SQL twice, but it minimizes the delay mentioned before and reduces the complexity of uploading the dbt artifacts to the Soda cloud.
Good to know, @bastienboutonnet. This is now on my radar for documentation.
Context
Users who use dbt have spent time crafting tests in their transformations. Because we don’t want users to repeat themselves to benefit from Soda Cloud’s visibility and incident management, we plan to:
User Flow
Users should be able to call a command such as `soda ingest --tool dbt`, provide soda-sql with paths to the dbt artifacts, and have their test results show up in the Soda Cloud UI. We will have to be clear in our documentation that ingestion should be orchestrated once dbt has been invoked and the transformations have completed. soda-sql (at least in this first version) will not integrate with the core dbt API. This means it will not orchestrate a dbt run or be called by dbt. Those scenarios may be revisited in further iterations of our integration.

We may also consider creating a new `soda` command such as `soda capture`, which would run `soda scan` and follow it with `soda ingest`, or a flag on `soda scan` such as `--ingest dbt`. Up for discussion and TBD once the minimal isolated approach is working. This is sugar coating at this point.

Overall Approach
In this v1, we want to keep it simple. dbt generates two valuable artifacts: `manifest.json` and `run_results.json`. When it comes to test results, we can parse the results from the `run_results.json` and use the info in the `manifest.json` to gain knowledge about the table (dataset in Soda's language) that it maps on to.

Once we have parsed all the info we need (see below), we can compose a `TestObject` similar (but not identical) to the soda-sql `TestObject`, which is then exposed to the Soda Cloud backend API. The UI will show the ingested tests on the dataset view; those tests will be clearly identified as `external` and display a dbt logo or whatever UI choice makes sense.

Specific Requirements
We will want to parse and represent the following information from dbt test results:

- `test_run_at`: timestamp of when the test was run.
- `test_result`: whether the test is passing or failing.
- `severity`: if this info is available (I know it is for freshness but maybe not for all).
- `database`
- `schema`
- `table`
- `tool_name`: in this case it would be `dbt` (this information will be used by the Soda Cloud UI to display the test provenance and govern available attributes and details).
- `compiled_sql`: dbt should give us access to the compiled SQL snippet that it ran for the test. We may want to parse this, as we can show it in the Soda Cloud UI and it is valuable information that users can use as a starting point for investigation and issue resolution. If users have enabled `store_failures`, we should be able to obtain the `select` statement where they can visualise those failures. If this is possible, we could consider adding those to the requirements. More info on this can be found here: https://docs.getdbt.com/docs/building-a-dbt-project/tests#storing-test-failures

Conscious Limitations
- `{{ ref() }}` in their declaration, in which case the mapping should be available in a structured way just like any other tests (`manifest.json` and `run_results.json`)
- `store_failures` (this may be something we offer as a stretch goal, as it fits our `failed_rows` model; see the last point of Specific Requirements)