sodadata / soda-sql

Soda SQL and Soda Spark have been deprecated and replaced by Soda Core. docs.soda.io/soda-core/overview.html
https://docs.soda.io/
Apache License 2.0
59 stars 16 forks source link

Allow users to ingest dbt test results into Soda Cloud via soda-sql #160

Closed bastienboutonnet closed 2 years ago

bastienboutonnet commented 2 years ago

Context

Users who use dbt have spent time crafting tests in their transformations. Because we don’t want users to repeat themselves to benefit from Soda Cloud’s visibility and incident management we plan to:

User Flow

Users should be able to call a command such as soda ingest --tool dbt, provide soda-sql with paths to the dbt artifacts, and have their tests results show up in the Soda Cloud UI. We will have to be clear in our documentation that ingestion should be orchestrated once dbt has been invoked and the transformations have completed. Soda-sql (at least in this first version) will not integrate with the core dbt API. This means it will not orchestrate a dbt run or be called by dbt. Those scenarios may be revisited in further iterations of our integration.

We may also consider creating a new soda command such as soda capture which would run soda scan and follow it with soda ingest or a flag on soda scan such as --ingest dbt. Up for discussion and TBD once minimal isolated approach is working. This is sugar coating at this point.

Overall Approach

In this v1, we want to keep it simple. dbt generates two valuable artifacts: manifest.json and run_results.json. When it comes to test results, we can parse the results from the run_results.json and use the info in the manifest.json to gain knowledge about the table (dataset in Soda's language) that it maps on to.

Once we have parsed all the info we need (see below) we can compose a TestObject similar (but not exactly) to the soda-sql TestObject that is then exposed to the Soda Cloud backend API. The UI will show the ingested tests on the dataset view, those tests will be clearly identified as external and display a dbt logo or whatever UI choice makes sense.

Specific Requirements

We will want to parse and represent the following information from dbt test results:

Concious Limitations

JCZuurmond commented 2 years ago

@bastienboutonnet : thanks for doing this write up!

My plan is as follows:

  1. Add an ingest command to the cli
  2. Parse the manifest.json and run_results.json to get the required test and model info from the dbt artifacts.
  3. Map the parsed dbt tests and models to a Soda test result

    Tests and models are matched as follows:

    1. Get tests from run_results
    2. Search for test in manifest using the unique_id
    3. Get the models from the depends_on
    4. Look for models in the manifest using the unique_id
  4. Flush the test results to the Soda cloud
bastienboutonnet commented 2 years ago

Hey @JCZuurmond, I think your plan sounds really good to me!

Where I see there will be a little back and forth might be around the mapping work for two reasons:

bastienboutonnet commented 2 years ago

Quick update on one of the bullet points I made earlier:

the TestResult that is currently implemented in soda, needs a scan time. We're discussion with @dirkgroenen whether we will open a new set of APIs on the backend that relax that contract but there are quite a few things that the backend needs to figure out with this as it forces the backend to come up with a new way to manage alert notifications lifecycle.

The backend won't commit, at this point, to have those new API endpoints ready for this version. So we'll have to make do with the current limitations. This means:

vijaykiran commented 2 years ago

Is there some kind of time available in manifest that we can use?

bastienboutonnet commented 2 years ago

We could use the time of the run from the run_results.json I believe yes.

JCZuurmond commented 2 years ago

Note: we do not use the time mentioned in the run_results. The time mentioned there is per test, where we want one per table. The current implementation inserts the time for when the dbt ingest is executed.

bastienboutonnet commented 2 years ago

I think for now, that will be a caveat we have to live with...

We'll have to make sure to document it quite well. I'll add that to my list when I talk with @janet-can for documenting this feature.

I think in a second iteration @vijaykiran we might want to have a tighter integration indeed.

I'm copying one of @JCZuurmond 's message from the community slack here for posterity:

An alternative, interesting approach would be to replace the dbt test functionality with the soda test functionality (similar to this repo https://github.com/dmateusp/dbt-opentelemetry). The way-of-working I am envisioning is that a user install dbt-soda and all test will be executed trough the Soda engine instead of dbt. The benefit of this tight integration is that it resolves the issue with executing SQL twice and delays between execution of SQL and uploading it to the Soda cloud (see issue for more detailed explanation). The big downside is that it requires some monkey patching of the dbt test functionality, which might be hard to maintain, depending on how stable dbt is.

An alternative to the alternative is to implement the issue Bastien created, and that we find a way to add a post hook to the dbt test that uploads the results to the Soda cloud. I created an issue for this: https://github.com/dbt-labs/dbt-core/issues/4333. It will not solve the running SQL twice issue, but it minimizes the delay mentioned before and reduces the complexity of uploading the dbt artifacts to the Soda cloud.

janet-can commented 2 years ago

Good to know, @bastienboutonnet. This is now on my radar for documentation.