sodadata / soda-core

:zap: Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
https://go.soda.io/core-docs
Apache License 2.0
1.82k stars 193 forks source link

Contract identity issue #2061

Open tombaeyens opened 2 months ago

tombaeyens commented 2 months ago

We found a problem with the lib api. When we have multiple checks we have an error that is considered an execution error. You can find the error in the source. We thought that maybe the error was being added to the logs, but the code after the log fixed the issues, so the logs shouldn't exist. For now the solution will be to add the field identity_suffix into the check. But I think this should be fixed on SODA side, or using more field to create the ID or not log the error and use the approach with index.

contract.py

logs.error(f"Duplicate check identity '{check_identity}': {other_check.location} and {check_location}")

Another question: Multiple checks need the field metric to be defined; what is the use for that field? Will that field always be mandatory?

tools-soda commented 2 months ago

SAS-3274

tombaeyens commented 2 months ago

The problem with the current identity generation is that dataset level checks often will need an identity_suffix. We need to improve the default behavior so that in less circumstances the user needs to fiddle with the identity.

tombaeyens commented 2 months ago

Goal: The goal is create a better correlation mechanism between checks in the files and checks in Soda Cloud. We want to impose the least amount of burden on the user and avoid the need for users to configure correlation ids themselves. But we also want to ensure that users can edit contract files freely and move checks around while still preserving the identity. Not preserving the identity causes a new check on Soda Cloud being created, loosing the metric history and historic check results.

Proposed solution 1:

The check identity is composed of

Users then only have to specify a correlation_id if the scope and check type specific correlation properties do not distinctly identify a check in the source YAML file.

Proposed solution 2:

Check identity is a composition of

We can use the check name to provide uniqueness. But in that case, users have to know that changing the name potentially will change the check id and hence break history in Soda Cloud.

aabf commented 2 months ago

In the hellofresh use case, the name is handled as a unique ID, and the user must fill it. I think the solution 2 can be the best one. We can write our docs explaining that the name is also linked to the soda cloud, and by changing the name, the history will be lost. But I think the users expect that because it is a unique ID for us.

tombaeyens commented 2 months ago

More analysis notes:

Solution 2 would also work best on the Soda Cloud backend.
Based on the correlation properties, contracts lib should create an identity property in the genrated SodaCL using a hash (and not the full property serialized text) We can offer a renaming check workflow through an name_deprecated property. That should translate to an identity_deprecated in SodaCL. Soda Cloud backend work should be planned to support the identity migration. The changes needed in Soda Cloud backend should not be too big as the data model is matching this strategy.