sodadata / soda-core

:zap: Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
https://go.soda.io/core-docs
Apache License 2.0

Define link between contract and data source #2167

Open tombaeyens opened 2 days ago

tombaeyens commented 2 days ago

Options

  1. A contract file refers to all required files via relative file path references.
  2. A contract refers to the data source by name. Data sources are defined in other files.
  3. A contract does not refer to a data source. The connection is made in the API, e.g. "run this contract on this data source".

Related use cases:

Use case: During contract verification, create the connection to the data source.

When verifying a contract, the implementation needs to know all the data source configurations in order to create a SQL (DBAPI) connection to the data source. This is related to https://github.com/sodadata/soda-core/issues/2164
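
A minimal sketch of this use case, assuming option (2): the contract carries only a data source name, and verification looks up the matching configuration to open a DBAPI connection. The `DATA_SOURCES` registry, the `data_source` contract key, and `connect_for_contract` are hypothetical names for illustration, not soda-core APIs; `sqlite3` stands in for any DBAPI driver.

```python
import sqlite3

# Hypothetical registry of data source configurations, keyed by name.
# In practice these would be parsed from separate data source files.
DATA_SOURCES = {
    "local_sqlite": {"type": "sqlite", "database": ":memory:"},
}


def connect_for_contract(contract: dict, data_sources: dict):
    """Open a DBAPI connection for the data source that a contract names."""
    name = contract["data_source"]  # hypothetical contract key
    try:
        config = data_sources[name]
    except KeyError:
        raise ValueError(f"Unknown data source: {name!r}") from None
    if config["type"] != "sqlite":
        raise NotImplementedError(f"No driver wired up for {config['type']!r}")
    return sqlite3.connect(config["database"])


contract = {"data_source": "local_sqlite", "dataset": "customers"}
connection = connect_for_contract(contract, DATA_SOURCES)
row = connection.execute("SELECT 1").fetchone()
connection.close()
```

Under option (3) the same lookup would disappear: the caller would pass the data source configuration (or an open connection) next to the contract instead of resolving it by name.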

Use case: Collecting the files for a contract verification

When triggering contract verification with local files to be executed on the agent, all the files for contract verification have to be collected and sent to the agent.

This use case favors option (1). It helps to have actual file references from the contract file to the data source and all other included files, so that they can be collected easily without a full scan of the file base.
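
A rough sketch of how collection could work under option (1), assuming the relative references have already been read out of the contract file (how they are declared in the YAML is left open). `collect_contract_files` is a hypothetical helper, and paths are treated as POSIX-style repository paths.

```python
import posixpath
from pathlib import PurePosixPath


def collect_contract_files(contract_path: str, references: list[str]) -> list[str]:
    """Return the contract file plus every file it references.

    Relative references are resolved against the directory that
    contains the contract file, then normalized (".." collapsed).
    """
    base_dir = PurePosixPath(contract_path).parent
    files = [posixpath.normpath(contract_path)]
    for reference in references:
        files.append(posixpath.normpath(str(base_dir / reference)))
    return files


files = collect_contract_files(
    "contracts/sales/customers.contract.yml",
    ["../../data_sources/postgres.yml", "shared/checks.yml"],
)
```

No directory scan is involved: the list of files to send to the agent follows directly from the contract file itself.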

An alternative for this use case is to create a new YAML configuration for a contract verification that points to the contract(s) and the data source.

Use case: Generating a contract skeleton file based on data source metadata

This may require a contract file path naming convention based on {data_source_name}/{database_name}/{schema_name}/{table_name}.contract.yml, where each name is transformed to replace characters that are invalid in file names on any of the common operating systems.
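
A minimal sketch of such a convention. The set of replaced characters and the `_` replacement are assumptions (roughly the characters Windows rejects, plus control characters); `sanitize` and `contract_file_path` are hypothetical names.

```python
import re
from pathlib import PurePosixPath

# Characters rejected in file names by at least one common operating system,
# plus ASCII control characters. The exact set is an assumption.
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitize(name: str) -> str:
    """Replace every character from the invalid set with an underscore."""
    return _INVALID_CHARS.sub("_", name)


def contract_file_path(
    data_source_name: str, database_name: str, schema_name: str, table_name: str
) -> PurePosixPath:
    """Build {data_source}/{database}/{schema}/{table}.contract.yml."""
    parts = [sanitize(n) for n in (data_source_name, database_name, schema_name)]
    return PurePosixPath(*parts) / f"{sanitize(table_name)}.contract.yml"


path = contract_file_path("postgres_prod", "sales", "public", "orders")
```

Note that this transformation is lossy: two distinct names such as `a:b` and `a/b` map to the same file name, and reserved names (for example `.` or Windows device names) are not handled here. A real convention would need a rule for both.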

If we force a naming pattern, the files don't have to be explicitly mentioned in the contract file. If we

Use case: Importing a git repository

When Soda Cloud connects to a git repo, a scan of all the files may be needed, depending on the option chosen. Likewise, depending on that choice, Soda Cloud may or may not have to maintain metadata that links contract files to data source config files.
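
As an illustration of the scanning variant, the sketch below assumes the forced naming pattern from the skeleton use case, so the link to a data source is derived from the path and needs no extra metadata. `index_contracts` is a hypothetical helper, not an existing Soda API.

```python
from pathlib import Path


def index_contracts(repo_root: Path) -> dict[str, list[str]]:
    """Group contract files by data source name.

    The data source name is taken from the top-level directory of each
    contract file's path relative to the repository root.
    """
    index: dict[str, list[str]] = {}
    for path in sorted(repo_root.rglob("*.contract.yml")):
        relative = path.relative_to(repo_root)
        if len(relative.parts) < 2:
            # Not inside a data source directory: the convention does not apply.
            continue
        index.setdefault(relative.parts[0], []).append(relative.as_posix())
    return index
```

With explicit references or name-based lookup instead of a path convention, the scan would also have to parse each contract file to find its data source.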

tools-soda commented 2 days ago

CLOUD-8489