sodadata / soda-core

:zap: Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io
https://go.soda.io/core-docs
Apache License 2.0

Define git sync strategy #2170

Open tombaeyens opened 1 day ago

tombaeyens commented 1 day ago


Initially, we'll build a mechanism that stores all files in Soda Cloud. The target is to add git repository synchronization later.

If we want git synchronization, the models have to match. That means Soda Cloud needs a list of repositories, even when only Soda Cloud storage is used. Each repository can contain data source and contract YAML files, and we will need editors for both file types.
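
A minimal sketch of what such a matching model could look like. All names here (`Repository`, `RepoFile`, `FileKind`, `git_remote_url`) are hypothetical and not taken from the Soda Cloud codebase; the only point is that a repository exists whether or not a git remote is linked.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class FileKind(Enum):
    """The two file types mentioned above (hypothetical enum)."""
    DATA_SOURCE = "data_source"
    CONTRACT = "contract"


@dataclass
class RepoFile:
    path: str
    kind: FileKind
    content: str


@dataclass
class Repository:
    name: str
    # Assumption: None means the repository lives in Soda Cloud storage only.
    git_remote_url: Optional[str] = None
    files: Dict[str, RepoFile] = field(default_factory=dict)

    @property
    def is_git_linked(self) -> bool:
        return self.git_remote_url is not None

    def add_file(self, file: RepoFile) -> None:
        # Files are keyed by path, as they would be in a git working tree.
        self.files[file.path] = file

    def paths_of_kind(self, kind: FileKind) -> List[str]:
        return sorted(p for p, f in self.files.items() if f.kind is kind)
```

With this shape, linking a git repository later would only mean setting `git_remote_url` on an existing `Repository`, not migrating to a different model.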

When a user opens a contract editor, the full repository has to be 'cloned': a copy of the full repository is made and associated with that user. The user changes the cloned files during the editor session, and saving creates a patch file. This patch file can be applied immediately to the Soda Cloud repo files, or it can be linked to a proposal for later merging.
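
The clone-then-patch flow could be sketched roughly as follows. This is an illustration only: it assumes repository files are held as a path-to-content mapping, and it uses Python's standard `difflib` to produce a unified diff rather than real git tooling. `clone_for_user` and `make_patch` are hypothetical names.

```python
import difflib
from typing import Dict


def clone_for_user(repo_files: Dict[str, str]) -> Dict[str, str]:
    """Return a per-user working copy; edits must not touch the original."""
    return dict(repo_files)


def make_patch(original: Dict[str, str], edited: Dict[str, str]) -> str:
    """Build one unified diff covering every file that changed.

    Simplifications: file contents are assumed to end with a newline, and a
    missing file is treated the same as an empty one.
    """
    chunks = []
    for path in sorted(set(original) | set(edited)):
        before = original.get(path, "").splitlines(keepends=True)
        after = edited.get(path, "").splitlines(keepends=True)
        if before != after:
            chunks.extend(
                difflib.unified_diff(
                    before, after, fromfile=f"a/{path}", tofile=f"b/{path}"
                )
            )
    return "".join(chunks)


# Usage: the user edits the working copy, then saving produces the patch.
repo_files = {"contracts/orders.yml": "dataset: orders\n"}
working_copy = clone_for_user(repo_files)
working_copy["contracts/orders.yml"] = "dataset: orders\ncolumns:\n  - name: id\n"
patch_text = make_patch(repo_files, working_copy)
```

The resulting `patch_text` is what would either be applied immediately or attached to a proposal.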

When a patch is saved, the Soda Cloud repository files are updated. If a git repository is linked, the updates are also pushed to git.
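
A sketch of that save path, under stated assumptions: the patch is represented here as a mapping from path to new content (with `None` meaning the file is deleted) instead of a real patch file, and the git push is an injected callable because the actual push mechanism is not decided. `apply_patch`, `save_patch`, and `push_to_git` are hypothetical names.

```python
from typing import Callable, Dict, Optional

# Assumption: path -> new content, or None to delete the file.
Patch = Dict[str, Optional[str]]


def apply_patch(cloud_files: Dict[str, str], patch: Patch) -> None:
    """Update the Soda Cloud copy of the repository files in place."""
    for path, content in patch.items():
        if content is None:
            cloud_files.pop(path, None)
        else:
            cloud_files[path] = content


def save_patch(
    cloud_files: Dict[str, str],
    patch: Patch,
    git_remote_url: Optional[str],
    push_to_git: Callable[[str, Dict[str, str]], None],
) -> bool:
    """Apply the patch to Soda Cloud first, then push only if git is linked.

    Returns True when a push was attempted, False for cloud-only storage.
    """
    apply_patch(cloud_files, patch)
    if git_remote_url is None:
        return False
    # Hand the pusher a snapshot so later edits cannot alter what was pushed.
    push_to_git(git_remote_url, dict(cloud_files))
    return True
```

Keeping the Soda Cloud update as the first step means the cloud-only and git-linked cases share the same code path up to the optional push.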

Concurrency handling: we need to decide whether to allow concurrent proposals and concurrent updates. Allowing them complicates the solution significantly: we would have to work with git patch files, which can result in conflicts that must be resolved in the Soda Cloud web UI editor. That pushes us toward building a web UI git client to do merges, which is complex and not core to our value. Alternatively, we could implement a single write lock in the Soda Cloud editor UI so that there is at most one editor at any time and conflicts are avoided.
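
To make the single-write-lock alternative concrete, here is a minimal in-memory sketch. It is not a proposal for the actual implementation: `RepositoryWriteLock` is a hypothetical name, the lock is per process only, and questions such as lock expiry for abandoned editor sessions are left out.

```python
import threading
from typing import Optional


class RepositoryWriteLock:
    """Allows at most one editing user per repository at any time."""

    def __init__(self) -> None:
        self._guard = threading.Lock()  # protects _holder itself
        self._holder: Optional[str] = None

    def try_acquire(self, user_id: str) -> bool:
        """Return True if user_id now holds the lock (re-entrant per user)."""
        with self._guard:
            if self._holder is None or self._holder == user_id:
                self._holder = user_id
                return True
            return False

    def release(self, user_id: str) -> bool:
        """Release the lock; only the current holder may do so."""
        with self._guard:
            if self._holder == user_id:
                self._holder = None
                return True
            return False

    @property
    def holder(self) -> Optional[str]:
        with self._guard:
            return self._holder
```

The editor would call `try_acquire` when a user opens a file for editing and show the file read-only when it returns `False`, so no merge UI is needed.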

tools-soda commented 1 day ago

CLOUD-8501