Open eoriorda opened 2 years ago
Currently, access controls on Cluster 2 limit the creation of schemas, and sandbox / demo_dv seem to have a growing number of tables. Is the lack of access to create schemas preventing the migration of pipelines from Cluster 1?
What I propose is we should:
@MichaelTiemannOSC for your comments, please. @HumairAK for your information only, since you asked about planning for the retirement of Cluster 1.
@erikerlandson suggested I test out the PR mechanism to get a new schema activated, which I've done via this PR: https://github.com/operate-first/apps/pull/2112. I saw the need to request such because the open metadata pages for CL2 show many tables as "deleted", and I wasn't sure who deleted what. I wanted to ensure that I had a safe place to prototype both data ingestion and metadata.
I see your second bullet point (Each pipeline should have an assigned repository with 3 groups: admin (can write schema and do everything under), dev (cannot create / delete schema but can create table and read anything), contributors (can read)) as a minimal description of data governance from a CI/CD perspective. There's obviously a lot more that goes into lineage, provenance, quality, etc., but I agree that we should have a simple description of how data pipelines and schemas interact for data that is ingested into Trino. We'll need to do the same thing for data that's federated as well.
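To make the three-group split concrete, here is a minimal Python sketch of how the admin / dev / contributors permissions described above could be written down and checked. The names `Perm`, `GROUP_PERMS`, and `is_allowed` are hypothetical and only for illustration; this is not the actual Trino ACL configuration or the osc-trino-acl-dsl format, just one way to state the rule set unambiguously.

```python
from enum import Flag, auto


class Perm(Flag):
    """Hypothetical permission flags for a single pipeline's schema."""
    READ = auto()           # read any table in the schema
    CREATE_TABLE = auto()   # create tables inside an existing schema
    MANAGE_SCHEMA = auto()  # create / delete the schema itself


# Assumed mapping, taken from the three groups described in the comment above.
GROUP_PERMS = {
    "admin": Perm.READ | Perm.CREATE_TABLE | Perm.MANAGE_SCHEMA,
    "dev": Perm.READ | Perm.CREATE_TABLE,
    "contributors": Perm.READ,
}


def is_allowed(group: str, action: Perm) -> bool:
    """Return True if the group holds every flag in `action`."""
    granted = GROUP_PERMS.get(group, Perm(0))  # unknown groups get nothing
    return (granted & action) == action


if __name__ == "__main__":
    for group in GROUP_PERMS:
        print(group, [p.name for p in Perm if is_allowed(group, p)])
```

Writing the groups as data rather than prose means the same table could later be translated into whatever rule format the cluster actually uses.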
For both the global sandbox as well as any personal sandbox schemas, we should describe patterns/best practices for migrating tables from those into more official schema, which of course requires that the data have "owners".
@caldeirav Vincent, each pipeline owner would need to own the decision to move their pipeline. Can we identify one pipeline owner who wants to do this? @MichaelTiemann Michael, would you like to be the tip of the spear?
I think the next logical step is to do this for the first pipeline we would want to have on the so-called "stable cluster". Then whatever we define in terms of schema, organization of RBAC, etc... can become the baseline for future pipelines.
Trying to get member-led pipeline management.
Michael T and Vincent have provided feedback. We are not ready to define a prod pipeline until the team has experience in generating more pipelines and defining in totality what a pipeline should be. We need a repeatable pipeline running in dev before we start defining a prod cluster.
Michael suggested that we are getting to a point where we can transfer technical ownership of ESSD, which is a data pipeline.
@caldeirav does this touch Fybrik?
In the long run yes, because we would then transfer the definition and management of rules to Fybrik via Open Policy Agent, i.e. at the Trino layer we would likely move to grant all access. This is not short term though.
@erikerlandson any update on this one?
xref: https://github.com/os-climate/os_c_data_commons/issues/135 https://github.com/os-climate/osc-trino-acl-dsl/issues/6