os-climate / os_c_data_commons

Repository for Data Commons platform architecture overview, as well as developer and user documentation
Apache License 2.0

Developer schemas and user rights for Dev environment #148

Open eoriorda opened 2 years ago

erikerlandson commented 2 years ago

xref: https://github.com/os-climate/os_c_data_commons/issues/135 https://github.com/os-climate/osc-trino-acl-dsl/issues/6

caldeirav commented 2 years ago

Currently, access controls on Cluster 2 limit the creation of schemas, and the sandbox / demo_dv schemas seem to have a growing number of tables. Is the lack of access to schema creation preventing the migration of pipelines from cluster 1?

What I propose is we should:

@MichaelTiemannOSC for your comments, please. @HumairAK for your information only, since you asked about planning for the retirement of cluster 1.

MichaelTiemannOSC commented 2 years ago

@erikerlandson suggested I test out the PR mechanism to get a new schema activated, which I've done via this PR: https://github.com/operate-first/apps/pull/2112. I saw the need to request such because the open metadata pages for CL2 show many tables as "deleted", and I wasn't sure who deleted what. I wanted to ensure that I had a safe place to prototype both data ingestion and metadata.

I see your second bullet point (Each pipeline should have an assigned repository with 3 groups: admin (can write schema and do everything under), dev (cannot create / delete schema but can create table and read anything), contributors (can read)) as a minimal description of data governance from a CI/CD perspective. There's obviously a lot more that goes into lineage, provenance, quality, etc., but I agree that we should have a simple description of how data pipelines and schemas interact for data that is ingested into Trino. We'll need to do the same thing for data that's federated as well.
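As a rough illustration only, the three-group model described above could be sketched as follows. This is not the actual Trino ACL configuration; the `Action` names, the `GROUP_PERMISSIONS` table, and the `is_allowed` helper are hypothetical and exist only to make the intended rights of each group concrete.

```python
from enum import Enum, auto


class Action(Enum):
    """Hypothetical actions a user might attempt on a pipeline's schema."""
    CREATE_SCHEMA = auto()
    DROP_SCHEMA = auto()
    CREATE_TABLE = auto()
    READ = auto()


# Assumed mapping of the three proposed groups to their rights:
#   admin        -> can write the schema and do everything under it
#   dev          -> cannot create / delete schemas, but can create tables and read
#   contributors -> can read only
GROUP_PERMISSIONS = {
    "admin": {Action.CREATE_SCHEMA, Action.DROP_SCHEMA, Action.CREATE_TABLE, Action.READ},
    "dev": {Action.CREATE_TABLE, Action.READ},
    "contributors": {Action.READ},
}


def is_allowed(group: str, action: Action) -> bool:
    """Return True if members of `group` may perform `action`.

    Unknown groups are granted nothing.
    """
    return action in GROUP_PERMISSIONS.get(group, set())
```

A check such as `is_allowed("dev", Action.CREATE_SCHEMA)` returns `False`, which matches the intent that only the admin group of a pipeline's repository manages the schema itself.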

For both the global sandbox and any personal sandbox schemas, we should describe patterns/best practices for migrating tables from those into more official schemas, which of course requires that the data have "owners".

eoriorda commented 2 years ago

@caldeirav Vincent, each pipeline owner would need to own the decision to move the pipeline. Can we identify one pipeline owner who wants to do this? @MichaelTiemann Michael, would you like to be the tip of the spear?

caldeirav commented 2 years ago

I think the next logical step is to do this for the first pipeline we would want to have on the so-called "stable cluster". Then whatever we define in terms of schema, organization of RBAC, etc... can become the baseline for future pipelines.

eoriorda commented 2 years ago

Trying to get member-led pipeline management.
Michael T and Vincent have provided feedback. We are not ready to define a prod pipeline until the team has experience in generating more pipelines and in defining in totality what a pipeline should be. We need a repeatable pipeline running in dev before we start defining a prod cluster.

eoriorda commented 2 years ago

Michael suggested that we are getting to a point where we can transfer technical ownership of ESSD, which is a data pipeline.

HeatherAck commented 2 years ago

@caldeirav does this touch Fybrik?

caldeirav commented 2 years ago

In the long run, yes, because we would then transfer the definition and management of rules to Fybrik via Open Policy Agent, i.e. at the Trino layer we would likely move to granting all access. This is not short term, though.

HeatherAck commented 2 years ago

@erikerlandson any update on this one?