open-metadata / OpenMetadata

OpenMetadata is a unified metadata platform for data discovery, data observability, and data governance powered by a central metadata repository, in-depth column level lineage, and seamless team collaboration.
https://open-metadata.org
Apache License 2.0
5.04k stars 961 forks

Define filters, conditions and actions for different policy types #950

Closed mithmatt closed 2 years ago

mithmatt commented 2 years ago

As part of #720, the initial policy schema, entity, and API were defined.

Extend this to define filters, conditions, and actions for Lifecycle and Access Control policies, so that they can be translated directly into the policies of cloud vendors and other open source solutions.

E.g., TBAC in AWS Lake Formation: https://docs.aws.amazon.com/lake-formation/latest/dg/TBAC-overview.html

Additional notes are here: https://docs.google.com/document/d/1svyYBpSNBq3XUhALNCOuS-WTYtHNE-NMnghwszDe2G8/edit#

amiorin commented 2 years ago

Another approach is to keep policies in plain English in OM and link each of them to machine-readable policies in the target system. The plain-English policies can be assigned manually or based on matching conditions. For example: a table with a location on S3 and tier5 gets the policy "retention 90 days", which is implemented differently in each of the 3 cloud providers.
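A minimal sketch of the matching-condition idea, in Python. Everything here is an assumption for illustration: `Table`, `PlainPolicy`, `RULES`, `policies_for`, and the reference strings are hypothetical names, not part of the OpenMetadata schema or API. The plain-English policy only carries references to the provider-specific implementations; it does not try to describe them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Table:
    name: str
    location: str          # storage URI, e.g. "s3://bucket/path"
    tags: FrozenSet[str]   # tag names attached to the table


@dataclass
class PlainPolicy:
    description: str  # the policy in plain English
    # provider name -> reference to the machine-readable policy in that system
    implementations: Dict[str, str] = field(default_factory=dict)


def on_s3_with_tier5(table: Table) -> bool:
    """Matching condition: the table lives on S3 and carries the tier5 tag."""
    return table.location.startswith("s3://") and "tier5" in table.tags


RETENTION_90 = PlainPolicy(
    description="Retain data for 90 days",
    # Placeholder references; the real links would point into each target system.
    implementations={"provider_a": "hypothetical-policy-ref-a"},
)

# Each rule pairs a condition with the plain-English policy it assigns.
RULES: List[Tuple[Callable[[Table], bool], PlainPolicy]] = [
    (on_s3_with_tier5, RETENTION_90),
]


def policies_for(table: Table) -> List[PlainPolicy]:
    """Return every policy whose matching condition holds for the table."""
    return [policy for condition, policy in RULES if condition(table)]
```

Calling `policies_for(Table("sales", "s3://bucket/sales", frozenset({"tier5"})))` would return the 90-day retention policy, while a table outside S3 or without the tag would get no policy from this rule.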

I would not try to define a policy language that is a superset of the policy languages of the different clouds.

WDYT?

mithmatt commented 2 years ago

> Another approach is to keep policies in plain English in OM and link each of them to machine-readable policies in the target system. The plain-English policies can be assigned manually or based on matching conditions. For example: a table with a location on S3 and tier5 gets the policy "retention 90 days", which is implemented differently in each of the 3 cloud providers.
>
> I would not try to define a policy language that is a superset of the policy languages of the different clouds.
>
> WDYT?

This is an interesting thought.

@amiorin

How do you intend to represent "table with a location on S3 and it has tier5 -> apply policy retention 90 days" in OpenMetadata?

Doesn't this have to be in some structured form that can be translated by bots into the cloud provider specific policy language?

amiorin commented 2 years ago

> How do you intend to represent "table with a location on S3 and it has tier5 -> apply policy retention 90 days" in OpenMetadata?

One option could be a tag with a JSON schema to store the integer 90. Another option is an entity with a relationship.
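A rough sketch of the first option, in Python: a tag whose value is constrained by a small JSON schema. The names `RETENTION_TAG_SCHEMA`, `retentionDays`, `validate_retention_value`, and the shape of `tag` are all hypothetical, and the validator is a hand-rolled check that mirrors this one schema, not a general JSON Schema implementation.

```python
import json

# Hypothetical JSON schema for the value attached to a retention tag.
RETENTION_TAG_SCHEMA = {
    "type": "object",
    "properties": {"retentionDays": {"type": "integer", "minimum": 1}},
    "required": ["retentionDays"],
    "additionalProperties": False,
}


def validate_retention_value(value) -> bool:
    """Check a tag value against the constraints written in the schema above."""
    if not isinstance(value, dict) or set(value) != {"retentionDays"}:
        return False
    days = value["retentionDays"]
    # bool is a subclass of int in Python, so it is excluded explicitly.
    return isinstance(days, int) and not isinstance(days, bool) and days >= 1


# Hypothetical tag instance storing the integer 90.
tag = {"name": "retention", "value": {"retentionDays": 90}}
assert validate_retention_value(tag["value"])

# The tag stays plain JSON, so it can be stored and exchanged as-is.
serialized = json.dumps(tag)
```

The second option, an entity with a relationship, would move `retentionDays` onto a dedicated policy entity and link tables to it instead of tagging them.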

> Doesn't this have to be in some structured form that can be translated by bots into the cloud provider specific policy language?

It can have some structure in some cases. The use case I have in mind where the structure is difficult to define is fine-grained access control. Suppose an employee of a fictional online shop can only access a section of the sales table, restricted by row-level and column-level filtering. The policy in English could be "He can access toy sales data from wholesale in region NA". The data owner translates it into a policy in the DWH and associates it with the employee when a data access request is submitted. The challenge is that the filter "category = toy" can be different in every DWH: both the column name and the values can be specific to the DWH instance.
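To make the difficulty concrete, here is a small Python sketch. It assumes the translation can be captured as a per-DWH mapping from logical attributes to physical column names and values; the warehouse names, column names, and codes are invented for illustration, and in practice this mapping is what the data owner would have to supply for each instance.

```python
# Logical form of "toy sales data from wholesale in region NA".
LOGICAL_POLICY = {"category": "toy", "channel": "wholesale", "region": "NA"}

# Hypothetical per-DWH mappings: logical attribute -> (column name, value lookup).
DWH_MAPPINGS = {
    "warehouse_a": {
        "category": ("product_category", {"toy": "TOYS"}),
        "channel": ("sales_channel", {"wholesale": "WS"}),
        "region": ("region_code", {"NA": "NA"}),
    },
    "warehouse_b": {
        "category": ("cat", {"toy": "toys_and_games"}),
        "channel": ("channel", {"wholesale": "wholesale"}),
        "region": ("geo", {"NA": "north_america"}),
    },
}


def render_row_filter(policy: dict, mapping: dict) -> str:
    """Render the logical policy as a row-level predicate for one DWH."""
    clauses = []
    for attribute, logical_value in policy.items():
        column, values = mapping[attribute]
        # Double single quotes so the value is a valid SQL string literal.
        physical = values[logical_value].replace("'", "''")
        clauses.append(f"{column} = '{physical}'")
    return " AND ".join(clauses)


filter_a = render_row_filter(LOGICAL_POLICY, DWH_MAPPINGS["warehouse_a"])
filter_b = render_row_filter(LOGICAL_POLICY, DWH_MAPPINGS["warehouse_b"])
```

The same English sentence yields two different predicates, which is why a single structured form is hard to define without the instance-specific mapping.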

mithmatt commented 2 years ago

> One option could be a tag with a JSON schema to store the integer 90. Another option is an entity with a relationship.

Yes, I think this can be modeled pretty easily. I have an initial schema in place for tag-based access control, and we should be able to use a similar concept here.

> It can have some structure in some cases. The use case I have in mind where the structure is difficult to define is fine-grained access control. Suppose an employee of a fictional online shop can only access a section of the sales table, restricted by row-level and column-level filtering. The policy in English could be "He can access toy sales data from wholesale in region NA". The data owner translates it into a policy in the DWH and associates it with the employee when a data access request is submitted. The challenge is that the filter "category = toy" can be different in every DWH: both the column name and the values can be specific to the DWH instance.

Yes, fine-grained access control (select on rows when the value of column X = Y) is one of the most complex policy types to model that I have come across so far.

To begin with, I'm modeling the ones that have an easy structure.

I believe that there is a finite set of different types of policies that we'll need to model. I've prepared some notes based on my research and thought process that I'd be happy to present at a community meetup sometime in the future.