odpi / egeria

Egeria core
https://egeria-project.org
Apache License 2.0

add support for Data Class assignment #314

Closed cassioinboston closed 6 years ago

cassioinboston commented 6 years ago

In the context of this issue, a data class is an asset that categorizes database columns and data file fields according to the type of the data they hold and how the data is used. Data classification is the process of assigning a data class to a column or field. Data classification is meant to drive/guide policy enforcement.

Examples of data classes include: Date, Zip Code, Credit Card Number, Gender, Country Name, Person Name, Address, SSN, etc.

Data classes can be inferred by a process/engine or assigned by a user. A DataClass includes properties like name, description, code, type, specification (e.g. regex, Java, valid values), threshold, example, etc. It can have relationships to other DataClasses (e.g. hasSub, isSubOf).

A data class assignment includes properties like level of confidence, who assigned it, assignment method (manual, process), date, etc. A data column can have multiple data classes assigned to it, and one can be assigned as the primary data class (e.g. the one with the highest level of confidence) and others as secondary data classes.
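To make the properties above concrete, here is a minimal Python sketch of a data class and its assignments. It is not Egeria code: the class names, field names, the 0.0 to 1.0 confidence scale, the regexes, and the `split_primary` helper are all assumptions for illustration. The primary data class is picked as the assignment with the highest confidence, as suggested above.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class DataClass:
    """Hypothetical data class definition (a subset of the properties listed above)."""
    name: str
    description: str = ""
    specification: Optional[str] = None  # e.g. a regex; illustrative only


@dataclass
class DataClassAssignment:
    """Hypothetical assignment of a data class to a column or field."""
    data_class: DataClass
    confidence: float   # assumed scale: 0.0 (no confidence) to 1.0 (certain)
    assigned_by: str
    method: str         # "manual" or "process"
    assigned_on: date


def split_primary(
    assignments: List[DataClassAssignment],
) -> Tuple[Optional[DataClassAssignment], List[DataClassAssignment]]:
    """Return (primary, secondaries); the primary has the highest confidence."""
    if not assignments:
        return None, []
    ranked = sorted(assignments, key=lambda a: a.confidence, reverse=True)
    return ranked[0], ranked[1:]


# Usage: one column with two candidate data classes (made-up values).
zip_code = DataClass("Zip Code", specification=r"^\d{5}(-\d{4})?$")
ssn = DataClass("SSN", specification=r"^\d{3}-\d{2}-\d{4}$")
assignments = [
    DataClassAssignment(ssn, 0.35, "profiler", "process", date(2018, 6, 1)),
    DataClassAssignment(zip_code, 0.90, "profiler", "process", date(2018, 6, 1)),
]
primary, secondary = split_primary(assignments)
```

Highest confidence is only one possible rule for choosing the primary; a user could equally set it by hand.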

We have two main alternatives to add data class and data classification as defined above to the egeria open metadata type system:

With the first option (data class as a classification), a data class assignment would be modeled as a Classification of a Referenceable and would flow automatically with the entity it's associated with. This would be analogous to other types of governance classifications defined in "0422 Governance Action Classifications", like Confidentiality, Retention, etc. Properties of a data class assignment would be stored as properties of the DataClass classification. Properties of the DataClass would need to be repeated in every classification for the same data class. There can be only one classification of a particular type associated with an entity, so a different type of DataClass classification would have to be added for each data class in order to support assignment of multiple data classes. Hierarchy of data classes (e.g. VisaCardNumber is a sub-class of CreditCardNumber) would not be naturally modeled.
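The one-classification-per-type restriction is the main limitation here, so a small Python sketch may help. This is a toy model, not the Egeria repository API: the `Entity` class, the `classify` method, and the type names are hypothetical.

```python
class Entity:
    """Toy Referenceable-like entity: at most one classification per type name."""

    def __init__(self, name):
        self.name = name
        self.classifications = {}  # classification type name -> properties

    def classify(self, type_name, properties):
        # Mirrors the rule described above: one classification of a given type per entity.
        if type_name in self.classifications:
            raise ValueError(f"{self.name} already has a {type_name} classification")
        self.classifications[type_name] = dict(properties)


column = Entity("CUSTOMER.CARD_NO")
column.classify("DataClass", {"name": "Credit Card Number", "confidence": 0.9})

# A second data class cannot reuse the same classification type.
try:
    column.classify("DataClass", {"name": "Visa Card Number", "confidence": 0.7})
    second_attached = True
except ValueError:
    second_attached = False

# Workaround: a separate classification type per data class, which also means
# the definition properties (name, specification, ...) are repeated on every column.
column.classify("VisaCardNumberDataClass", {"name": "Visa Card Number", "confidence": 0.7})
```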

With the second option (data class as an entity), a data class would be modeled as an entity in its own right, and a data class assignment would be modeled as a relationship between a DataClass entity and a Referenceable entity. This would be analogous to the way glossary terms are assigned to referenceables in "0370 Semantic Assignment". The DataClass properties would be stored with the data class entities, of which there could be a number of pre-defined instances defining data class hierarchies for different contexts (industries), analogous to glossaries, out of the box. New data classes can be added to existing data class hierarchies, and new data class hierarchies can be defined by applications.
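A rough Python sketch of the entity-plus-relationship shape follows. Again, this is illustrative only: `DataClassEntity`, `Assignment`, the `parent` link (standing in for isSubOf), and the sample values are assumptions, not Egeria types.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DataClassEntity:
    """Hypothetical DataClass entity; the definition is stored once and shared."""
    name: str
    specification: Optional[str] = None
    parent: Optional["DataClassEntity"] = None  # stands in for an isSubOf relationship

    def lineage(self) -> List[str]:
        """Names from this data class up to the root of its hierarchy."""
        names, node = [], self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names


@dataclass
class Assignment:
    """Hypothetical relationship between a column and a DataClass entity."""
    column: str
    data_class: DataClassEntity
    confidence: float


credit_card = DataClassEntity("CreditCardNumber")
visa = DataClassEntity("VisaCardNumber", parent=credit_card)

# Several relationships can point at the same column, and several columns
# can share one DataClass entity without copying its properties.
assignments = [
    Assignment("CUSTOMER.CARD_NO", visa, 0.8),
    Assignment("CUSTOMER.CARD_NO", credit_card, 0.95),
    Assignment("ORDERS.PAYMENT_CARD", credit_card, 0.9),
]
classes_for_card_no = [a.data_class.name for a in assignments if a.column == "CUSTOMER.CARD_NO"]
```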

Tags were originally modeled as classifications and later converted to entities. Given the above, it seems that data classes should be defined as entities and data classification as relationships, for similar reasons.

Data class and classification model elements could be defined in area 4 (Governance) or area 6 (Discovery).

mandy-chessell commented 6 years ago

Another approach is to add a new classification to the glossary term called "DataClass". Then use the existing SemanticAssignment relationship to link it to a db column. Inheritance relationships can be handled using the existing glossary relationships and discovery annotations can use the existing glossary annotations.

cassioinboston commented 6 years ago

@mandy-chessell I guess you're suggesting we use the ISARelationship relationship between GlossaryTerms to model data class hierarchies, and that we store properties that are specific to a data class in the "DataClass" classification (like code, specification) associated with the entity, and use the ValidValues relationships when a data class is specified via valid values (what are valid values for a GlossaryTerm btw?). I wonder if you had classification in the back of your mind when defining the Glossary types. You could have defined DataClass instead of GlossaryTerm and in that case we would use a GlossaryTerm classification to define terms. We can use that approach to model any type of semantic assignment.

While I see how that approach could work, I'm not sure this would be the most natural or intuitive way to define those types, and it would require applications to filter GlossaryTerm instances based on the classification to avoid mixing semantics (something the OMAS level could maybe do). Since this is a new type system being defined from scratch, if we decide to follow that approach I'd like to understand why we think it would be a better one compared to defining DataClass as a separate entity.
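The filtering concern can be sketched in a few lines of Python. The dictionaries below are a made-up stand-in for GlossaryTerm instances returned by some query; the keys, the sample terms, and the helper names are hypothetical and do not reflect an actual OMAS interface.

```python
# Made-up GlossaryTerm instances; only one carries the "DataClass" classification.
glossary_terms = [
    {"name": "Customer", "classifications": {}},
    {
        "name": "Credit Card Number",
        "classifications": {
            "DataClass": {"code": "CCN", "specification": r"^\d{13,19}$"},
        },
    },
    {"name": "Churn Rate", "classifications": {}},
]


def select_data_classes(terms):
    """Keep only the terms that carry the DataClass classification."""
    return [t for t in terms if "DataClass" in t["classifications"]]


def select_plain_terms(terms):
    """Keep only ordinary glossary terms (no DataClass classification)."""
    return [t for t in terms if "DataClass" not in t["classifications"]]


data_class_names = [t["name"] for t in select_data_classes(glossary_terms)]
plain_term_names = [t["name"] for t in select_plain_terms(glossary_terms)]
```

Every consumer would need one of these two filters, which is the part an OMAS might hide.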

Perhaps another approach could involve defining an abstract entity type, say Classifier, from which both GlossaryTerm and DataClass entity types could inherit. The SemanticAssignment relationship would be defined at the Classifier super type, and SemanticAssignment would become a super type of GlossaryTermAssignment and DataClassAssignment, so that properties and enumerations specific to each assignment could be defined separately and common properties inherited.
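As a sketch of that inheritance idea in Python (the type names come from the paragraph above, but the attributes, the `classifier_type` check, and the `method` property are assumptions added for illustration):

```python
class Classifier:
    """Abstract super type for anything that can be semantically assigned."""

    def __init__(self, name):
        if type(self) is Classifier:
            raise TypeError("Classifier is abstract; use GlossaryTerm or DataClass")
        self.name = name


class GlossaryTerm(Classifier):
    pass


class DataClass(Classifier):
    def __init__(self, name, specification=None):
        super().__init__(name)
        self.specification = specification  # property specific to data classes


class SemanticAssignment:
    """Common relationship: links an element to a Classifier."""
    classifier_type = Classifier

    def __init__(self, element, classifier, confidence=None):
        if not isinstance(classifier, self.classifier_type):
            raise TypeError(f"expected a {self.classifier_type.__name__}")
        self.element = element
        self.classifier = classifier
        self.confidence = confidence  # common property, inherited by subtypes


class GlossaryTermAssignment(SemanticAssignment):
    classifier_type = GlossaryTerm


class DataClassAssignment(SemanticAssignment):
    classifier_type = DataClass

    def __init__(self, element, classifier, confidence=None, method="process"):
        super().__init__(element, classifier, confidence)
        self.method = method  # property specific to data class assignments


ok = DataClassAssignment("CUSTOMER.ZIP", DataClass("Zip Code"), confidence=0.9)
try:
    DataClassAssignment("CUSTOMER.ZIP", GlossaryTerm("Postal Code"))
    mixed_allowed = True
except TypeError:
    mixed_allowed = False
```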

I think you are in a better position to judge which approach we should follow based on the requirements and the patterns used in other parts of the Egeria type system.

mandy-chessell commented 6 years ago

Hello Cassio, I am happy to go with the data class as a separate entity if you feel it is conceptually different from a glossary term. If we put this in Area 6 (Discovery) then it would naturally be considered a subclass of Annotation? This means it can be assigned to an asset (or schema element) through the Discovery Engine OMAS (used in the discovery server/open discovery framework). It would then be assigned to the appropriate schema element.

Strictly speaking, the annotation is just a suggestion from the discovery service. Other annotations (like the glossary term annotation) are converted into approved entities by a stewardship process. (This could be manual or automated.) If we were to follow this pattern then in area 6 we would have something like a DataClassAnnotation that the discovery service attaches to schema columns etc. There would also be the proper DataClass entity (probably in area 4 (Governance) as you suggest) and a relationship to a schema element. This represents the approved data classes. They can be added directly or through a stewardship process that converts the DataClassAnnotation to a DataClass. Does that give you the flexibility that you need?
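The annotation-then-approval flow could look roughly like this in Python. This is a sketch under assumptions, not the Discovery Engine OMAS or any existing stewardship API: the class shapes, the `steward` function, the 0.8 threshold, and the "auto-steward" name are all invented for illustration of an automated stewardship rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataClassAnnotation:
    """Suggestion produced by a discovery service (hypothetical shape)."""
    schema_element: str
    data_class_name: str
    confidence: float


@dataclass
class ApprovedDataClassAssignment:
    """Approved link between a schema element and a data class (hypothetical shape)."""
    schema_element: str
    data_class_name: str
    confidence: float
    approved_by: str


def steward(
    annotations: List[DataClassAnnotation],
    threshold: float = 0.8,          # assumed cut-off for automatic approval
    approved_by: str = "auto-steward",
) -> List[ApprovedDataClassAssignment]:
    """Convert suggestions at or above the threshold into approved assignments."""
    return [
        ApprovedDataClassAssignment(a.schema_element, a.data_class_name, a.confidence, approved_by)
        for a in annotations
        if a.confidence >= threshold
    ]


suggestions = [
    DataClassAnnotation("CUSTOMER.ZIP", "Zip Code", 0.93),
    DataClassAnnotation("CUSTOMER.ZIP", "SSN", 0.20),
    DataClassAnnotation("CUSTOMER.DOB", "Date", 0.80),
]
approved = steward(suggestions)
```

A manual stewardship process would replace the threshold test with a human decision; suggestions that are not approved simply remain annotations.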

Also - what attributes would you like in the DataClass/DataClassAnnotation - and relationships?