Closed tomtor closed 2 years ago
@tomtor Can you elaborate on what the product entity will contain and what it will represent? I think adding a new entity has become simpler with the new code rebase. cc @sureshms
That's interesting. I started on adding a Thesaurus entity: https://github.com/tomtor/OpenMetadata/commit/77198d3a2342dd842df47fe9b443ad401da9c3b1 It has a SKOS property which would contain JSON-LD data. I get a 500 error when feeding test data to the server, so I probably have to add some Java code to handle REST calls to the thesauruses API.
After this I figured out that having a generic Entity to store metadata encoded in a JSON representation would be more flexible and require less code. The best name I could think of is product, but perhaps a better name could be chosen.
So replace SKOS with definition and thesaurus with product in https://github.com/tomtor/OpenMetadata/blob/77198d3a2342dd842df47fe9b443ad401da9c3b1/catalog-rest-service/src/main/resources/json/schema/api/data/createThesaurus.json to get the idea.
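To make the idea a bit more concrete, here is a minimal Python sketch of what a create request for such a generic entity could carry. Only the definition attribute comes from the proposal; the class name CreateProduct and the other fields (name, description, tags) are assumptions for illustration, not the actual OpenMetadata schema.

```python
import json
from dataclasses import dataclass, field


@dataclass
class CreateProduct:
    """Hypothetical create request; only `definition` comes from the proposal."""
    name: str
    definition: str                       # JSON-encoded metadata, stored as an opaque string
    description: str = ""                 # assumed field
    tags: list = field(default_factory=list)  # assumed field, e.g. ["thesaurus"]

    def parsed_definition(self):
        # Raises json.JSONDecodeError if the stored string is not valid JSON.
        return json.loads(self.definition)


request = CreateProduct(
    name="engine-types",
    definition=json.dumps({"allowedValues": ["petrol", "diesel", "electric"]}),
    tags=["thesaurus"],
)
print(request.parsed_definition()["allowedValues"])
```

Because the definition is just a string, the server would not need entity-specific code per metadata type; the trade-off is that nothing validates its contents unless a schema is attached separately.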
@harshach Some more general remarks/questions about metadata in Open-Metadata.
Say we have a database table with cars and one of the columns is named EngineType. This can be 'petrol,diesel,electric,hybrid,unknown' (a code list).
How/where would we store this and how do we link this to the column?
To partially answer my own question, we can in the current release add textual descriptions which contain text and URLs.
I think the GUI and the generic metadata model should also allow the user to add links (EntityRefs) from any Entity to any other Entity.
@tomtor, can you add details on what a thesaurus entity is in the context of OpenMetadata? (thesaurus.json in the current patch does not describe what that entity is; see table.json for how to add details.) It might be good to write a simple one-pager on what it is, how it will be used, and how it is surfaced in the UI.
Given that not many people (including me) understand SKOS, JSON-LD, etc., this might be a great topic to cover in our biweekly meetings to get everyone up to speed. WDYT?
@sureshms Sure, the point is that currently OpenMetadata only allows tagging and some text with URLs embedded to add additional information to datasets. In an organization people are often working with data related to specific real world objects or concepts.
Suppose a column in a dataset is named, e.g., MTBF or Mean-Time-Between-Failure. A user could add a textual description to clarify how MTBF is defined or calculated (this is an example of metadata) and how it relates to other quality aspects. This is important for users of the data, because without this info (metadata) they could use the data incorrectly. But what if many different datasets from a department contain MTBF data? Would the user have to copy this text everywhere in OpenMetadata?
We need a central place to store this metadata. The idea is that we can define entities which contain a SKOS definition. A department/team would create a SKOS Thesaurus entity to store descriptions of concepts, how they relate, a collection of allowed values (a bit like a JSON-Schema enum), etc. These could also be created with a connector which reads this data from another system.
The Thesaurus create API takes a parameter with the thesaurus definition data as a (large) string in JSON or another format. We could design our own JSON-Schema for this, but it is better to support an existing standard (SKOS). A thesaurus entity displayed in the GUI would show a list of the concepts defined in it. In its simplest form each concept can have an HTML anchor, and the user could copy the URL which points to this entry in the thesaurus. This URL can then be pasted, e.g., into the description of a dataset column.
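As a rough illustration of the anchor idea, the Python sketch below builds a link to a single concept from the thesaurus page URL and the concept label. The host name, the URL layout and the slug rule (lowercase, spaces to hyphens) are all assumptions made for the example; nothing here reflects an existing OpenMetadata route.

```python
from urllib.parse import quote


def concept_url(thesaurus_url: str, concept_label: str) -> str:
    """Return a link to one concept: the thesaurus page URL plus a fragment."""
    # Assumed slug rule: lowercase, spaces become hyphens, the rest is percent-encoded.
    slug = quote(concept_label.strip().lower().replace(" ", "-"))
    return f"{thesaurus_url}#{slug}"


# Hypothetical thesaurus page; the resulting URL could be pasted into a column description.
url = concept_url(
    "https://metadata.example.org/thesauruses/reliability",
    "Mean Time Between Failure",
)
print(url)
```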
https://www.w3.org/TR/skos-primer/#secintro is a good introduction to SKOS.
A quote from that introduction:
SKOS can also be seen as a bridging technology, providing the missing link between the rigorous logical formalism of ontology languages such as OWL and the chaotic, informal and weakly-structured world of Web-based collaboration tools, as exemplified by social tagging applications. The aim of SKOS is not to replace original conceptual vocabularies in their initial context of use, but to allow them to be ported to a shared space, based on a simplified model, enabling wider re-use and better interoperability.
Open-source SKOS publishing software exists (SKOSMOS); see e.g. https://skosmos.v7f.eu/unesco/nl/page/C00737?clang=en for an example of a huge thesaurus. I would expect thesauruses in OpenMetadata to be small and tied to a company or team, but it would be nice to integrate this higher-level metadata into OpenMetadata.
@tomtor, thank you for the explanation.
I am very happy that you have thought this through and added a lot of code. I hope my comments don't discourage you. We have two types of tags: Descriptive and Classification. An example of a Descriptive tag is User; examples of Classification tags are Tiers, PersonalData and PII. Descriptive tags can be added to describe things such as MTBF and be attached to fields of an entity. And then there are the business glossaries that the data world is used to. Some systems also use tags as a business glossary.
We did try to use RDF, OWL, JSON-LD, etc. at Uber in the past. Most developers lacked knowledge about them and pushed back in favor of the simplicity of JSON schema + descriptions. I understand the benefits of what you are trying to do, but it comes at a huge cost in complexity for our users, and I fear they would not be able to use it. Rather than formally defining concepts and tying them to other standard concepts to organize knowledge, simply describing them in one place and making them reusable will keep things simple for our users.
Perhaps when we are more mature and have solved basic problems, evolving toward a knowledge graph, the semantic web, etc. might be more appropriate. But when I read OWL, RDF, JSON-LD, and the complex syntax, I feel simply overwhelmed. This will be true for most of our users.
What do you think?
@sureshms I agree, I am not a linked-data person myself and I am all for a conceptually simple approach.
The only reason I am proposing doing something with SKOS is that it is an existing standard in which concepts are defined in some organizations.
We do not need to support the full Linked-Data machinery. Translating SKOS to our own JSON-Schema based tagging approach is sufficient. End-users will never see RDF.
We should think of a thesaurus as simply a collection of tags.
A business glossary is just a thesaurus.
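A minimal Python sketch of such a translation follows. It assumes a compacted JSON-LD document with a top-level @graph in which each concept carries plain-string skos:prefLabel and skos:definition values; real SKOS exports often use language maps or expanded forms, which this sketch does not handle. The tag dictionary layout (name, description) is also an assumption.

```python
import json

# Tiny hand-written example document (assumed shape, not a real export).
SKOS_JSONLD = """
{
  "@context": {
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "ex": "http://example.org/"
  },
  "@graph": [
    {"@id": "ex:reliability", "@type": "skos:ConceptScheme"},
    {"@id": "ex:mtbf", "@type": "skos:Concept",
     "skos:prefLabel": "MTBF",
     "skos:definition": "Mean time between failures of a repairable system."}
  ]
}
"""


def skos_to_tags(jsonld_text: str) -> list:
    """Turn every skos:Concept node into a simple tag dictionary."""
    doc = json.loads(jsonld_text)
    tags = []
    for node in doc.get("@graph", []):
        if node.get("@type") != "skos:Concept":
            continue  # skip schemes, collections and anything else
        tags.append({
            "name": node["skos:prefLabel"],
            "description": node.get("skos:definition", ""),
        })
    return tags


tags = skos_to_tags(SKOS_JSONLD)
print(tags)
```

The point of the sketch is that only the JSON structure is read; no RDF library or reasoning is involved, so end users would only ever see the resulting tags.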
An elegant solution would be to extend the current OpenMetadata tagging implementation with tags of the form thesaurus:concept. When users click the tag they see the definition of this tag and the entities tagged with it.
We could also use the name glossary instead of thesaurus to make it a bit less formal?
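The Python sketch below illustrates how tags of the form thesaurus:concept could be split and resolved into what a user would see after clicking one. The in-memory dictionaries, the function names and the example entity names are hypothetical stand-ins; the existing OpenMetadata tagging code is not shown here.

```python
# Hypothetical in-memory stores standing in for the tag catalog and the tag index.
definitions = {"reliability:MTBF": "Mean time between failures of a repairable system."}
tagged_entities = {}


def split_tag(tag: str) -> tuple:
    """Split 'thesaurus:concept' into its two parts, rejecting malformed tags."""
    thesaurus, sep, concept = tag.partition(":")
    if not (sep and thesaurus and concept):
        raise ValueError(f"expected 'thesaurus:concept', got {tag!r}")
    return thesaurus, concept


def add_tag(entity: str, tag: str) -> None:
    split_tag(tag)  # validate the tag before storing it
    tagged_entities.setdefault(tag, []).append(entity)


def tag_page(tag: str) -> dict:
    """What a user would see after clicking a tag: its definition and tagged entities."""
    thesaurus, concept = split_tag(tag)
    return {
        "thesaurus": thesaurus,
        "concept": concept,
        "definition": definitions.get(tag, ""),
        "entities": list(tagged_entities.get(tag, [])),
    }


add_tag("fleet.pumps.mtbf_hours", "reliability:MTBF")
add_tag("fleet.valves.mtbf_hours", "reliability:MTBF")
page = tag_page("reliability:MTBF")
print(page)
```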
> Perhaps when we are more mature and have solved basic problems
I am missing some more basic entities in OpenMetadata.
A simple one is REST services implemented by teams in the organization. These entities should at least have info about the endpoints and an optional Swagger URL. Shall I create an issue for this?
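As a rough sketch of the minimum such an entity might hold, the Python below models a service with a list of endpoints and an optional Swagger URL. All class and field names (ApiService, ApiEndpoint, owner_team, and so on) are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ApiEndpoint:
    method: str   # HTTP method, e.g. "GET"
    path: str     # path relative to the service base URL


@dataclass
class ApiService:
    """Hypothetical REST-service entity: endpoints plus an optional Swagger URL."""
    name: str
    owner_team: str
    endpoints: list = field(default_factory=list)
    swagger_url: Optional[str] = None  # optional, as in the proposal

    def endpoint_summary(self) -> list:
        return [f"{e.method} {e.path}" for e in self.endpoints]


service = ApiService(
    name="car-catalog",
    owner_team="fleet",
    endpoints=[ApiEndpoint("GET", "/cars"), ApiEndpoint("GET", "/cars/{id}")],
)
print(service.endpoint_summary())
```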
A more complex problem is documenting the relation between datasets (or entities in general). I think it's fantastic that OpenMetadata uses graphs of JSON-schemas. That's why it is conceptually superior to other solutions, and it allows us to extend the system in an elegant way.
Regarding models, I think it would be better if the current Model were named ML-Model, because there are many kinds of models and this is confusing. Adding a generic JSON-Model entity would allow users of OpenMetadata to extend the system with new types of entities (simply by tagging the JSON-Model entity) without writing lots of code. I underestimated the amount of code needed to add my concept Thesaurus entity. If we add a JSON-schema visualization (#1011), OpenMetadata would be an even better environment to work with metadata.
A JSON-Model entity would have two attributes (and the standard OpenMetadata attributes):
One of the two may be empty. (We could also model two entities: a JSON-Object entity with an optional reference to a JSON-Schema entity.) The GUI would show the JSON-Object and the JSON-Schema in a graphical way. If the JSON-Object has Entity references, the user can select them and the browser will show the referenced Entity.
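A minimal Python sketch of such an entity follows. It assumes, based on the paragraph above, that the two attributes are a JSON object and a JSON Schema; the attribute names and the simple required-key check are illustrative only and do not perform full JSON Schema validation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JsonModel:
    """Hypothetical entity holding a JSON object, a JSON Schema, or both."""
    name: str
    json_object: Optional[dict] = None   # instance data; may contain entity references
    json_schema: Optional[dict] = None   # schema describing such instances

    def __post_init__(self):
        # One of the two may be empty, so only reject the case where both are missing.
        if self.json_object is None and self.json_schema is None:
            raise ValueError("JsonModel needs a json_object, a json_schema, or both")

    def missing_required(self) -> list:
        """Top-level keys the schema lists under 'required' that the object lacks."""
        if self.json_object is None or self.json_schema is None:
            return []
        return [k for k in self.json_schema.get("required", []) if k not in self.json_object]


model = JsonModel(
    name="engine-type",
    json_object={"label": "EngineType"},
    json_schema={"type": "object", "required": ["label", "allowedValues"]},
)
print(model.missing_required())
```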
> @sureshms I agree, I am not a linked-data person myself and I am all for a conceptually simple approach.
> The only reason I am proposing doing something with SKOS is that it is an existing standard in which concepts are defined in some organizations.
> We do not need to support the full Linked-Data machinery. Translating SKOS to our own JSON-Schema based tagging approach is sufficient. End-users will never see RDF.
> We should think of a thesaurus as simply a collection of tags.
> A business glossary is just a thesaurus.
This sounds great.
> An elegant solution would be to extend the current OpenMetadata tagging implementation with tags of the form thesaurus:concept. When users click the tag they see the definition of this tag and the entities tagged with it.
> We could also use the name glossary instead of thesaurus to make it a bit less formal?
Naming it as glossary would make it less formal and more accessible to our users. 👍
> A simple one is REST-services which are implemented by teams in the organization. These entities should at least have info about the endpoints and an optional swagger URL. I can create an issue for this?
You are suggesting making API endpoints available as discoverable entities. Correct? How about modeling this as a service, where the service has an API endpoint? We already have many services (database, messaging, dashboard etc.). This could be a type of service (not sure what a good name is: microservice, API service...). What do you think?
> A more complex problem is documenting the relation between datasets (or entities in general). I think it's fantastic that OpenMetadata uses graphs of JSON-schemas. That's why it is conceptually superior to other solutions and it allows us to extend the system in an elegant way.
Relationships can definitely have tags for conceptual binding, as you are proposing. What do you think? They can be class-level properties (from the JSON schema for entity X to the JSON schema for entity Y) rather than instance-level properties.
> Regarding models, I think it would be better if the current Model would be named ML-Model because there are a lot of models and this is confusing. Adding a generic JSON-Model entity would allow users of OpenMetadata to extend the system with new types of entities (simply by tagging the JSON-Model entity) without writing lots of code. I underestimated the amount of code needed to add my concept Thesaurus entity. If we add a JSON-schema visualization (#1011) OpenMetadata would be an even better environment to work with metadata.
Model is already named MLModel. It would be great to discuss the details of JSON-Model entities and how they will work. One thing I worry about is that opening up the addition of new types and entities can result in poorly designed or poorly documented types, leading to duplicate and inconsistent types. This would run against the spirit of standardization and make tool interoperability harder. My initial thought was: let's work in the community and design all the entities. Then let's add extension points to existing entities. And as we learn more about how users are using the system, let's explore allowing entities/types to be added programmatically. Thoughts?
> Then let's add extension points to existing entities
@sureshms @harshach Yes, the option to extend existing entities without touching the Java codebase (or with just minimal additions) would be great.
Suppose we would like to extend, e.g., the current table Entity with some additional GUI elements (e.g. buttons which link to URIs, or even a visual presentation of the sample data records). I wonder if the React GUI could be passed the Entity JSON and the sample data (or just the API URLs to retrieve these), and if we could define extension JavaScript code based on assigned tags. Each tag could have an optional JavaScript extension script. For security reasons only a certain class of users would be allowed to add these extensions.
I have no React experience, but I assume it should be possible to add JavaScript hooks which execute when a GUI element is shown.
> It would be great to discuss the details of JSON-Model entities and how it will work
@sureshms @harshach I noticed https://github.com/open-metadata/OpenMetadata/issues/1159
This Model could be used as a JSON-Model?
What is missing? The GUI which shows the JSON? The option to tag the Model as a JSON-Schema Entity or as an instance of a model with an EntityRef to the schema?
> You are suggesting API endpoints available as entities that are discoverable. Correct? How about modeling this as a service with service has an API endpoint? We already have many services (database, messaging, dashboard etc.). This could be a type of service (not sure what a good name is - microservice, API service...). What do you think?
@sureshms Sounds good, API (service) would be a good name?
@tomtor sorry for the delay in responding. #1159 is adding the DBT model as Model. Let me know if we should rename it.
> #1159 is adding DBT model as model. Let me know if we should rename it.
@harshach The name Model is fine if we could extend it to various models.
The model is currently stored in a string, which is a nice generic type. How will it be visualized in the GUI?
If we can have a default visualization, then we could extend the visualization based on TAGS and/or an added model-type attribute? I could then also use it to implement Glossary/Thesaurus instead of the current draft PR https://github.com/open-metadata/OpenMetadata/pull/1087 and, in the future, other models just by adding the visualization.
Edit: Model could also support https://github.com/open-metadata/OpenMetadata/issues/1275
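The default-plus-extension idea can be sketched as a small registry keyed on the model-type attribute, shown here in Python for brevity even though the real GUI is React. The names VIEWS, register and render, and the "concepts" key in the glossary example, are hypothetical.

```python
import json

VIEWS = {}  # model-type -> rendering function


def register(model_type):
    """Decorator that registers a specialised view for one model type."""
    def decorator(func):
        VIEWS[model_type] = func
        return func
    return decorator


def default_view(definition: str) -> str:
    # Fallback for any model stored as a JSON string: pretty-print it.
    return json.dumps(json.loads(definition), indent=2)


@register("glossary")
def glossary_view(definition: str) -> str:
    # Assumes the glossary definition holds a "concepts" list (illustrative shape).
    return "\n".join(sorted(json.loads(definition)["concepts"]))


def render(definition: str, model_type=None) -> str:
    """Use the view registered for model_type, or the default when none exists."""
    return VIEWS.get(model_type, default_view)(definition)


print(render('{"concepts": ["MTTR", "MTBF"]}', "glossary"))
print(render('{"a": 1}'))
```

Adding support for another kind of model would then mean registering one more view, without touching the storage format.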
> Name Model is fine if we could use it to extend to various models.
@harshach I now see that Model is already quite specific, with database column handling, so it is not really suited as a generic Model. So perhaps a better name is SchemaModel or just Schema?
Closing this issue
I would like to enter metadata in OpenMetadata for which currently no entity is defined, e.g.:
A practical approach would be to define a generic Product entity, a bit similar to the existing (ML)Model, but instead of an algorithm attribute I would add a definition attribute which stores a JSON definition.
A tag can be used to specify the type of instances, e.g. product or thesaurus.
@harshach Is this a good idea?