pietermartin / sqlg

TinkerPop graph over sql
MIT License

Meta data on properties #67

Closed pietermartin closed 2 years ago

pietermartin commented 8 years ago

[Image: sqlgmetamodel]

This issue references #64 as a place to discuss the feature and other features of properties.

Unfortunately I have only briefly looked at what @metlos has done so far. Here are some comments on ideas I had.

My ideas are primarily inspired by the UML meta model.

A property in UML has the meta properties multiplicity, isUnique and isOrdered. I envisage adding these to sqlg where it makes sense.

Sqlg has its own meta model. Basically the property graph meta model is captured as a first class citizen in the sqlg_schema:

schema --> vertex -- inEdges  --> edge
                  -- outEdges --> edge
                  --> property
           edge   --> property

So on the property I would like to capture and support the multiplicity, isUnique and isOrdered meta properties.

isUnique

#64 deals with isUnique.

Only it does not make isUnique a property of the property but instead makes UniqueConstraint a first class citizen on the meta model. I would like to avoid that if possible.

However #64 has the additional semantics that a UniqueConstraint can be applied to many or any/all Vertices or Edges. To be honest this is something I have never even thought of. It kinda breaks the meta model as it exists now, in that a property always belongs to only one vertex/edge label.

To implement this a separate table is created with the unique constraint on it. So although I can see that this is the only way outside of complex triggers to support this feature it is not really acceptable for the simple case of just a normal unique constraint on a property. Mostly because it will most likely have a significant performance impact. Currently it requires an additional table plus constraint.

For the simple case a normal unique index on the property will suffice. As I understand there is no need for an additional table nor constraint for the simplest case.
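A minimal sketch of what the simple per-label case could look like as DDL. The `V_` table prefix, the index naming scheme and the `UniqueIndexDdl` helper are assumptions made for illustration; this is not Sqlg's actual API.

```java
final class UniqueIndexDdl {
    // Quote an identifier PostgreSQL-style, doubling any embedded double quotes.
    static String quote(String identifier) {
        return "\"" + identifier.replace("\"", "\"\"") + "\"";
    }

    // Build a unique index statement for one property of one vertex label.
    // Assumption: the vertex table is named "V_" + label.
    static String uniqueIndex(String schema, String vertexLabel, String property) {
        String table = "V_" + vertexLabel;
        String indexName = table + "_" + property + "_uidx"; // hypothetical naming scheme
        return "CREATE UNIQUE INDEX " + quote(indexName)
                + " ON " + quote(schema) + "." + quote(table)
                + " (" + quote(property) + ")";
    }
}
```

The point of the sketch is that the constraint lives on the label's own table, so no extra table or join is involved.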

Regarding specifying a uniqueness constraint on all properties with a particular name regardless of the vertex/edge label, can you, @metlos, tell me your use case? I understand it as a form of inheritance; however, I generally resolve that higher up the stack.

multiplicity and isOrdered

The idea here is set semantics.

If isUnique=true, isOrdered=false and multiplicity>0 then a Set is returned.
If isUnique=true, isOrdered=true and multiplicity>0 then an OrderedSet is returned.
If isUnique=false, isOrdered=true and multiplicity>0 then a List is returned.
If isUnique=false, isOrdered=false and multiplicity>0 then a Bag is returned.

Further a multiplicity>0 implies a not null constraint on the property. Basically it implies a required property.
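The mapping from the three meta properties to a collection kind can be sketched as below. The `PropertySemantics` class and its method names are hypothetical. The case multiplicity=0 is not covered by the rules above, so the sketch returns an empty `Optional` for it rather than guessing.

```java
import java.util.Optional;

final class PropertySemantics {
    enum CollectionKind { SET, ORDERED_SET, LIST, BAG }

    // Derive the collection kind from isUnique/isOrdered, but only when multiplicity > 0.
    static Optional<CollectionKind> kindOf(boolean isUnique, boolean isOrdered, int multiplicity) {
        if (multiplicity <= 0) {
            return Optional.empty(); // not specified in the discussion
        }
        if (isUnique) {
            return Optional.of(isOrdered ? CollectionKind.ORDERED_SET : CollectionKind.SET);
        }
        return Optional.of(isOrdered ? CollectionKind.LIST : CollectionKind.BAG);
    }

    // multiplicity > 0 implies a required (NOT NULL) property.
    static boolean isRequired(int multiplicity) {
        return multiplicity > 0;
    }
}
```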

isUnique, isOrdered and multiplicity can also be added to the `inEdges` or `outEdges` meta edges in `sqlg_schema`.

The idea here is to add additional semantics to edges. For TinkerPop all relationships between vertices are many to many. By adding isUnique, isOrdered and multiplicity this can be semantically richer supporting one to many including list/set/orderedset and bag semantics.

The addition of the topology as a first class citizen in sqlg is the first step towards implementing these ideas.

metlos commented 8 years ago

I came to Sqlg from Titan and there, if you define a unique index on a property, then any vertex with any label is guaranteed to have a unique value (if any) of the property with the given name.

It actually took me a while to get used to the different - per label - understanding of the properties that Sqlg uses :)

Therefore I refined my initial UC impl from the Titan-like behavior to also include the more sqlg-like "unique-on-specific-labels". I guess I should have refined it even further to get to your simplest, and frankly most intuitive, understanding of how uc's should work.

But having a unique index spanning many vertex labels has an effect that I actually use heavily: with it you can make sure that the value of a certain property is unique across all vertices with those labels.

To describe what I use it for: in Hawkular Inventory, we store information about the monitored systems. We assume the monitored entities are hierarchical (form trees), which we model by connecting them with "contains" edges that we enforce not to form cycles or diamonds in the graph. Additionally, we allow other types of edges between our entities that are not this strict. Having the tree structure allows us to "address" each item in the tree by its path (path segment names are user-defined). We therefore actually have a unique index on this property to ensure that each entity in our tree is uniquely addressable by its path. This enables quick look-up of the entities by their paths, which is the usual way our traversals start.

This is only possible if all the different types of entities actually are modeled using a single vertex label, which I assume is not ideal performance-wise in Sqlg, or if the index on the property spans multiple labels.

Another option would I guess be to move this index to application logic somewhat and first store a "path vertex" that would only store the path and have a unique index on it and if that succeeded, continue with storing the actual vertex (which actually is kinda the same as what my unique constraint impl does under the hood).
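The "path vertex" idea can be sketched in memory as follows: reserve the path first, and only store the actual vertex if the reservation succeeds. `PathRegistry` is a hypothetical stand-in for a table with a unique index on the path; it is not part of Sqlg or Hawkular.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class PathRegistry {
    // path -> label of the vertex that owns it; plays the role of the unique-indexed table.
    private final Map<String, String> owners = new ConcurrentHashMap<>();

    // Returns true if the path was free and is now reserved for the given label,
    // false if some vertex (of any label) already uses it.
    boolean reserve(String path, String label) {
        return owners.putIfAbsent(path, label) == null;
    }

    // Label of the vertex owning the path, or null if the path is unknown.
    String labelOf(String path) {
        return owners.get(path);
    }
}
```

A caller would invoke `reserve` before storing the real vertex and abort the write when it returns false, which gives uniqueness across labels without a multi-label index.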

pietermartin commented 8 years ago

Ok. To clarify, the "path" that you address an entity by is a normal property? You are not referring to TinkerPop's notion of Path?

metlos commented 8 years ago

Yes, it is a normal property.

pietermartin commented 8 years ago

Ok so in a way it's a global property. I'll think a bit about it.

What concerns me is that a special table to store this global property will be a bottleneck uber table. This kind of thing generally bites when it's too late. It is however a light table with only one property so perhaps it's not so bad.

Is your start traversal of the form `g.V().has("path", "this/that/something")`?

metlos commented 8 years ago

I just updated that :smile: It used to be like you wrote: g.V().has("path", "/this/that");

I realized that is probably not that great with Sqlg. Our paths contain the type of the individual segments and therefore I am able to deduce the label of the vertices, so right now, the search starts as (this is not even committed yet): g.V().hasLabel("deducedFromPath").has("path", "/this/that");

That actually shaved off some 20% from the running time of our unit tests. Not sure about the actual speed up of the query running time but it seems non-trivial.

pietermartin commented 8 years ago

For g.V().has("path", "/this/that/") Sqlg will query every table with a "path" property. 100 tables with "path" = 100 queries. Narrowing it down to the label is indeed a good idea.
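The fan-out described above can be illustrated with a small model of which tables a lookup has to touch. `QueryFanOut` and its data layout are assumptions for illustration; they do not reflect how Sqlg plans queries internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class QueryFanOut {
    // Returns the labels (one table each) that must be queried for a property lookup.
    // A null label models g.V().has(property, value); a non-null label models
    // g.V().hasLabel(label).has(property, value).
    static List<String> tablesToQuery(Map<String, Set<String>> propertiesByLabel,
                                      String property, String label) {
        List<String> tables = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : propertiesByLabel.entrySet()) {
            boolean labelMatches = label == null || label.equals(entry.getKey());
            if (labelMatches && entry.getValue().contains(property)) {
                tables.add(entry.getKey());
            }
        }
        return tables;
    }
}
```

With no label, every label that declares the property contributes one query; with a label, at most one table is touched.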

JPMoresmau commented 7 years ago

The discussion seems to be about something slightly different than the title of the issue, but I have another use case which I think fits with the title, at least: I'd like to be able to add metadata information. For example, I would like to add a property to edges, something my application could use. If I modify the topology directly via the gremlin operations, the write operation works, but on restarting sqlg does not read it because its metamodel is fixed (see SchemaManager.loadTopology()). Has it been considered that the metadata could maybe be reflected from the schema?

pietermartin commented 7 years ago

Not sure I understand. How are you adding a property directly to the topology? As things stand, adding meta data directly to the topology, i.e. sqlg_schema, is not really supported, as in I never considered that. It should work if you add the edge with the additional property, commit and then delete it again. If that does not work then it's a bug.

The meta data this issue refers to is more meta meta data, i.e. Multiplicity, Order and Global Uniqueness of a property or, for that matter, of an edge.

If you are going to have a look at this then use/look at the removeHazelcast branch, as I have done the initial refactor there. Basically it completely replaces the current distributed map way of storing the schema with a more OO model reflecting the topology as captured in the db, and uses Postgresql's notify functionality to distribute the schema to other JVMs. In this branch I have not yet looked at making the topology directly editable, but it is and was part of the plan for this refactor.

Currently all tests except for the unique constraint tests are passing. After implementing a simple unique constraint on a single property I was planning on making the public api for changing the topology solid. After that I was thinking of investigating GraphQL or whatever is the best way to have a text representation of the topology. From some minor googling, although it would be super nice to support GraphQL, it was not obvious to me that it is the best topology representation. I need to stare at their IDL definitions more to get it.

JPMoresmau commented 7 years ago

Yes, I'm adding properties to the Edge vertex in sqlg_schema. The write works ok, but if I disconnect and reconnect the property is not recognized because the Edge vertex and its properties are hardcoded. If I try to recreate the property, it fails because the column already exists. Not critical, but it'd be good to add metadata to specify that a certain Edge has a certain property that my application could make use of. So I suppose it's similar to the isUnique, except that I don't expect sqlg to do anything with it, just store it for me.

pietermartin commented 7 years ago

Ah I get it. Currently Edge->property is indeed hardcoded/assumed to be a regular property not a meta property. I'll think about how to allow users to create custom meta properties.

Just to check again, is your property a meta property i.e. not a column?

Can you give me an indication of what your meta property is? Is it expected to have any impact on the db itself, like a column or constraint or something or just pure meta data for the client to use?

I presume sqlg_schema's Vertex, Edge and Property should all support custom meta properties then.

JPMoresmau commented 7 years ago

The use case for us is to implement a "contains" type of relationship, I suppose a kind of foreign key "with delete cascade". When we delete a vertex, we also want to delete all vertices linked by special edges from that vertex. We know which edges have these properties. We can of course store this directly in the data (we have the info when we create the vertices) or in a separate table, but just being able to add a property "isContainerEdge" or something like that on the edge in the topology would be great.
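A minimal in-memory sketch of the cascade described above: deleting a vertex also deletes every vertex reachable from it over edges whose label is flagged as a container edge. `ContainmentGraph` and the flagging mechanism are hypothetical; the flag stands in for an "isContainerEdge" meta property stored on the edge in the topology.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ContainmentGraph {
    private static final class Edge {
        final String label, from, to;
        Edge(String label, String from, String to) {
            this.label = label; this.from = from; this.to = to;
        }
    }

    private final Set<String> vertices = new HashSet<>();
    private final List<Edge> edges = new ArrayList<>();
    private final Set<String> containerEdgeLabels = new HashSet<>();

    // Flag an edge label as a container edge (stand-in for the meta property).
    void markContainerEdge(String edgeLabel) { containerEdgeLabels.add(edgeLabel); }

    void addEdge(String label, String from, String to) {
        vertices.add(from);
        vertices.add(to);
        edges.add(new Edge(label, from, to));
    }

    boolean contains(String vertexId) { return vertices.contains(vertexId); }

    // Delete the vertex plus everything reachable from it via outgoing container edges.
    void deleteCascade(String vertexId) {
        Set<String> doomed = new HashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(vertexId);
        while (!pending.isEmpty()) {
            String current = pending.pop();
            if (!doomed.add(current)) continue; // already visited
            for (Edge e : edges) {
                if (e.from.equals(current) && containerEdgeLabels.contains(e.label)) {
                    pending.push(e.to);
                }
            }
        }
        vertices.removeAll(doomed);
        // Drop every edge that touches a deleted vertex.
        edges.removeIf(e -> doomed.contains(e.from) || doomed.contains(e.to));
    }
}
```

Vertices linked only by non-container edges survive the delete, which is the distinction the meta property would need to express.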

pietermartin commented 7 years ago

Ok nice, going with UML semantics they call it a composite relationship. I have not thought of this one yet with regards to the topology refactor.

Attached is the latest UMLG diagram of the ideas I have had. I have not started on the meta stuff yet as I'd first like to complete the first refactor iteration.

However I drew some UML to remind me. isComposition could be added to what is already there. The UML diagram is in sql_core/src/main/model. You should open it with Eclipse Papyrus if you wanna have a look at it.

[Image: topologyclassdiagram]

pietermartin commented 2 years ago

Will use new tickets to address higher order uml features.