opencypher / openCypher

Specification of the Cypher property graph query language
http://www.opencypher.org
Apache License 2.0
852 stars 149 forks source link

Considerations for Set Algebra in OpenCypher (CIP) #318

Open ThomasFrisendal opened 6 years ago

ThomasFrisendal commented 6 years ago

Introduction

This is a proposal for adding more explicit and extensive, and even ASCII art inspired support for set algebra in openCypher.

In my opinion sets of graphs is a once-in-a-lifetime opportunity that is much more than a revival: Doing set algebra on sets of graphs is potentially several orders of magnitude more powerful than SQL-based set algebra (or the simple lists etc. offered by early set analysis tools such as the BO Set Analyzer).

This issue (CIP on Github) is a brief overview of my proposal. All details on background, relationship to other CIP's and actual syntax extension requirements are found in this document: ConsiderationsForSetAlgebraInOpenCypherFullText.pdf

Here follows a brief review.

Why did set algebra not become more popular?

People of my age were taught set algebra at high-school (in my case in the late seventies). Today it is elementary school stuff. And it is indeed a useful tool with applications in many real-life situations.

However, set algebra never made it big time within it tools, applications and databases.

Two historic documents are of interest here:

Dr. Ted Codd. based some of his ideas on an "extended set theory", which was an idea formulated and described in a 1986 paper:

"Description of a Set-Theoretic Data Structure" by D. L. Childs ( https://www.computer.org/csdl/proceedings/afips/1968/5072/00/50720557.pdf ).

But Childs' extensions were not ideally suited, which is explained in quite some detail in this book:

The Algebra of Data - A Foundation for the Data Economy by Professor Gary Sherman & Robin Bloor, Ph.D. © 2015 by The Bloor Group Press.

The authors argue that mainstream Zermelo-Fraenkel set theory (Cantor), would have been a better starting point. One key issue is that sets should be able to be sets of sets.

Who are actualy using sets?

Set operations did not really break out of the SQL zone. I have been writing tons of SQL and in my experience, UNION is used sometimes, whereas INTERSECT and DIFFERENCE are not used very much (and mostly in metadata operations). UNION is most often used as "the poor DBA's ETL-tool" to consolidate data having different shapes from different systems into one structure, often as part of a view (and often spraying NULL's all over the place).

But, SAP owns a Set Analysis tool as part of Business Objects: http://www.crmodyssey.com/Documentation/Documentation_PDF/Business_Objects_Set_Analyser.pdf. And: https://help.sap.com/doc/businessobject_product_guides_boexir31sp3_en_xi31_sp3_sa_user_en_pdf/XI3.1.3/en-US/xi31_sp3_sa_user_en.pdf.

The Business Cases

The most obvious use case was/is customer segmentation (customer behavior analysis) and product campaign planning in marketing applications.

Today there are also strong user stories in the contexts of investigative work flows based on sets of suspects in graph based analysis of crime, intelligence, fraud, churn, recommendations and other behavior / networking analytical areas.

Graphs are Sets

All sets are not graphs. But graphs are sets. In the Data Algebra book by Sherman and Bloor (cf. reference above) there is a good argumentation for this. In fact there is a whole chapter (7) called Data Algebra and Graphs. I am not going to repeat that chain of arguments, but basically graphs can be understood as a relationship (which is not Dr. Codd's kind, and which in the data algebra parlance is called a "clan"), which in turn consists of "couplets" (e.g. representing a start node, the relationship and an end node). And sets of graphs have a lot more information than plain sets of the tabular kind. That is the real opportunity here.

openCypher 10 and multiple graphs

In the openCypher project, openCypher 10 is expected to contain much, sophisticated functionality related to what is necessary for doing "multiple graphs":

As far as I can see the CIR's are good and will most likely lead to the overall goals of graph closure in openCypher.

I am not arguing against any of the above. I see my proposal as being less "graph-technical" of nature and instead it is focused on ease-of-use at the business level by ordinary business folks, analysts and investigators. At this moment (openCypher 10) a unique opportunity exists to get a rock-solid, intuitive and mathematically closed language serving both aspects of the data structures, which people are interested in:

Data Model Considerations (CIP2017-06-18)

Since any graph can be considered a set in its own right, sets can be members of other sets.

Working with Named Graphs and Named Sets

Since many sets of graphs will contain more than one graph, also sets of graphs will have to be able to have a name:

We definitely need the provenance tracking and construction features of CIP2017-06-18, also in the set context.

Working with sets (of graphs) means that whenever the current proposals make reference to a specific graph (using FROM, for example), a set of graphs should be possible to use at that place.

Set Operations Considerations (CIP2017-04-20)

I have been boggling my mind trying to come up with some ASCII art for set operations. The point is that set operations will probably (mostly) be performed in a set context:

That means that the syntax to be used should be compact and intuitive (w/o command like syntax). This calls for some ASCII art, but I think ASCII (7-bit) has run out of steam?

Catalog Considerations (CIP2018-05-03)

Working on the set level: what I would like to see is (in loose terms) that the graph content be saved under a name, and that it can be recalled as either the graph or the paths constructing the result graph (again).

We are looking at the catalog, where we probably will need to handle scope of "namespaces", which span multiple users / groups or other hierarchical organisations of named objects.

Think fraud investigators as the business users and what they would like to be able to do.

Conclusion

There are not many additional things to do in order to provide a good platform for building solutions, which employ set algebra style applications on top of sets of graphs.

The business benefits of being able to do so are substantial, so I hope that openCypher 10 will be able to accommodate this functionality, too.

Here is the list of the 7 identified requirements:

Requirement 1: There should be support for "sets of graphs", which may include sets of graphs as members.

Requirement 2: Property graphs can have not only a name, defined by the user (supported by CIP2018-05-03), but also a description.

Requirement 3: Sets of graphs can have a name, defined by the user, who may also maintain a description of the set. This even applies to empty (at the moment) sets of property graphs.

Requirement 4: openCypher should include syntax for specifying set membership in place (where the set is named anyway). This includes membership in more than one named set.

Requirement 5: openCypher should include syntax for declaring set operations as part of the construction of sets of graphs. It should be done in an "ASCII art" style.

Requirement 6: There is a need for being able to create a "dynamic set of graphs" or a "dynamic graph".

Requirement 7: The catalog should support end-user ownership and maintenance of "libraries" ("folders" of named graphs and named sets of graphs).

7 could be implementation specific.

1, named sets of graphs will have to plug in in many places of the language, but from what I have seen, I don't believe there are any conflicts. If users define meaningless sets of graphs, the (lack of) matching results will quickly show them that matching apples and pears generally is only a good idea in a shared context.

The latest years have shown that business oriented analysts, data scientists and investigators can use openCypher. Not least because of the design of the language to be intuitive by way of "ASCII art". By doing that again, this time on set algebra, these communities will be able to work even more productively. Workflows will focus on finding, cataloguing and reusing named sets of graphs to produce rock-solid and documentable contexts of patterns of behavior.

The full text including background, relationships and examples is found here: ConsiderationsForSetAlgebraInOpenCypherFullText.pdf

git commit --signoff

Mats-SX commented 6 years ago

@ThomasFrisendal Thank you for a very interesting proposal. I took the liberty of doing some editorial changes to your text; please object if you feel I've misconstrued something.

I'm wondering a bit about the set idea. Normally in Cypher when you do a match, like so:

FROM my.catalog.graph
MATCH (a)-[r]->(b)

the result is a binding table containing three columns a, r and b. If we imagine that the FROM clause was provided not a single graph but a set of graphs, will the result of the above MATCH then be a set of tables?

Assuming the answer is affirmative, what happens if we were to RETURN? Is the result the set of binding tables? How does this play with respect to concurrency; if the query is operating on a set of graphs, will it semantically do each operation on the full set before continuing with the next operation on the "first" member of the set? Especially in the presence of updates this becomes important to consider.

ThomasFrisendal commented 6 years ago

Hi Mats

Thank you - I am glad that you like it.

There is definitely some more research needed in this area. Some of my thoughts are:

On the business side I would be very interested to hear comments from SAP based on their set analysis product?

I will get back to this in a formal way, this was just to let you know.

Best regards

Thomas

Mats Rydberg wrote:

@ThomasFrisendal https://github.com/ThomasFrisendal Thank you for a very interesting proposal. I took the liberty of doing some editorial changes to your text; please object if you feel I've misconstrued something.

I'm wondering a bit about the set idea. Normally in Cypher when you do a match, like so:

FROM my.catalog.graph MATCH (a)-[r]->(b)

the result is a /binding table/ containing three columns |a|, |r| and |b|. If we imagine that the |FROM| clause was provided not a single graph but a set of graphs, will the result of the above |MATCH| then be a /set of tables/?

Assuming the answer is affirmative, what happens if we were to |RETURN|? Is the result the set of binding tables? How does this play with respect to concurrency; if the query is operating on a set of graphs, will it semantically do each operation on the full set before continuing with the next operation on the "first" member of the set? Especially in the presence of updates this becomes important to consider.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/opencypher/openCypher/issues/318#issuecomment-408349624, or mute the thread https://github.com/notifications/unsubscribe-auth/ALm64_qJn6LgEPVyEZ1-q0DapB9cwqWUks5uKs7JgaJpZM4VQtBD.

-- Sent from Postbox https://www.postbox-inc.com/?utm_source=email&utm_medium=siglink&utm_campaign=reach