Moving Elide toward supporting Knowledge Graph data CRUD

QubitPi commented 3 months ago

Elide has been used at Yahoo for nearly 10 years and has successfully supported many of Yahoo's businesses.

One of Elide's key advantages is its extremely flexible architecture. It supports not only SQL databases, for which Elide was initially designed, but has also been extended to the non-SQL world, such as text search similar to Elasticsearch, and Apache Druid through its analytic module.

As a startup focused on knowledge graph applications and technology innovation, we would like Elide to natively support CRUD on a new type of data: graph data. This would bring at least 2 benefits:

Keep Paion on the cutting edge of exploring new technologies and opportunities
Bring this classic Yahoo software to the Chinese tech market through the opportunity that knowledge graph technology offers

This initiative of supporting graph data in Elide will benefit the company's most important product, Nexus Graph.

Short-Term Goal - Storing Graph Data in SQL Database

The relationship between a graph node and link is definitely one-to-many, given that one node can have multiple outgoing/incident links.

Intuitively one might model node and link in the following way

@Entity
@Table(name = "Node")
@Include(rootLevel = false, name = "node", description = "graph node", friendlyName = "node")
public class Node {

    ...

    /**
     * All incoming and outgoing links attached to this node.
     */
    @OneToMany(cascade = CascadeType.ALL)
    @GraphQLDescription("All incoming and outgoing links attached to this node.")
    public List<Link> links;
}

@Entity
@Table(name = "Link")
@Include(rootLevel = false, name = "link", description = "graph link", friendlyName = "link")
public class Link {

    ...

    /**
     * The source node.
     */
    @ManyToOne
    @GraphQLDescription("The source node.")
    public Node source;

    /**
     * The target node.
     */
    @ManyToOne
    @GraphQLDescription("The target node.")
    public Node target;
}

There are at least 3 problems with this design

Semantically, Node.links is mapped by both Link.source and Link.target, but mappedBy can take only one of them.
By experiment, there is no way to create a knowledge graph in one GraphQL mutation:
```
mutation {
  node {
     id
  }
  link(id: <-- node.id) {
     id
  }
}
```
Creating a graph with nodes and links requires the ID of each node, once created, to be passed to the link so that the link can reference it as its source or target node.

In GraphQL, fields at the same "level" of a request are resolved independently of one another (for queries, possibly in parallel). In this example, node and link are sibling fields of the same root type, so neither is aware of the other or of what the other resolves to.

We could make a nested query, but nodes and links are intrinsically not nested: a link always brings a second node with it, and the two nodes sit on the same "level", which cannot be resolved in a deterministic order.

We would have to handle this kind of scenario on the client (Nexus Graph) side. In other words, create the nodes first, parse the response for the node IDs, and then make a second request.
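
A minimal sketch of that two-request flow, assuming an in-memory stand-in for the remote API. `FakeGraphApi` and its methods are hypothetical and are not part of Elide; each method models one request/response round trip.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TwoStepGraphCreation {

    record LinkRow(long id, long sourceId, long targetId) {}

    /** Hypothetical stand-in for the server; IDs are generated on the "server" side. */
    static class FakeGraphApi {
        private long nextId = 1;
        final List<LinkRow> links = new ArrayList<>();

        /** Request 1: creates nodes and returns the generated ID for each (unique) name. */
        Map<String, Long> createNodes(List<String> names) {
            Map<String, Long> ids = new LinkedHashMap<>();
            for (String name : names) {
                ids.put(name, nextId++);
            }
            return ids;
        }

        /** Request 2: creates a link, which is only possible once both node IDs are known. */
        LinkRow createLink(long sourceId, long targetId) {
            LinkRow link = new LinkRow(nextId++, sourceId, targetId);
            links.add(link);
            return link;
        }
    }

    public static void main(String[] args) {
        FakeGraphApi api = new FakeGraphApi();

        // First round trip: create the nodes and parse the response for their IDs.
        Map<String, Long> ids = api.createNodes(List.of("NodeA", "NodeB"));

        // Second round trip: create the link using the IDs from the first response.
        LinkRow link = api.createLink(ids.get("NodeA"), ids.get("NodeB"));

        System.out.println(ids + " " + link);
    }
}
```

The two calls are not atomic: if the second request fails, the nodes from the first one already exist, which is the drawback of handling this on the client.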

An even more serious issue is that the relationship is no longer @OneToMany. Because a link connects up to 2 nodes, the table on the owning side (i.e. Link) of the foreign key would need two rows with the same link ID, which is not possible. For example, suppose we have two nodes A and B with IDs 1 and 2, connected by 3 unidirectional links whose IDs are 10, 11, and 12:

              10              
 ┌─────────────────────────┐  
 │                         │  
┌─▼─┐                     ┌─┴─┐
│   │          11         │   │
│ A ◄─────────────────────┤ B │
│   │                     │   │
└─▲─┘                     └─┬─┘
 │            12           │  
 └─────────────────────────┘

This would result in a join table like

+---------+---------+
| node_id | link_id |
+---------+---------+
|       1 |      10 |
|       1 |      11 |
|       1 |      12 |
|       2 |      10 |
|       2 |      11 |
|       2 |      12 |
+---------+---------+

The table above represents a many-to-many relationship, which breaks the one-to-many relationship between node and links.
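
The join table above can be reproduced with a small plain-Java sketch (no JPA or Elide involved; `Link`, `Row`, and the helper methods are hypothetical names used only for illustration). It emits one row per (node, link) attachment and counts how many rows each link ID occupies. Any count above 1 is what rules out a one-to-many mapping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class IncidenceTableSketch {

    /** A directed link between two node IDs. */
    record Link(long id, long sourceId, long targetId) {}

    /** One row of the node/link join table. */
    record Row(long nodeId, long linkId) {}

    /** Emits one row per end of every link, the way Node.links would have to be stored. */
    static List<Row> incidenceRows(List<Link> links) {
        List<Row> rows = new ArrayList<>();
        for (Link link : links) {
            rows.add(new Row(link.sourceId(), link.id()));
            if (link.targetId() != link.sourceId()) { // a self-loop touches only one node
                rows.add(new Row(link.targetId(), link.id()));
            }
        }
        return rows;
    }

    /** Counts how many join-table rows each link ID appears in. */
    static Map<Long, Integer> rowsPerLink(List<Row> rows) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (Row row : rows) {
            counts.merge(row.linkId(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Node A has ID 1, node B has ID 2; all three links in the diagram run from B to A.
        List<Link> links = List.of(
                new Link(10, 2, 1),
                new Link(11, 2, 1),
                new Link(12, 2, 1));

        List<Row> rows = incidenceRows(links);
        System.out.println(rows.size());       // prints 6
        System.out.println(rowsPerLink(rows)); // prints {10=2, 11=2, 12=2}
    }
}
```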

Given this difficulty with Elide in such a business case, the short-term goal is to design a JPA model of a knowledge graph such that all CRUD operations against a knowledge graph can be done in one GraphQL operation.

Mid-Term Goal - Storing Graph Data in Neo4J/Arango Graph Database

Elide started as a JPA web service that exposes a relational data store via a relational JPA data model. Initially everything was relational. It is important to realize that Elide later added support for non-relational data CRUD through one of two approaches:

Relational JPA data model backed by Non-relational data store

The search data store is essentially Apache Lucene, the engine behind Elasticsearch, wrapped by Hibernate. Basically everything in Elide is still relational at this point.

Non-relational data model backed by relational data store

As Elide continued to evolve, it started supporting CRUD operations against non-SQL databases via analytic queries, which are commonly used in Business Intelligence applications.

Basically, Elide achieves this through its concept of a semantic layer, which maps a relational view of data to an arbitrary non-relational view for the user.

Working toward graph data, Paion Data will make Elide support a new, 3rd mechanism:

Non-relational data model backed by non-relational data store (Neo4J/ArangoDB)

Long-Term Goal - Unified API Layer Aggregating Heterogeneous Databases

In the long run, Nexus Graph will need to deal with data from a variety of sources, including AI-inferred data, user-generated knowledge graph data, and graph data from the World Wide Web. Each type of data will be stored in a different type of database. The key to evolving our product successfully is to unify the data API for efficient data management. Elide will ultimately become the single data API aggregating arbitrarily heterogeneous data sources for this purpose.
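
As a rough illustration of the "single data API" idea, the sketch below routes each entity type to whichever store is registered for it. The `Store` and `Router` types and the entity names are hypothetical; this does not use or describe Elide's actual data store interfaces.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UnifiedDataApiSketch {

    /** Minimal read-only store abstraction; a real one would also cover writes, transactions, and filters. */
    interface Store {
        List<String> load(String entityType);
    }

    /** Routes each entity type to the store registered for it, hiding which database backs it. */
    static class Router {
        private final Map<String, Store> storeByEntity = new HashMap<>();

        void register(String entityType, Store store) {
            storeByEntity.put(entityType, store);
        }

        List<String> load(String entityType) {
            Store store = storeByEntity.get(entityType);
            if (store == null) {
                throw new IllegalArgumentException("No store registered for " + entityType);
            }
            return store.load(entityType);
        }
    }

    public static void main(String[] args) {
        Router router = new Router();
        // Each lambda stands in for a different backing database.
        router.register("inferredFact", type -> List.of("fact-1"));  // e.g. AI-inferred data
        router.register("node", type -> List.of("NodeA", "NodeB"));  // e.g. a graph database

        System.out.println(router.load("node")); // prints [NodeA, NodeB]
    }
}
```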

Doom9527 commented 3 months ago

A possible solution is to use a linking table to store the Link entities, which includes two foreign key fields pointing to the source and target nodes respectively. This way, you can maintain a one-to-many relationship.
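
One reading of this suggestion, sketched in plain Java rather than JPA: every link is a single row holding two foreign keys, and a node's links are derived by matching either key. `LinkRow`, `outgoing`, and `incoming` are hypothetical names used only for illustration.

```java
import java.util.List;

public class TwoForeignKeySketch {

    /** One row of a hypothetical link table with two foreign keys into the node table. */
    record LinkRow(long id, long sourceId, long targetId) {}

    /** Links leaving the node: rows whose source foreign key matches. */
    static List<LinkRow> outgoing(List<LinkRow> links, long nodeId) {
        return links.stream().filter(l -> l.sourceId() == nodeId).toList();
    }

    /** Links arriving at the node: rows whose target foreign key matches. */
    static List<LinkRow> incoming(List<LinkRow> links, long nodeId) {
        return links.stream().filter(l -> l.targetId() == nodeId).toList();
    }

    public static void main(String[] args) {
        // The three B -> A links from the example above (A has ID 1, B has ID 2).
        List<LinkRow> links = List.of(
                new LinkRow(10, 2, 1),
                new LinkRow(11, 2, 1),
                new LinkRow(12, 2, 1));

        System.out.println("outgoing from node 2: " + outgoing(links, 2).size()); // prints outgoing from node 2: 3
        System.out.println("incoming to node 1: " + incoming(links, 1).size());   // prints incoming to node 1: 3
    }
}
```

Under this reading each link ID appears in exactly one row, and the node-to-link relationship splits into two separate one-to-many relationships (outgoing and incoming) instead of a single Node.links collection.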

QubitPi commented 3 months ago

What would the GraphQL queries look like in the following cases?

Creating a new graph in one request
Updating a new graph in one request
Fetching graph by ID
Fetching all nodes of a graph without full table scan

QubitPi commented 3 months ago

> A possible solution is to use a linking table to store the Link entities, which includes two foreign key fields pointing to the source and target nodes respectively. This way, you can maintain a one-to-many relationship.

This would not be considered a "solution" because

It's incomplete
What's your definition of a "graph"?
Does this design allow atomicity of all possible graph requests mentioned above?

I'm pretty sure you will run into trouble down the path once you include enough details.

QubitPi commented 3 months ago

It also helps to write a draft of your data model with some code snippets. That helps your audience better understand your idea.

Doom9527 commented 3 months ago

The GraphQL query will look like: Creating a new graph in one request

Updating a new graph in one request

mutation UpdateGraph {
  updateSource(id: "nodeAId", input: {name: "Updated Node A"}) {
    id
    name
  }
}

Fetching graph by ID

query GetGraphById($nodeId: ID!) {
  source(id: $nodeId) {
    id
    name
    links {
      id
      source {
        id
        name
      }
      target {
        id
        name
      }
    }
  }
}

Fetching all nodes of a graph without full table scan

query GetAllNodes {
  allSources {
    edges {
      node {
        id
        name
        links {
          edges {
            node {
              id
              source {
                id
                name
              }
              target {
                id
                name
              }
            }
          }
        }
      }
    }
  }
}

QubitPi commented 3 months ago

In “Updating a new graph in one request”, what if we added 2 new nodes with 1 unidirectional link between them and, in the same request, updated the name of a 3rd node and the label of another link?
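
To make the question concrete, here is a sketch of what such a mixed request has to achieve, written as a plain-Java batch with client-chosen temporary IDs. Every type and method here is hypothetical and in-memory; it is not an Elide or GraphQL API. It also assumes that temporary IDs never collide with real IDs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AtomicGraphBatchSketch {

    record LinkRow(String sourceId, String targetId, String label) {}

    /** One step of a batch. All four operation types are hypothetical. */
    interface Op {}
    record CreateNode(String tempId, String name) implements Op {}
    record CreateLink(String sourceRef, String targetRef, String label) implements Op {}
    record RenameNode(String id, String name) implements Op {}
    record RelabelLink(String id, String label) implements Op {}

    static class GraphStore {
        Map<String, String> nodeNames = new HashMap<>();
        Map<String, LinkRow> links = new HashMap<>();
        private int nextId = 1;

        /** Applies the whole batch or nothing: edits go to copies that are swapped in at the end. */
        void apply(List<Op> batch) {
            Map<String, String> nodes = new HashMap<>(nodeNames);
            Map<String, LinkRow> linkRows = new HashMap<>(links);
            Map<String, String> tempToReal = new HashMap<>();
            int id = nextId;
            for (Op op : batch) {
                if (op instanceof CreateNode c) {
                    String realId = String.valueOf(id++);
                    tempToReal.put(c.tempId(), realId); // later ops may refer to the temp ID
                    nodes.put(realId, c.name());
                } else if (op instanceof CreateLink c) {
                    String source = resolve(c.sourceRef(), tempToReal, nodes);
                    String target = resolve(c.targetRef(), tempToReal, nodes);
                    linkRows.put(String.valueOf(id++), new LinkRow(source, target, c.label()));
                } else if (op instanceof RenameNode r) {
                    nodes.put(resolve(r.id(), tempToReal, nodes), r.name());
                } else if (op instanceof RelabelLink r) {
                    LinkRow old = linkRows.get(r.id());
                    if (old == null) throw new IllegalArgumentException("Unknown link " + r.id());
                    linkRows.put(r.id(), new LinkRow(old.sourceId(), old.targetId(), r.label()));
                }
            }
            nodeNames = nodes; // reached only if no operation threw
            links = linkRows;
            nextId = id;
        }

        /** Maps a temporary ID to its real ID, or accepts an ID that already exists. */
        private static String resolve(String ref, Map<String, String> tempToReal, Map<String, String> nodes) {
            String id = tempToReal.getOrDefault(ref, ref);
            if (!nodes.containsKey(id)) throw new IllegalArgumentException("Unknown node " + ref);
            return id;
        }
    }

    public static void main(String[] args) {
        GraphStore store = new GraphStore();
        // Seed data: the nodes get IDs "1" and "2", the link gets ID "3".
        store.apply(List.of(new CreateNode("c", "Node C"), new CreateNode("d", "Node D"),
                new CreateLink("c", "d", "old label")));
        // The mixed request: 2 new nodes, 1 link between them, 1 rename, 1 relabel.
        store.apply(List.of(
                new CreateNode("a", "Node A"),
                new CreateNode("b", "Node B"),
                new CreateLink("a", "b", "knows"),
                new RenameNode("1", "Renamed C"),
                new RelabelLink("3", "new label")));
        System.out.println(store.nodeNames);
        System.out.println(store.links);
    }
}
```

Whether a single GraphQL mutation against the proposed model can express the same thing, including the temporary-ID references, is exactly what the question is asking.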

Doom9527 commented 3 months ago

This is the situation where fields at each "level" of the request are executed and resolved in parallel, so I cannot send it in one request. I have to think about designing a new JPA model or using some other solution.

Doom9527 commented 3 months ago

I used a many-to-many association table to design the JPA model:

@Entity
@Table(name = "Node")
@Include(rootLevel = true, name = "node", description = "node entity", friendlyName = "node")
public class Node {

    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    private Long id;

    private String name;

    @ManyToMany(mappedBy = "nodes")
    private Set<Link> links = new HashSet<>();
}

@Entity
@Table(name = "Link")
@Include(rootLevel = true, name = "link", description = "graph link", friendlyName = "link")
public class Link {
    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    private Long id;

    private String label;

    @ManyToMany
    @JoinTable(
            name = "NodeLink",
            joinColumns = @JoinColumn(name = "link_id"),
            inverseJoinColumns = @JoinColumn(name = "node_id")
    )
    private Set<Node> nodes = new HashSet<>();
}

The GraphQL query will look like: Creating a new graph in one request

mutation createGraph {
    link(
        op: UPSERT
        data: { label: "label", nodes: [{ name: "NodeA" }, { name: "NodeB" }] }
    ) {
        edges {
            node {
                id
                label
                nodes {
                    edges {
                        node {
                            id
                            name
                        }
                    }
                }
            }
        }
    }
}

Creating two nodes in one step:

mutation createNodes {
    node1: node(op: UPSERT, data: { name: "node1" }) {
        edges {
            node {
                id
                name
            }
        }
    }
    node2: node(op: UPSERT, data: { name: "node2" }) {
        edges {
            node {
                id
                name
            }
        }
    }
}

Adding a link between them:

mutation addLink {
    link(
        op: UPSERT
        data: { label: "label", nodes: [{ id: 1 }, { id: 2 }] }
    ) {
        edges {
            node {
                id
                label
                nodes {
                    edges {
                        node {
                            id
                            name
                        }
                    }
                }
            }
        }
    }
}

Doom9527 commented 3 months ago

I set rootLevel to true for both Node and Link. If I set rootLevel to false for the former, I would consider adding a sentinel to the latter so that a single Node can still be created.

QubitPi commented 3 months ago

The new data model looks better, although there are still concerns that need to be addressed:

Semantically, node and link cannot be many-to-many. A link must have exactly one source node and exactly one target node (although source and target can be the same). Take a look at the example below:
Nexus Graph needs to be able to fetch all nodes and links of a graph. In the new data model, how would we do that? Specifically, a user can have many knowledge graphs. Each graph has n nodes, m links, and a graph ID
joinColumns is not portable across database systems. This would become a big problem for API libraries like Elide:
Another issue I saw with the query creating a new graph is: how would we reference the IDs of newly created nodes in one mutation? We would need them to determine the source and target nodes of a link. For example, if we need a unidirectional link between NodeA and NodeB, how do we atomically create a link like this?
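
To illustrate the first two concerns, here is a plain-Java sketch (not JPA, and not a proposed final model) in which a link has exactly one source and one target, and every node and link carries a graph ID so that a graph's contents can be looked up without scanning everything. All names are hypothetical; `GraphIndex` merely plays the role a database index on a graph ID column would play.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class GraphScopedModelSketch {

    record Node(long id, long graphId, String name) {}

    /** A link has exactly one source and one target; a self-loop uses the same node for both. */
    record Link(long id, long graphId, Node source, Node target, String label) {
        Link {
            Objects.requireNonNull(source, "source");
            Objects.requireNonNull(target, "target");
            if (source.graphId() != graphId || target.graphId() != graphId) {
                throw new IllegalArgumentException("Both ends must belong to graph " + graphId);
            }
        }
    }

    /** Groups nodes and links by graph ID so that one graph can be fetched directly. */
    static class GraphIndex {
        private final Map<Long, List<Node>> nodesByGraph = new HashMap<>();
        private final Map<Long, List<Link>> linksByGraph = new HashMap<>();

        void add(Node node) {
            nodesByGraph.computeIfAbsent(node.graphId(), k -> new ArrayList<>()).add(node);
        }

        void add(Link link) {
            linksByGraph.computeIfAbsent(link.graphId(), k -> new ArrayList<>()).add(link);
        }

        List<Node> nodesOf(long graphId) {
            return nodesByGraph.getOrDefault(graphId, List.of());
        }

        List<Link> linksOf(long graphId) {
            return linksByGraph.getOrDefault(graphId, List.of());
        }
    }

    public static void main(String[] args) {
        GraphIndex index = new GraphIndex();
        Node a = new Node(1, 100, "NodeA");
        Node b = new Node(2, 100, "NodeB");
        Node other = new Node(3, 200, "NodeInAnotherGraph");
        index.add(a);
        index.add(b);
        index.add(other);
        index.add(new Link(10, 100, a, b, "label")); // unidirectional: NodeA -> NodeB

        System.out.println(index.nodesOf(100).size()); // prints 2
        System.out.println(index.linksOf(100).size()); // prints 1
    }
}
```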

Doom9527 commented 2 months ago

I chose the initial JPA model:

@Entity
@Table(name = "Node")
@Include(rootLevel = true, name = "node", description = "node entity", friendlyName = "node")
public class Node {

    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    private Long id;

    private String name;

    @OneToMany(cascade = CascadeType.ALL)
    private Set<Link> links = new HashSet<>();
}

@Entity
@Table(name = "Link")
@Include(rootLevel = true, name = "link", description = "graph link", friendlyName = "link")
public class Link {
    @Id
    @GeneratedValue(strategy = GenerationType.AUTO)
    private Long id;

    private String label;

    @ManyToOne
    private Node source;

    @ManyToOne
    private Node target;
}

We can create a new graph in one request:

mutation createGraph {
    link(
        op: UPSERT
        data: {
            label: "label"
            source: { name: "NodeA" }
            target: { name: "NodeB" }
        }
    ) {
        edges {
            node {
                id
                label
                source {
                    edges {
                        node {
                            id
                            name
                        }
                    }
                }
                target {
                    edges {
                        node {
                            id
                            name
                        }
                    }
                }
            }
        }
    }
}

Here we create the graph by setting Link as the root node. I think this JPA model works because it keeps both the target and the source. After creating the graph, we can get the IDs of the target and source. We could also create a link between two existing nodes:

mutation addLink {
    link(
        op: UPSERT
        data: { label: "label", source: { id: 3 }, target: { id: 4 } }
    ) {
        edges {
            node {
                id
                label
                source {
                    edges {
                        node {
                            id
                            name
                        }
                    }
                }
                target {
                    edges {
                        node {
                            id
                            name
                        }
                    }
                }
            }
        }
    }
}
