Closed Derek-Wds closed 4 years ago
Great idea, cora can be our demo GCN dataset~
I think the overall design of data schema is reasonable, I just have some thought about it as follows.
About the Syntax and Implementation
As for node data: SQLFlow supports comma-separated strings; we just need to use SPARSE(features) or DENSE(features) in the feature column clause. So maybe we can store the features column in this format, and then write the train statement like:
SELECT * FROM my_graph_nodes
TO TRAIN GCNModel
COLUMN DENSE(features)
LABEL class
INTO my_gcn_model;
However, SQLFlow can't load two tables at once, so maybe we need to specify the edge table in the WITH clause. That is to say, treat the edge table as a parameter for training. The train logic will read the edge table and build the graph. The final statement will look like:
SELECT * FROM my_graph_nodes
TO TRAIN GCNModel
WITH edge_table='my_edge_table',other_gcn_param='xxx'
COLUMN DENSE(features)
LABEL class
INTO my_gcn_model;
Other things to consider
- How to read edge data when training? (Use our DB API?)
- What are the model's abilities and limitations?
- Can we support distributed training?
- How to save and load the model?
- Is the model explainable?
As we are currently focused on data storage, we can treat these questions as considerations for our model implementation design later.
Nice! @lhw362950217 This seems to be a decent solution for getting the data from the database. Just to make sure I understand correctly: for getting the edge table, we treat it as a training parameter and build the graph before feeding it into the model.
> As we are currently focused on data storage, we can treat these questions as considerations for our model implementation design later.
Yes, these things should definitely be considered when merging the model into SQLFlow. I will do some research on the possibility of training the GCN model in a distributed way, since graphs are quite special and we cannot simply split them into small batches as usual.
@Derek-Wds Yes, we build the graph before training. Users just specify the edge table name. We can get the table-name parameter, and we need to write a piece of code that loads the graph as a pre-training action.
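A minimal sketch of such a pre-training step, assuming the edge rows have already been fetched from the edge table through the DB API (the function name and row format here are illustrative, not SQLFlow's actual API):

```python
import numpy as np
from scipy.sparse import coo_matrix

def build_adjacency(edge_rows, node_ids):
    """Build an (N, N) sparse adjacency matrix from edge-table rows.

    edge_rows: iterable of (from_node_id, to_node_id, weight) tuples,
               e.g. fetched from the edge table via the DB API.
    node_ids:  node ids in the same order as the node-table feature rows.
    """
    index = {nid: i for i, nid in enumerate(node_ids)}  # id -> matrix row
    rows, cols, vals = [], [], []
    for src, dst, weight in edge_rows:
        rows.append(index[src])
        cols.append(index[dst])
        vals.append(weight)
    n = len(node_ids)
    return coo_matrix((vals, (rows, cols)), shape=(n, n))

# Toy data mirroring the proposed node/edge tables.
node_ids = [105, 106]
edges = [(105, 106, 1.0), (106, 105, 2.5)]
adjacency = build_adjacency(edges, node_ids)
```

COO is used here because it is cheap to construct from an edge list and is what the proposal suggests feeding to the model.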
Some advice:
@typhoonzero Thanks for the advice! I will modify the proposal a bit and make everything more clear.
- The proposal should describe the input/output data schema (shape, data type, etc.) of the model.
As I mentioned above, there are two inputs to the model: features and the adjacency matrix. features should be a 2-D tensor with shape (N, D), where N is the number of nodes and D is the dimension of each node's feature vector. The adjacency matrix is a 2-D sparse tensor with shape (N, N) which describes the connections between nodes.
As for the output of the model, this depends on the specific task. If the model is used for classification, the output may be a tensor with shape (B, L), where B is the number of data points and L is the number of possible labels. For other tasks, the output may vary a lot. The GCN will mainly be used for a classification task (predicting the label of each node).
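To make the shapes concrete, here is a minimal NumPy sketch of a single graph-convolution step, H = A·X·W (the names, random values, and the trivial identity adjacency are illustrative only):

```python
import numpy as np

N, D, L = 4, 3, 2                 # nodes, feature dim, number of classes

X = np.random.rand(N, D)          # features, shape (N, D)
A = np.eye(N)                     # adjacency; here only self-loops, shape (N, N)
W = np.random.rand(D, L)          # trainable weights, shape (D, L)

# One graph-convolution step: aggregate neighbor features, then project.
H = A @ X @ W                     # shape (N, L): one score per node per class
```

In a real GCN the adjacency would be the (normalized) sparse matrix built from the edge table rather than an identity matrix.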
For the data in the database, the nodes and their features will be stored in the node table, while the connections between nodes will be stored as an edge list in an edge table. As long as the data can be properly accessed, we can use Python packages to convert it into the correct format for the model input (the two methods introduced above).
- How to deal with a large dataset if the edges can not be loaded into memory.
This is actually a tough problem. Since graphs are different from other data such as images, they are hard to split into small batches for training or testing. One possible solution for handling large graphs is an advanced method named GraphSAGE. Instead of considering the entire graph in a single step, the algorithm focuses on a small local neighborhood to learn the embedding of a given node. This improves the generalization ability of the model and makes the learning inductive instead of transductive. If the graph is too big to be loaded, the GraphSAGE algorithm may be used to make the learning process more efficient. Similar work also exists under the name PinSAGE. (These two algorithms are quite complicated, and I haven't figured out exactly what they are doing.)
We could list this as future work, which I believe is essential if we want to solve real-world problems using GCN.
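The core idea of fixed-size neighborhood sampling can be sketched independently of the full algorithm; the adjacency-list format and function below are illustrative, not GraphSAGE's actual implementation:

```python
import random

def sample_neighbors(adj_list, node, k, seed=0):
    """Uniformly sample up to k neighbors of `node` (with replacement),
    in the spirit of GraphSAGE's fixed-size neighborhood sampling."""
    rng = random.Random(seed)
    neighbors = adj_list.get(node, [])
    if not neighbors:
        return []
    return [rng.choice(neighbors) for _ in range(k)]

# Toy adjacency list: node id -> list of neighbor ids.
adj_list = {105: [106, 107], 106: [105], 107: [105]}
sampled = sample_neighbors(adj_list, 105, k=3)
```

Because each node only ever touches a fixed-size sampled neighborhood, the full graph never needs to fit in memory at once, which is the property that matters for large datasets.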
@Derek-Wds How would you load the train-table data if you use a SQL statement like this? Would you load the data in the model definition?
SELECT * FROM my_graph_nodes
TO TRAIN GCNModel
WITH edge_table='my_edge_table',other_gcn_param='xxx'
COLUMN DENSE(features)
LABEL class
INTO my_gcn_model;
And how do we load the test data at prediction time?
I haven't fully understood how SQLFlow converts these statements into Python, but from what I'm seeing now, the node data can be accessed directly while we could get the edge data using the DB API. The process is similar during the test phase.
The original thought was to load both tables at once, but as @lhw362950217 said, SQLFlow is not capable of such an operation. Thus, this is the most promising solution so far. If other possibilities exist, it would be great to discuss them here.
> from what I'm seeing now, the node data can be directly accessed while we could get the edge data using the DB API
I mean, you would load the edge data using the DB API in the model code? That may not be a nice way.
No, the model code will not contain anything related to data processing. The DB API will be used before we feed the data into the model.
@Derek-Wds SQLFlow only loads the data from the standard SELECT statement before the TO TRAIN syntax. That is to say, in the following SQL statement SQLFlow would only load the data from SELECT * FROM my_graph_nodes. The edge data indicated in WITH edge_table='my_edge_table' would not be loaded automatically.
SELECT * FROM my_graph_nodes
TO TRAIN GCNModel
WITH edge_table='my_edge_table',other_gcn_param='xxx'
COLUMN DENSE(features)
LABEL class
INTO my_gcn_model;
Thanks for pointing it out! I wonder if it is possible to modify the logic behind the WITH statement: when the parameter edge_table comes in, we add some code that selects a second table from the database through the DB API, while everything else stays the same. With the edge_table parameter we will know that we are loading graph data, and the adjacency matrix will be built as well.
I have considered other ways of storing the data, such as keeping everything in one edge table, but it turns out that this takes more space, since we would have to store each node's features alongside its edges. Besides, if we did so, we would need additional functions to figure out whether the data is a graph and decide whether to compute the adjacency matrix. This is quite inefficient and complex.
I think "modifying the logic behind the WITH statement" would make us write code specific to the GCN model.
I suggest we write the statement:
SELECT from_node_id, to_node_id, weight, my_graph_nodes.features FROM my_edge_table LEFT JOIN my_graph_nodes
ON my_edge_table.from_node_id = my_graph_nodes.id
TO TRAIN GCN
WITH ...
LABEL ...
INTO ...
Then in the model implementation, we can write a custom train loop using the sqlflow_train_loop interface to generate the feature matrix and adjacency matrix, then start training. See https://github.com/sql-machine-learning/models/blob/develop/sqlflow_models/deep_embedding_cluster.py#L231 for more information.
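As a rough illustration of this direction (the sqlflow_train_loop interface is assumed from the linked example, and the class name, row format, and NumPy/SciPy usage here are placeholders rather than the actual implementation), the custom loop could rebuild the feature and adjacency matrices from the JOINed rows before training:

```python
import numpy as np
from scipy.sparse import coo_matrix

class GCN:
    """Sketch of a model exposing a custom train loop in the style of
    sqlflow_models' sqlflow_train_loop (interface assumed, may differ)."""

    def sqlflow_train_loop(self, dataset, epochs=1):
        # `dataset` yields rows from the JOINed statement:
        # (from_node_id, to_node_id, weight, features)
        index, feats, edges = {}, [], []
        for from_id, to_id, weight, features in dataset:
            if from_id not in index:        # first time this node appears
                index[from_id] = len(index)
                feats.append(features)
            edges.append((from_id, to_id, weight))
        n = len(index)
        kept = [(s, d, w) for s, d, w in edges if s in index and d in index]
        rows = [index[s] for s, _, _ in kept]
        cols = [index[d] for _, d, _ in kept]
        vals = [w for _, _, w in kept]
        self.features = np.array(feats)                          # (N, D)
        self.adjacency = coo_matrix((vals, (rows, cols)), shape=(n, n))
        # ... the actual GCN training on self.features / self.adjacency ...

# Toy rows mirroring the output of the JOINed SELECT.
dataset = [(105, 106, 1.0, [0, 0, 1]),
           (106, 105, 2.5, [0, 1, 0])]
model = GCN()
model.sqlflow_train_loop(dataset)
```

The point is only that SQLFlow hands the model one row stream, and the model reconstructs both matrices from it inside the custom loop.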
Thanks, this seems to be a better direction to work on. I will take a look at the link and see if it is possible to do so.
Conclusion
We decided to apply @typhoonzero's idea of getting the data (using JOIN on the tables). It is more flexible and won't change SQLFlow's logic much. I will work on this and hope to make a related PR soon. Thanks!
Update
SELECT from_node_id, to_node_id, weight, my_graph_nodes.features FROM my_edge_table LEFT JOIN my_graph_nodes
ON my_edge_table.from_node_id = my_graph_nodes.id
TO TRAIN GCN
WITH ...
LABEL ...
INTO ...
For loading data, this command will miss nodes that only appear as the end node of an edge. Thus, in order to build the bi-directional edges of the graph, we shall use this command instead:
SELECT from_node_id, to_node_id, weight, my_graph_nodes.features FROM my_edge_table INNER JOIN my_graph_nodes
ON (my_edge_table.from_node_id = my_graph_nodes.id OR my_edge_table.to_node_id = my_graph_nodes.id)
TO TRAIN GCN
WITH ...
LABEL ...
INTO ...
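To illustrate the bi-directional point, here is a small sketch (the helper name and row format are assumed, not part of SQLFlow) that inserts each edge in both directions when building the adjacency matrix:

```python
from scipy.sparse import coo_matrix

def build_symmetric_adjacency(edges, node_ids):
    """Build a bi-directional (symmetric) adjacency matrix so that a node
    is reachable whether it is the from-end or the to-end of an edge."""
    index = {nid: i for i, nid in enumerate(sorted(node_ids))}
    rows, cols, vals = [], [], []
    for src, dst, weight in edges:
        rows += [index[src], index[dst]]    # insert (i, j) ...
        cols += [index[dst], index[src]]    # ... and (j, i)
        vals += [weight, weight]
    n = len(index)
    # Note: COO sums duplicate (i, j) entries, so an edge stored in both
    # directions in the table would be double-counted here.
    return coo_matrix((vals, (rows, cols)), shape=(n, n))

edges = [(105, 106, 1.0)]
adj = build_symmetric_adjacency(edges, {105, 106})
```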
Is your feature request related to a problem? Please describe.
Hi, I'm thinking about bringing a new type of training data (graph data) to the SQLFlow database. This is significant since much real-world data is non-Euclidean, such as graphs, and people who use SQLFlow may encounter such data in their tasks. Deep learning models such as GCN and GAT are powerful for solving graph-related problems, and it would be helpful to include them in the library in the future. However, before we bring these models to SQLFlow, it would be convenient to have a pre-loaded graph dataset such as cora in the SQLFlow database so that these models can be trained and tested easily. This is a rough idea, and the following are some of my thoughts on solutions. It would be good if we could discuss it a bit, and any suggestions are appreciated!
Describe the solution you'd like
Part I. Database schema
If we want to use graph-related DL models to solve real-world problems, there are two things that need to be provided: features, which are the information contained within each node of the graph, and the adjacency matrix, which represents the graph structure in the form of a matrix (it can be calculated from an edge list). features should be a 2-D tensor with shape (N, D), where N is the number of nodes and D is the dimension of each node's feature vector. The adjacency matrix is a 2-D sparse tensor with shape (N, N).
Thus, I'm considering having two tables in the database, which would be enough to maintain all the information we need in a graph.
Node Table: store the information of each node in one table.

id  | name  | features             | label
105 | node1 | "0 0 1" or [0, 0, 1] | "L1"
106 | node2 | "0 1 0" or [0, 1, 0] | "L2"

Edge Table:

id | from_node_id | to_node_id | weight
1  | 105          | 106        | 1.0
2  | 106          | 105        | 2.5
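As an illustration of how these two tables could be turned into the model inputs described above (a sketch with toy rows; the real loading code would go through SQLFlow / the DB API):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy rows mirroring the Node Table and Edge Table above.
node_rows = [
    (105, "node1", [0, 0, 1], "L1"),
    (106, "node2", [0, 1, 0], "L2"),
]
edge_rows = [
    (1, 105, 106, 1.0),
    (2, 106, 105, 2.5),
]

index = {row[0]: i for i, row in enumerate(node_rows)}   # node id -> matrix row
features = np.array([row[2] for row in node_rows])       # shape (N, D)
labels = [row[3] for row in node_rows]

rows = [index[e[1]] for e in edge_rows]                  # from_node_id
cols = [index[e[2]] for e in edge_rows]                  # to_node_id
vals = [e[3] for e in edge_rows]                         # weight
adjacency = coo_matrix((vals, (rows, cols)), shape=(len(index), len(index)))
```

The same adjacency could also be built by loading the edges into a NetworkX graph and converting its adjacency matrix, which is the second method discussed below.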
features.shape[0] is the number of nodes (N), and the adjacency matrix adjacency has shape (N, N). The adjacency matrix is a Scipy sparse matrix in the COO format (COO is a fast format for constructing sparse matrices).
NetworkX: use the NetworkX package to generate a graph and then build the adjacency matrix automatically. This process can be done with a short Python script. adjacency is a Scipy sparse matrix; we can convert it to the COO format with adjacency.tocoo(), which gives the same adjacency matrix as above. NetworkX is a brilliant Python library for graph and network analysis, and it has many other useful tools we may use in future development. We could decide which method to apply for loading the data from the database.

Additional Notes
Here are some details about the dataset that I would love to add to the database:
cora dataset: the Cora dataset is a citation network of scientific papers. It consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.
If there is anything that you find valuable or that needs to be improved, please let me know. Thanks!