Closed Derek-Wds closed 4 years ago
Great idea, cora can be our demo GCN dataset~
I think the overall design of data schema is reasonable, I just have some thought about it as follows.
About the Syntax and Implementation
As for node data: SQLFlow supports comma-separated strings; we just need to use SPARSE(features) or DENSE(features) in the feature column clause. So maybe we can store the features column in this format, and then write the train statement like:
SELECT * FROM my_graph_nodes
TO TRAIN GCNModel
COLUMN DENSE(features)
LABEL class
INTO my_gcn_model;
However, SQLFlow can't load two tables at once, so maybe we need to specify the edge table in the WITH clause. That is to say, treat the edge table as a parameter for training. The train logic will read the edge table and build the graph. The final statement will look like:
SELECT * FROM my_graph_nodes
TO TRAIN GCNModel
WITH edge_table='my_edge_table',other_gcn_param='xxx'
COLUMN DENSE(features)
LABEL class
INTO my_gcn_model;
Other things to consider
- How to read edge data when training? (Use our DB API?)
- What are the model's abilities and limitations?
- Can we support distributed training?
- How to save and load the model?
- Is the model explainable?
As we are currently focused on data storage, we can treat these questions as considerations for our model implementation design later.
Nice! @lhw362950217 This seems to be a decent solution for getting the data from the database. Just to make sure I understand correctly: for getting the edge table, we treat it as a training parameter and build the graph before feeding it into the model.
> As we are currently focused on data storage, we can treat these questions as considerations for our model implementation design later.
Yes, these things should definitely be considered when merging the model into SQLFlow. I will do some research on the possibility of training the GCN model in a distributed way, since graphs are quite special and we cannot simply split them into small batches as usual.
@Derek-Wds Yes, we build the graph before training. Users just specify the edge table name. We can get the table-name parameter, and we need to write a piece of code that loads the graph as a pre-training action.
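A minimal sketch of such a pre-training step, assuming the edge rows have already been fetched from the edge table through the DB API (the function name and row format here are illustrative, not SQLFlow's actual API):

```python
import numpy as np
from scipy.sparse import coo_matrix

def build_adjacency(edge_rows, node_ids):
    """Build an (N, N) sparse adjacency matrix from edge-table rows.

    edge_rows: iterable of (from_node_id, to_node_id, weight) tuples,
               e.g. fetched from the edge table via the DB API.
    node_ids:  node ids in the same order as the node-table feature rows.
    """
    index = {nid: i for i, nid in enumerate(node_ids)}  # id -> matrix row
    rows, cols, vals = [], [], []
    for src, dst, weight in edge_rows:
        rows.append(index[src])
        cols.append(index[dst])
        vals.append(weight)
    n = len(node_ids)
    return coo_matrix((vals, (rows, cols)), shape=(n, n))

# Toy data mirroring the proposed node/edge tables.
node_ids = [105, 106]
edges = [(105, 106, 1.0), (106, 105, 2.5)]
adjacency = build_adjacency(edges, node_ids)
```

COO is used here because it is cheap to construct from an edge list and is what the proposal suggests feeding to the model.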
Some advice:
@typhoonzero Thanks for the advice! I will modify the proposal a bit and make everything more clear.
- The proposal should describe the input/output data schema (shape, data type, etc.) of the model.
As I mentioned above, there are two inputs to the model: features and the adjacency matrix. features should be a 2-D tensor with shape (N, D), where N is the number of nodes and D is the dimension of each node's feature vector. The adjacency matrix is a 2-D sparse tensor with shape (N, N) which describes the connections between nodes.
As for the output of the model, this depends on the specific task. If the model is used for classification, the output may be a tensor with shape (B, L), where B is the number of data points and L is the number of possible labels. For other tasks, the output may vary a lot. The GCN will mainly be used for a classification task (predicting the label of each node).
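To make the shapes concrete, here is a minimal NumPy sketch of a single graph-convolution step, H = A·X·W (the names, random values, and the trivial identity adjacency are illustrative only):

```python
import numpy as np

N, D, L = 4, 3, 2                 # nodes, feature dim, number of classes

X = np.random.rand(N, D)          # features, shape (N, D)
A = np.eye(N)                     # adjacency; here only self-loops, shape (N, N)
W = np.random.rand(D, L)          # trainable weights, shape (D, L)

# One graph-convolution step: aggregate neighbor features, then project.
H = A @ X @ W                     # shape (N, L): one score per node per class
```

In a real GCN the adjacency would be the (normalized) sparse matrix built from the edge table rather than an identity matrix.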
For the data in the database, the nodes and their features will be stored in the node table, while the connections between nodes will be stored as an edge list in an edge table. As long as the data can be properly accessed, we can use Python packages to convert it into the correct format for the model input (the two methods introduced above).
- How to deal with a large dataset if the edges can not be loaded into memory.
This is actually a tough problem. Since graphs are different from other data such as images, they are hard to split into small batches for training or testing. One possible solution for handling large graphs is an advanced method named GraphSAGE. Instead of considering the entire graph in a single step, the algorithm focuses on a small local neighborhood to learn the embedding of a given node. This improves the generalization ability of the model and makes the learning inductive instead of transductive. If the graph is too big to be loaded, the GraphSAGE algorithm may be used to make the learning process more efficient. Similar work also exists under the name PinSAGE. (These two algorithms are quite complicated, and I haven't figured out exactly what they are doing.)
We could list this as future work, which I believe is essential if we want to solve real-world problems using GCN.
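The core idea of fixed-size neighborhood sampling can be sketched independently of the full algorithm; the adjacency-list format and function below are illustrative, not GraphSAGE's actual implementation:

```python
import random

def sample_neighbors(adj_list, node, k, seed=0):
    """Uniformly sample up to k neighbors of `node` (with replacement),
    in the spirit of GraphSAGE's fixed-size neighborhood sampling."""
    rng = random.Random(seed)
    neighbors = adj_list.get(node, [])
    if not neighbors:
        return []
    return [rng.choice(neighbors) for _ in range(k)]

# Toy adjacency list: node id -> list of neighbor ids.
adj_list = {105: [106, 107], 106: [105], 107: [105]}
sampled = sample_neighbors(adj_list, 105, k=3)
```

Because each node only ever touches a fixed-size sampled neighborhood, the full graph never needs to fit in memory at once, which is the property that matters for large datasets.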
@Derek-Wds How would you load the train-table data if you use a SQL statement like this? Would you load the data in the model definition?
SELECT * FROM my_graph_nodes
TO TRAIN GCNModel
WITH edge_table='my_edge_table',other_gcn_param='xxx'
COLUMN DENSE(features)
LABEL class
INTO my_gcn_model;
And how do we load the test data at prediction time?
I haven't fully understood how SQLFlow converts these statements into Python, but from what I'm seeing now, the node data can be accessed directly while we could get the edge data using the DB API. The process is similar during the test phase.
The original thought was to load both tables at once, but as @lhw362950217 said, SQLFlow is not capable of such an operation. Thus, this is the most promising solution so far. If other possibilities exist, it would be great to discuss them here.
> from what I'm seeing now, the node data can be directly accessed while we could get the edge data using the DB API
I mean, you would load the edge data using the DB API in the model code? That may not be a nice way.
No, the model code will not contain anything related to data processing. The DB API will be used before we feed the data into the model.
@Derek-Wds SQLFlow only loads the data from the standard SELECT statement before the TO TRAIN syntax. That is to say, in the following SQL statement SQLFlow would only load the data from SELECT * FROM my_graph_nodes. The edge data indicated in WITH edge_table='my_edge_table' would not be loaded automatically.
SELECT * FROM my_graph_nodes
TO TRAIN GCNModel
WITH edge_table='my_edge_table',other_gcn_param='xxx'
COLUMN DENSE(features)
LABEL class
INTO my_gcn_model;
Thanks for pointing it out! I wonder if it is possible to modify the logic behind the WITH statement: when the parameter edge_table comes in, we add some code that selects a second table from the database through the DB API, while everything else stays the same. With the edge_table parameter we will know that we are loading graph data, and the adjacency matrix will be built as well.
I have considered other ways of storing the data, such as keeping everything in one edge table, but it turns out that this takes more space, since we would have to store each node's features alongside its edges. Besides, if we did so, we would need additional functions to figure out whether the data is a graph and decide whether to compute the adjacency matrix. This is quite inefficient and complex.
I think "modifying the logic behind the WITH statement" would make us write code specific to the GCN model.
I suggest we write the statement:
SELECT from_node_id, to_node_id, weight, my_graph_nodes.features FROM my_edge_table LEFT JOIN my_graph_nodes
ON my_edge_table.from_node_id = my_graph_nodes.id
TO TRAIN GCN
WITH ...
LABEL ...
INTO ...
Then in the model implementation, we can write a custom train loop using the sqlflow_train_loop interface to generate the feature matrix and adjacency matrix, then start training. See https://github.com/sql-machine-learning/models/blob/develop/sqlflow_models/deep_embedding_cluster.py#L231 for more information.
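As a rough illustration of this direction (the sqlflow_train_loop interface is assumed from the linked example, and the class name, row format, and NumPy/SciPy usage here are placeholders rather than the actual implementation), the custom loop could rebuild the feature and adjacency matrices from the JOINed rows before training:

```python
import numpy as np
from scipy.sparse import coo_matrix

class GCN:
    """Sketch of a model exposing a custom train loop in the style of
    sqlflow_models' sqlflow_train_loop (interface assumed, may differ)."""

    def sqlflow_train_loop(self, dataset, epochs=1):
        # `dataset` yields rows from the JOINed statement:
        # (from_node_id, to_node_id, weight, features)
        index, feats, edges = {}, [], []
        for from_id, to_id, weight, features in dataset:
            if from_id not in index:        # first time this node appears
                index[from_id] = len(index)
                feats.append(features)
            edges.append((from_id, to_id, weight))
        n = len(index)
        kept = [(s, d, w) for s, d, w in edges if s in index and d in index]
        rows = [index[s] for s, _, _ in kept]
        cols = [index[d] for _, d, _ in kept]
        vals = [w for _, _, w in kept]
        self.features = np.array(feats)                          # (N, D)
        self.adjacency = coo_matrix((vals, (rows, cols)), shape=(n, n))
        # ... the actual GCN training on self.features / self.adjacency ...

# Toy rows mirroring the output of the JOINed SELECT.
dataset = [(105, 106, 1.0, [0, 0, 1]),
           (106, 105, 2.5, [0, 1, 0])]
model = GCN()
model.sqlflow_train_loop(dataset)
```

The point is only that SQLFlow hands the model one row stream, and the model reconstructs both matrices from it inside the custom loop.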
Thanks, this seems to be a better direction to work on. I will take a look at the link and see if it is possible to do so.
Conclusion
We decided to apply @typhoonzero's idea of getting the data (using JOIN on the tables). It is more flexible and won't change SQLFlow's logic much. I will work on this and hope to make a related PR soon. Thanks!
Update
SELECT from_node_id, to_node_id, weight, my_graph_nodes.features FROM my_edge_table LEFT JOIN my_graph_nodes
ON my_edge_table.from_node_id = my_graph_nodes.id
TO TRAIN GCN
WITH ...
LABEL ...
INTO ...
For loading data, this command will miss nodes that only appear as the end node of an edge. Thus, in order to build the bi-directional edges of the graph, we shall use this command instead:
SELECT from_node_id, to_node_id, weight, my_graph_nodes.features FROM my_edge_table INNER JOIN my_graph_nodes
ON (my_edge_table.from_node_id = my_graph_nodes.id OR my_edge_table.to_node_id = my_graph_nodes.id)
TO TRAIN GCN
WITH ...
LABEL ...
INTO ...
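To illustrate the bi-directional point, here is a small sketch (the helper name and row format are assumed, not part of SQLFlow) that inserts each edge in both directions when building the adjacency matrix:

```python
from scipy.sparse import coo_matrix

def build_symmetric_adjacency(edges, node_ids):
    """Build a bi-directional (symmetric) adjacency matrix so that a node
    is reachable whether it is the from-end or the to-end of an edge."""
    index = {nid: i for i, nid in enumerate(sorted(node_ids))}
    rows, cols, vals = [], [], []
    for src, dst, weight in edges:
        rows += [index[src], index[dst]]    # insert (i, j) ...
        cols += [index[dst], index[src]]    # ... and (j, i)
        vals += [weight, weight]
    n = len(index)
    # Note: COO sums duplicate (i, j) entries, so an edge stored in both
    # directions in the table would be double-counted here.
    return coo_matrix((vals, (rows, cols)), shape=(n, n))

edges = [(105, 106, 1.0)]
adj = build_symmetric_adjacency(edges, {105, 106})
```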
Is your feature request related to a problem? Please describe.
Hi, I'm thinking about bringing a new type of training data (graph data) to the SQLFlow database. This is significant since much real-world data is non-Euclidean, such as graphs, and people who use SQLFlow may encounter such data in their tasks. Deep learning models such as GCN and GAT are powerful for solving graph-related problems, and it would be helpful to include them in the library in the future. However, before we bring these models to SQLFlow, it would be convenient to have a pre-loaded graph dataset such as cora in the SQLFlow database so that these models can be trained and tested easily. This is a rough idea, and the following are some of my thoughts on solutions. It would be good if we could discuss it a bit, and any suggestions are appreciated!
Describe the solution you'd like
Part I. Database schema
If we want to use graph-related DL models to solve real-world problems, there are two things that need to be provided: features, which are the information contained within each node of the graph, and the adjacency matrix, which represents the graph structure in the form of a matrix (it can be calculated from an edge list). features should be a 2-D tensor with shape (N, D), where N is the number of nodes and D is the dimension of each node's feature vector. The adjacency matrix is a 2-D sparse tensor with shape (N, N).
Thus, I'm considering having two tables in the database, which would be enough to maintain all the information we need in a graph.
Node Table: store the information of each node in one table.

id  | name  | features             | label
105 | node1 | "0 0 1" or [0, 0, 1] | "L1"
106 | node2 | "0 1 0" or [0, 1, 0] | "L2"

Edge Table:

id | from_node_id | to_node_id | weight
1  | 105          | 106        | 1.0
2  | 106          | 105        | 2.5
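As an illustration of how these two tables could be turned into the model inputs described above (a sketch with toy rows; the real loading code would go through SQLFlow / the DB API):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy rows mirroring the Node Table and Edge Table above.
node_rows = [
    (105, "node1", [0, 0, 1], "L1"),
    (106, "node2", [0, 1, 0], "L2"),
]
edge_rows = [
    (1, 105, 106, 1.0),
    (2, 106, 105, 2.5),
]

index = {row[0]: i for i, row in enumerate(node_rows)}   # node id -> matrix row
features = np.array([row[2] for row in node_rows])       # shape (N, D)
labels = [row[3] for row in node_rows]

rows = [index[e[1]] for e in edge_rows]                  # from_node_id
cols = [index[e[2]] for e in edge_rows]                  # to_node_id
vals = [e[3] for e in edge_rows]                         # weight
adjacency = coo_matrix((vals, (rows, cols)), shape=(len(index), len(index)))
```

The same adjacency could also be built by loading the edges into a NetworkX graph and converting its adjacency matrix, which is the second method discussed below.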
features.shape[0] is the number of nodes (N), and the adjacency matrix adjacency has shape (N, N). The adjacency matrix is a Scipy sparse matrix in the COO format (COO is a fast format for constructing sparse matrices).
NetworkX: use the NetworkX package to generate a graph and then build the adjacency matrix automatically. This process can be done with a short Python script. adjacency is a Scipy sparse matrix; we can convert it to the COO format with adjacency.tocoo(), which gives the same adjacency matrix as above. NetworkX is a brilliant Python library for graph and network analysis, and it has many other useful tools we may use in future development. We could decide which method to apply for loading the data from the database.

Additional Notes
Here are some details about the dataset that I would love to add to the database:
cora dataset: the Cora dataset is a citation network of scientific papers. It consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.
If there is anything that you find valuable or that needs to be improved, please let me know. Thanks!