I think we may just store the id-label map at the end of the train loop as a lookup table and discard everything else; at prediction time, we just get the label from the lookup table. The lookup table may be stored in a file in the save directory, like here. Our framework will automatically store/load the save directory on train/prediction.
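A minimal sketch of what that could look like, assuming the framework hands us the save directory path (the file name and helper names below are only illustrative, not part of SQLFlow's API):

```python
import json
import os

LOOKUP_FILE = "id_label_map.json"  # illustrative file name inside the save directory


def save_lookup_table(id_label_map, save_dir):
    """Persist the node-id -> label map produced at the end of the train loop."""
    with open(os.path.join(save_dir, LOOKUP_FILE), "w") as f:
        json.dump(id_label_map, f)


def lookup_label(node_id, save_dir):
    """Fetch the label of a single node at prediction time."""
    with open(os.path.join(save_dir, LOOKUP_FILE)) as f:
        return json.load(f)[str(node_id)]
```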
Can the super-param `train_num` be `train_ratio`?
> Can the super-param `train_num` be `train_ratio`?
Yes, we could definitely set the hyperparameter to be `train_ratio` (a float), but in this case we won't be able to separate out a validation dataset unless we also specify a parameter `eval_ratio`.
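For reference, a small sketch of how the two ratios could be used to split the node indices (purely illustrative):

```python
import numpy as np


def split_indices(num_nodes, train_ratio, eval_ratio, seed=0):
    """Split node indices into train / eval / test subsets by ratio."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_nodes)
    n_train = int(num_nodes * train_ratio)
    n_eval = int(num_nodes * eval_ratio)
    return idx[:n_train], idx[n_train:n_train + n_eval], idx[n_train + n_eval:]
```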
> During the prediction time (in `sqlflow_predict_one`), it seems that SQLFlow is only allowed to predict one sample from the database, it is quite inelegant if we want to use this function for the graph data. I'm not sure if there exist other ways for predicting batches of data in an efficient way.
`sqlflow_predict_one` only accepts one sample as input because, when we do online prediction, the samples arrive one by one. As you mentioned above, you can write everything under `sqlflow_train_loop` when we set some attributes like `WITH model.predict_select="SELECT * FROM ... LEFT JOIN ..." model.predict_result="db.table"`. These attributes can default to empty, and if they are empty, `sqlflow_train_loop` does training and evaluation only (by the way, the attribute `validation.select` cannot be passed to `sqlflow_train_loop`; I'll fix that later).
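Roughly, the control flow could look like the sketch below; `train_and_evaluate`, `predict_all_nodes`, `run_query` and `write_table` are placeholders for whatever training and data-access code is available, not real SQLFlow APIs:

```python
def sqlflow_train_loop(model, dataset, run_query=None, write_table=None):
    """Sketch only: helpers passed in here stand in for the runtime's own
    data-access layer; the names are made up for illustration."""
    train_and_evaluate(model, dataset)  # the existing training/evaluation code

    # Both attributes default to "", in which case we only train and evaluate.
    if model.predict_select and model.predict_result:
        rows = run_query(model.predict_select)     # "SELECT * FROM ... LEFT JOIN ..."
        labels = predict_all_nodes(model, rows)    # label every node in one pass
        write_table(model.predict_result, labels)  # write into "db.table"
```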
Thanks for pointing this out, this seems to be a promising direction. However, I have a few questions and concerns about this idea:
- When we set attributes like `WITH model.predict_select="SELECT * FROM ... LEFT JOIN ..." model.predict_result="db.table"`, do you mean that we take this SQL command as an attribute of the model? If so, how could we execute it? (Doing so also seems to require loading the entire graph dataset again.)
- Inside the `sqlflow_train_loop` method, saving the id-label pair table as the result for all the nodes seems to be an efficient method (as @lhw362950217 proposed above). But should we store it as an attribute of the model, or should we save it as a database table? If we save it as an attribute of the model, how should we read it during the prediction phase, since SQLFlow does not seem to support such an operation here under the prediction syntax `SELECT ... TO PREDICT ... USING my_saved_model`? If we save it as a DB table, how could we achieve that under `sqlflow_train_loop` (one possibility is sketched below)?

I hope someone could help me with these. Thanks!
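For the DB-table option, one possible shape, assuming a DB-API style connection were somehow available inside the train loop (which is exactly the open question); the table name, schema and the `%s` placeholder style (MySQL-like drivers) are all illustrative:

```python
def write_id_label_table(conn, table_name, id_label_pairs):
    """Write (node_id, label) pairs into a result table via a generic
    DB-API connection."""
    cursor = conn.cursor()
    cursor.execute(
        "CREATE TABLE IF NOT EXISTS {} (node_id VARCHAR(64), label INT)".format(table_name))
    cursor.executemany(
        "INSERT INTO {} (node_id, label) VALUES (%s, %s)".format(table_name),
        id_label_pairs)
    conn.commit()
```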
Hi all, I have written an initial version of GCN and here are some potential problems. I would appreciate it a lot if someone can give me a few suggestions.
**Model design**

- In the Python script of the model, there is a `GCNLayer` class inherited from `tensorflow.keras.layers.Layer`, which can be treated as a standard Keras layer and used to construct the GCN model.
- The `GCN` class is inherited from `tensorflow.keras.Model` and takes the arguments `nhid` (number of hidden units), `nclass` (number of output classes), `epochs` (number of epochs to train), `train_num` (number of data points used for training), `eval_num` (number of data points used for evaluation), `dropout` (dropout rate), and `nlayer` (number of `GCNLayer`s in the model).
- All the default hyperparameters and the architecture are kept the same as in the original paper. The APIs are flexible and users can define the hyperparameters as they wish.
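For reference, here is a stripped-down sketch of the layer/model structure described above (the real implementation has more details; this only shows the shape of the API):

```python
import tensorflow as tf


class GCNLayer(tf.keras.layers.Layer):
    """One graph convolution: H' = activation(A_hat @ H @ W)."""

    def __init__(self, units, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.activation = tf.keras.activations.get(activation)

    def build(self, input_shape):
        # input_shape is [features_shape, adjacency_shape]
        feature_dim = int(input_shape[0][-1])
        self.w = self.add_weight(
            name="w", shape=(feature_dim, self.units), initializer="glorot_uniform")

    def call(self, inputs):
        features, adj = inputs  # adj: normalized adjacency matrix A_hat
        return self.activation(tf.matmul(adj, tf.matmul(features, self.w)))


class GCN(tf.keras.Model):
    """GCN built from `nlayer` GCNLayers, using the hyperparameters listed above."""

    def __init__(self, nhid, nclass, epochs, train_num, eval_num,
                 dropout=0.5, nlayer=2, **kwargs):
        super().__init__(**kwargs)
        self.epochs, self.train_num, self.eval_num = epochs, train_num, eval_num
        self.drop = tf.keras.layers.Dropout(dropout)
        self.hidden = [GCNLayer(nhid, activation="relu") for _ in range(nlayer - 1)]
        self.out_layer = GCNLayer(nclass)

    def call(self, inputs, training=False):
        features, adj = inputs
        h = features
        for layer in self.hidden:
            h = self.drop(layer([h, adj]), training=training)
        return self.out_layer([h, adj])  # logits for every node
```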
**Training and testing**

Since graph data is different from other data types, it is unreasonable to load batches of data for training and testing. Thus, here are some details on training and testing the GCN model:
- We use the `sqlflow_train_loop`, `sqlflow_evaluate_loop` and `sqlflow_predict_one` APIs from SQLFlow to train, evaluate and predict the model. This is slightly different from what we have discussed in #2714, since the order of the nodes we retrieve matters. When the feature dimension is big and the degrees of the nodes are small, it will also save some resources.
- The data is loaded and preprocessed inside the `sqlflow_train_loop` method of the model; after that, the preprocessed `features`, `labels` and `adjacency matrix` will be stored as attributes of the model. In this way, the model can directly evaluate the performance in `sqlflow_evaluate_loop` without loading the data again (see the sketch after this list).
- The `LIMIT` command in SQL for specifying the number of data points to be used is not applicable anymore; we should point this out and make sure users will not use this command during the training process. Instead, users should specify the number of points for training through the `train_num` argument of the model.
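The evaluation hook could look roughly like this sketch; the `cached_*` / `eval_mask` attribute names and the hook signature are illustrative only:

```python
import tensorflow as tf


def sqlflow_evaluate_loop(model, dataset, metric_names):
    """Reuse the tensors cached on the model during training instead of
    reloading the graph dataset."""
    logits = model([model.cached_features, model.cached_adj], training=False)
    preds = tf.argmax(logits, axis=1, output_type=tf.int32)
    correct = tf.equal(tf.boolean_mask(preds, model.eval_mask),
                       tf.boolean_mask(model.cached_labels, model.eval_mask))
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    return {"accuracy": float(accuracy)}
```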
**Potential problems**

However, there are some problems with the current model design:
- During the prediction time (in `sqlflow_predict_one`), it seems that SQLFlow only allows predicting one sample from the database at a time, which is quite inelegant if we want to use this function for the graph data. I'm not sure if there exist other ways of predicting batches of data in an efficient way.
- If we store the `features`, `labels` and `adjacency matrix` as attributes of the model (i.e. store the dataset), the size of the parameters to be saved will be extremely large if the dataset is huge. If we don't save these data, then during the evaluation and test phases this data has to be loaded again.

One possible solution to these problems is to do training, evaluation and prediction all at once inside the `sqlflow_train_loop` method, and store all the data in a table. This works because our graph data is fixed: we use the entire adjacency matrix for training and only update parameters by masking some of the data and labels, so we are able to get the labels of all the nodes in the graph as long as the model is trained to convergence (a rough sketch of the masking idea is at the end of this comment). However, this may require an additional step of loading the saved data in `sqlflow_evaluate_loop` and `sqlflow_predict_one`. I'm not sure how to achieve that, and I would like to hear your advice.

If there are any problems or things to be improved, please feel free to point them out. Many thanks!
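To make the masking idea concrete, here is a rough sketch of the full-graph training step and the final all-node labelling (names and signatures are illustrative only):

```python
import tensorflow as tf


def train_step(model, features, adj, labels, train_mask, optimizer):
    """One full-graph update: forward pass over every node, loss computed only
    on the nodes selected by `train_mask` (a boolean vector)."""
    with tf.GradientTape() as tape:
        logits = model([features, adj], training=True)
        loss = tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(
            tf.boolean_mask(labels, train_mask),
            tf.boolean_mask(logits, train_mask),
            from_logits=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss


def label_all_nodes(model, features, adj, node_ids):
    """After convergence, a single forward pass labels every node; these
    (id, label) pairs are what would be written into the result table."""
    logits = model([features, adj], training=False)
    preds = tf.argmax(logits, axis=1).numpy().tolist()
    return list(zip(node_ids, preds))
```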