sql-machine-learning / models

Premade models for SQLFlow
Apache License 2.0

Customized model training and prediction for GCN #78

Closed Derek-Wds closed 4 years ago

Derek-Wds commented 4 years ago

Hi all, I have written an initial version of GCN, and here are some potential problems. I would appreciate it a lot if someone could give me a few suggestions.

Model design

In the Python script of the model, there is a GCNLayer class that inherits from tensorflow.keras.layers.Layer; it can be treated as a standard Keras layer and used to construct the GCN model.

The GCN class inherits from tensorflow.keras.Model and takes the following arguments: nhid (number of hidden units), nclass (number of output classes), epochs (number of training epochs), train_num (number of samples used for training), eval_num (number of samples used for evaluation), dropout (dropout rate), and nlayer (number of GCNLayer layers in the model).

All default hyperparameters and the architecture are kept the same as in the original paper. The APIs are flexible, and users can set the hyperparameters as they wish.
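For reference, a minimal sketch of this structure (the layer math follows the original paper; the default values and other details here are illustrative assumptions rather than the exact code):

```python
import tensorflow as tf


class GCNLayer(tf.keras.layers.Layer):
    """One graph convolution: H' = activation(A_hat @ H @ W)."""

    def __init__(self, units, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.activation = tf.keras.activations.get(activation)

    def build(self, input_shape):
        # Inputs arrive as (features, adjacency); W maps feature_dim -> units.
        feature_dim = input_shape[0][-1]
        self.w = self.add_weight(
            name="w", shape=(feature_dim, self.units),
            initializer="glorot_uniform")

    def call(self, inputs):
        features, adj = inputs  # adj is the normalized adjacency matrix A_hat
        return self.activation(tf.matmul(adj, tf.matmul(features, self.w)))


class GCN(tf.keras.Model):
    def __init__(self, nhid=16, nclass=7, epochs=200, train_num=140,
                 eval_num=500, dropout=0.5, nlayer=2, **kwargs):
        super().__init__(**kwargs)
        self.epochs, self.train_num, self.eval_num = epochs, train_num, eval_num
        self.drop = tf.keras.layers.Dropout(dropout)
        # nlayer - 1 hidden graph convolutions followed by one output layer.
        self.hidden = [GCNLayer(nhid, activation="relu") for _ in range(nlayer - 1)]
        self.out = GCNLayer(nclass)

    def call(self, inputs, training=False):
        features, adj = inputs
        h = features
        for layer in self.hidden:
            h = self.drop(layer((h, adj)), training=training)
        return tf.nn.softmax(self.out((h, adj)))
```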

Training and testing

Since graph data is different from other data types, it is unreasonable to load batches of data for training and testing. Thus, here are some details of the training and testing of the GCN model:

Potential problems

However, there are some problems with the current model design:

One possible solution is to do training, evaluation, and prediction all at once inside the sqlflow_train_loop method and store all the results in a table. This works because our graph data is fixed: we use the entire adjacency matrix for training and only update the parameters by masking part of the data and labels, so we can obtain labels for all nodes in the graph once the model has converged.

However, this may require an additional step of loading the saved data in sqlflow_evaluate_loop and sqlflow_predict_one. I'm not sure how to achieve this, and I would like to hear your advice.
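To make this concrete, a rough sketch of what sqlflow_train_loop could look like under this design (how the whole graph and the train/eval masks are obtained, and the write_result_table helper, are assumptions rather than actual SQLFlow APIs):

```python
import tensorflow as tf

def sqlflow_train_loop(self, dataset):
    """Train, evaluate, and predict in one pass over the fixed graph.

    Assumed: the whole graph can be materialized at once, and boolean masks
    self.train_mask / self.eval_mask were built from train_num / eval_num.
    """
    features, adj, labels = load_whole_graph(dataset)  # hypothetical helper
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

    for epoch in range(self.epochs):
        with tf.GradientTape() as tape:
            preds = self((features, adj), training=True)
            # The full adjacency matrix is used in the forward pass, but the
            # loss only sees the masked training nodes and labels.
            loss = loss_fn(tf.boolean_mask(labels, self.train_mask),
                           tf.boolean_mask(preds, self.train_mask))
        grads = tape.gradient(loss, self.trainable_variables)
        optimizer.apply_gradients(zip(grads, self.trainable_variables))

        eval_preds = tf.argmax(tf.boolean_mask(preds, self.eval_mask), axis=-1)
        eval_labels = tf.boolean_mask(labels, self.eval_mask)
        acc = tf.reduce_mean(tf.cast(
            tf.equal(eval_preds, tf.cast(eval_labels, eval_preds.dtype)), tf.float32))
        print(f"epoch {epoch}: loss={float(loss):.4f} val_acc={float(acc):.4f}")

    # After convergence we already have a prediction for every node, so
    # store the full id -> label map for later evaluation/prediction.
    all_labels = tf.argmax(self((features, adj), training=False), axis=-1)
    write_result_table(dataset, all_labels)  # hypothetical helper
```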

If there are any problems or things to be improved, please feel free to point them out. Many thanks!

lhw362950217 commented 4 years ago
  1. I think we can just store the id-to-label map at the end of the train loop as a lookup table and discard everything else; on prediction, we simply read the label from the lookup table. The lookup table may be stored in a file in the save directory, like here. Our framework will automatically store/load the save directory on training/prediction. (A sketch of this idea follows after this list.)

  2. Can the hyperparameter train_num be train_ratio instead?
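A rough sketch of the lookup-table idea in item 1 (the file name label_lookup.json, the extract_node_id helper, and exposing the save directory to the model as self.save_dir are assumptions):

```python
import json
import os

def save_label_lookup(save_dir, node_ids, predicted_labels):
    """Store the id -> label map produced at the end of sqlflow_train_loop."""
    lookup = {str(i): int(l) for i, l in zip(node_ids, predicted_labels)}
    with open(os.path.join(save_dir, "label_lookup.json"), "w") as f:
        json.dump(lookup, f)

def load_label_lookup(save_dir):
    with open(os.path.join(save_dir, "label_lookup.json")) as f:
        return json.load(f)

def sqlflow_predict_one(self, sample):
    """Prediction becomes a lookup of the label computed during training."""
    lookup = load_label_lookup(self.save_dir)  # save_dir: assumed attribute
    node_id = str(extract_node_id(sample))     # hypothetical helper
    return lookup[node_id]
```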

Derek-Wds commented 4 years ago
> Can the hyperparameter train_num be train_ratio instead?

Yes, we could definitely make this hyperparameter a train_ratio (float), but in that case we won't be able to separate out a validation dataset unless we also specify an eval_ratio parameter.
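For example, a minimal sketch of building the masks from both ratios (the split order and the default values are assumptions):

```python
import numpy as np

def build_masks(num_nodes, train_ratio=0.6, eval_ratio=0.2):
    """Use the first train_ratio of nodes for training and the next
    eval_ratio for validation; the rest is left for testing."""
    train_num = int(num_nodes * train_ratio)
    eval_num = int(num_nodes * eval_ratio)
    train_mask = np.zeros(num_nodes, dtype=bool)
    eval_mask = np.zeros(num_nodes, dtype=bool)
    train_mask[:train_num] = True
    eval_mask[train_num:train_num + eval_num] = True
    return train_mask, eval_mask
```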

typhoonzero commented 4 years ago

> During prediction (in sqlflow_predict_one), it seems that SQLFlow only allows predicting one sample from the database at a time, which is quite inelegant if we want to use this function for graph data. I'm not sure whether there are other ways to predict batches of data efficiently.

sqlflow_predict_one only accepts one sample as input because, for online prediction, samples arrive one by one. As you mentioned above, you can put everything under sqlflow_train_loop when we set attributes like WITH model.predict_select="SELECT * FROM ... LEFT JOIN ..." and model.predict_result="db.table". These attributes can default to empty; if they are empty, sqlflow_train_loop does training and evaluation only (by the way, the attribute validation.select cannot be passed to sqlflow_train_loop yet; I'll fix that later).
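A sketch of that control flow inside sqlflow_train_loop (how the WITH attributes reach the model and the run_select / write_table helpers are assumptions, not existing SQLFlow APIs):

```python
def sqlflow_train_loop(self, dataset):
    # ... training and evaluation as before ...

    # Only run whole-graph prediction when the WITH attributes were given,
    # e.g. model.predict_select="SELECT * FROM ... LEFT JOIN ..."
    # and model.predict_result="db.table".
    if self.predict_select and self.predict_result:        # assumed attributes
        graph = run_select(self.predict_select)             # hypothetical helper
        preds = self((graph.features, graph.adjacency), training=False)
        write_table(self.predict_result, graph.node_ids, preds)  # hypothetical
    # With both attributes left at their empty defaults, sqlflow_train_loop
    # does training and evaluation only.
```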

Derek-Wds commented 4 years ago

Thanks for pointing this out; this seems to be a promising direction. However, I have a few questions and concerns about this idea:

I hope someone could help me with these. Thanks!