I think we may just store the id-label map at the end of the train loop as a lookup table and discard everything else; at prediction time, we just get the label from the lookup table. The lookup table may be stored in a file in the save directory, like here. Our framework will automatically store/load the save directory on train/prediction.
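A minimal sketch of what that could look like, assuming the framework hands us the save directory path (the file name and helper names below are only illustrative, not part of SQLFlow's API):

```python
import json
import os

LOOKUP_FILE = "id_label_map.json"  # illustrative file name inside the save directory


def save_lookup_table(id_label_map, save_dir):
    """Persist the node-id -> label map produced at the end of the train loop."""
    with open(os.path.join(save_dir, LOOKUP_FILE), "w") as f:
        json.dump(id_label_map, f)


def lookup_label(node_id, save_dir):
    """Fetch the label of a single node at prediction time."""
    with open(os.path.join(save_dir, LOOKUP_FILE)) as f:
        return json.load(f)[str(node_id)]
```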
Can the super-param `train_num` be `train_ratio`?
> Can the super-param `train_num` be `train_ratio`?
Yes, we could definitely set the hyperparameter to be `train_ratio` (a float), but in this case we won't be able to separate out a validation dataset unless we also specify a parameter `eval_ratio`.
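For reference, a small sketch of how the two ratios could be used to split the node indices (purely illustrative):

```python
import numpy as np


def split_indices(num_nodes, train_ratio, eval_ratio, seed=0):
    """Split node indices into train / eval / test subsets by ratio."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_nodes)
    n_train = int(num_nodes * train_ratio)
    n_eval = int(num_nodes * eval_ratio)
    return idx[:n_train], idx[n_train:n_train + n_eval], idx[n_train + n_eval:]
```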
> During the prediction time (in `sqlflow_predict_one`), it seems that SQLFlow is only allowed to predict one sample from the database, it is quite inelegant if we want to use this function for the graph data. I'm not sure if there exist other ways for predicting batches of data in an efficient way.
`sqlflow_predict_one` only accepts one sample as input because, when we do online prediction, the samples arrive one by one. As you mentioned above, you can write everything under `sqlflow_train_loop` when we set some attributes like `WITH model.predict_select="SELECT * FROM ... LEFT JOIN ..." model.predict_result="db.table"`. These attributes can default to empty, and if they are empty, `sqlflow_train_loop` does training and evaluation only (by the way, the attribute `validation.select` cannot be passed to `sqlflow_train_loop`; I'll fix that later).
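Roughly, the control flow could look like the sketch below; `train_and_evaluate`, `predict_all_nodes`, `run_query` and `write_table` are placeholders for whatever training and data-access code is available, not real SQLFlow APIs:

```python
def sqlflow_train_loop(model, dataset, run_query=None, write_table=None):
    """Sketch only: helpers passed in here stand in for the runtime's own
    data-access layer; the names are made up for illustration."""
    train_and_evaluate(model, dataset)  # the existing training/evaluation code

    # Both attributes default to "", in which case we only train and evaluate.
    if model.predict_select and model.predict_result:
        rows = run_query(model.predict_select)     # "SELECT * FROM ... LEFT JOIN ..."
        labels = predict_all_nodes(model, rows)    # label every node in one pass
        write_table(model.predict_result, labels)  # write into "db.table"
```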
Thanks for pointing this out, this seems to be a promising direction. However, I have a few questions and concerns about this idea:
- When we set attributes like `WITH model.predict_select="SELECT * FROM ... LEFT JOIN ..." model.predict_result="db.table"`, do you mean that we take this SQL command as an attribute of the model? If so, how could we execute it? (Doing so also seems to require loading the entire graph dataset again.)
- Inside the `sqlflow_train_loop` method, saving the id-label pair table as the result for all the nodes seems to be an efficient method (as @lhw362950217 proposed above). But should we store it as an attribute of the model, or should we save it as a database table? If we save it as an attribute of the model, how should we read it during the prediction phase, since SQLFlow does not seem to support such an operation here under the prediction syntax `SELECT ... TO PREDICT ... USING my_saved_model`? If we save it as a DB table, how could we achieve that under `sqlflow_train_loop` (one possibility is sketched below)?

I hope someone could help me with these. Thanks!
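For the DB-table option, one possible shape, assuming a DB-API style connection were somehow available inside the train loop (which is exactly the open question); the table name, schema and the `%s` placeholder style (MySQL-like drivers) are all illustrative:

```python
def write_id_label_table(conn, table_name, id_label_pairs):
    """Write (node_id, label) pairs into a result table via a generic
    DB-API connection."""
    cursor = conn.cursor()
    cursor.execute(
        "CREATE TABLE IF NOT EXISTS {} (node_id VARCHAR(64), label INT)".format(table_name))
    cursor.executemany(
        "INSERT INTO {} (node_id, label) VALUES (%s, %s)".format(table_name),
        id_label_pairs)
    conn.commit()
```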
Hi all, I have written an initial version of GCN and here are some potential problems. I would appreciate it a lot if someone can give me a few suggestions.
**Model design**

- In the Python script of the model, there is a `GCNLayer` class inherited from `tensorflow.keras.layers.Layer`, which can be treated as a standard Keras layer and used to construct the GCN model.
- The `GCN` class is inherited from `tensorflow.keras.Model` and takes the arguments `nhid` (number of hidden units), `nclass` (number of output classes), `epochs` (number of epochs to train), `train_num` (number of data points used for training), `eval_num` (number of data points used for evaluation), `dropout` (dropout rate), and `nlayer` (number of `GCNLayer`s in the model).
- All the default hyperparameters and the architecture are kept the same as in the original paper. The APIs are flexible and users can define the hyperparameters as they wish.
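For reference, here is a stripped-down sketch of the layer/model structure described above (the real implementation has more details; this only shows the shape of the API):

```python
import tensorflow as tf


class GCNLayer(tf.keras.layers.Layer):
    """One graph convolution: H' = activation(A_hat @ H @ W)."""

    def __init__(self, units, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.activation = tf.keras.activations.get(activation)

    def build(self, input_shape):
        # input_shape is [features_shape, adjacency_shape]
        feature_dim = int(input_shape[0][-1])
        self.w = self.add_weight(
            name="w", shape=(feature_dim, self.units), initializer="glorot_uniform")

    def call(self, inputs):
        features, adj = inputs  # adj: normalized adjacency matrix A_hat
        return self.activation(tf.matmul(adj, tf.matmul(features, self.w)))


class GCN(tf.keras.Model):
    """GCN built from `nlayer` GCNLayers, using the hyperparameters listed above."""

    def __init__(self, nhid, nclass, epochs, train_num, eval_num,
                 dropout=0.5, nlayer=2, **kwargs):
        super().__init__(**kwargs)
        self.epochs, self.train_num, self.eval_num = epochs, train_num, eval_num
        self.drop = tf.keras.layers.Dropout(dropout)
        self.hidden = [GCNLayer(nhid, activation="relu") for _ in range(nlayer - 1)]
        self.out_layer = GCNLayer(nclass)

    def call(self, inputs, training=False):
        features, adj = inputs
        h = features
        for layer in self.hidden:
            h = self.drop(layer([h, adj]), training=training)
        return self.out_layer([h, adj])  # logits for every node
```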
**Training and testing**

Since graph data is different from other data types, it is unreasonable to load batches of data for training and testing. Thus, here are some details on training and testing the GCN model:
- We use the `sqlflow_train_loop`, `sqlflow_evaluate_loop` and `sqlflow_predict_one` APIs from SQLFlow to train, evaluate and predict the model. This is slightly different from what we have discussed in #2714, since the order of the nodes we retrieve matters. When the feature dimension is big and the degrees of the nodes are small, it will also save some resources.
- The data is loaded and preprocessed inside the `sqlflow_train_loop` method of the model; after that, the preprocessed `features`, `labels` and `adjacency matrix` will be stored as attributes of the model. In this way, the model can directly evaluate the performance in `sqlflow_evaluate_loop` without loading the data again (see the sketch after this list).
- The `LIMIT` command in SQL for specifying the number of data points to be used is not applicable anymore; we should point this out and make sure users will not use this command during the training process. Instead, users should specify the number of points for training through the `train_num` argument of the model.
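The evaluation hook could look roughly like this sketch; the `cached_*` / `eval_mask` attribute names and the hook signature are illustrative only:

```python
import tensorflow as tf


def sqlflow_evaluate_loop(model, dataset, metric_names):
    """Reuse the tensors cached on the model during training instead of
    reloading the graph dataset."""
    logits = model([model.cached_features, model.cached_adj], training=False)
    preds = tf.argmax(logits, axis=1, output_type=tf.int32)
    correct = tf.equal(tf.boolean_mask(preds, model.eval_mask),
                       tf.boolean_mask(model.cached_labels, model.eval_mask))
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    return {"accuracy": float(accuracy)}
```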
**Potential problems**

However, there are some problems with the current model design:
- During the prediction time (in `sqlflow_predict_one`), it seems that SQLFlow only allows predicting one sample from the database at a time, which is quite inelegant if we want to use this function for the graph data. I'm not sure if there exist other ways of predicting batches of data in an efficient way.
- If we store the `features`, `labels` and `adjacency matrix` as attributes of the model (i.e. store the dataset), the size of the parameters to be saved will be extremely large if the dataset is huge. If we don't save these data, then during the evaluation and test phases this data has to be loaded again.

One possible solution to these problems is to do training, evaluation and prediction all at once inside the `sqlflow_train_loop` method, and store all the data in a table. This works because our graph data is fixed: we use the entire adjacency matrix for training and only update parameters by masking some of the data and labels, so we are able to get the labels of all the nodes in the graph as long as the model is trained to convergence (a rough sketch of the masking idea is at the end of this comment). However, this may require an additional step of loading the saved data in `sqlflow_evaluate_loop` and `sqlflow_predict_one`. I'm not sure how to achieve that, and I would like to hear your advice.

If there are any problems or things to be improved, please feel free to point them out. Many thanks!
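To make the masking idea concrete, here is a rough sketch of the full-graph training step and the final all-node labelling (names and signatures are illustrative only):

```python
import tensorflow as tf


def train_step(model, features, adj, labels, train_mask, optimizer):
    """One full-graph update: forward pass over every node, loss computed only
    on the nodes selected by `train_mask` (a boolean vector)."""
    with tf.GradientTape() as tape:
        logits = model([features, adj], training=True)
        loss = tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(
            tf.boolean_mask(labels, train_mask),
            tf.boolean_mask(logits, train_mask),
            from_logits=True))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss


def label_all_nodes(model, features, adj, node_ids):
    """After convergence, a single forward pass labels every node; these
    (id, label) pairs are what would be written into the result table."""
    logits = model([features, adj], training=False)
    preds = tf.argmax(logits, axis=1).numpy().tolist()
    return list(zip(node_ids, preds))
```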