chunyang-wen opened 5 years ago
I agree that those arguments are neither ideal nor scalable. I remember they were suggested by the high-level API design here so that users can configure different models, inputs, optimizers, etc. cc @wangkuiyi @ywskycn @skydoorkai
I am trying to understand https://github.com/wangkuiyi/elasticdl/issues/1006#issue-478372383. The following is my understanding.
Currently, the function `elasticdl.train` requires the following parameters:

- `model_def`: the model definition, as a `tf.keras.Model`-derived class or a function that returns the input and output layers,
- `optimizer`: a function that returns a TensorFlow optimizer,
- `loss`: a function that returns the loss value given a minibatch of outputs and ground truth,
- `eval_metrics`: a function that returns the evaluation metrics, as a dict from metric name to metric tensor (see `eval_metrics_fn` below), and
- `dataset`: a function that creates a `tf.data.Dataset` instance.

The above bullets form the settings of a training job. This inspires us to give them a single name, so we can pass them as a single parameter to `elasticdl.train`.
We can define them as classes and/or functions in a module and refer to the bullets by the module name. However, in some cases we might need to define variables to share or exchange information between these bullets. With a module, we would have to define these variables as global variables, and global variables are often problematic.

An alternative is to define the bullets as nested classes and/or methods of a class, so we can define the shared variables as class members. We prefer this way.
To better describe the idea, let us look at an example. Suppose that we want to train a model defined as a `tf.keras.Model`-derived class `EasyAndHappy` using the MNIST dataset. We can define the following class:

```python
class EasyAndHappyTrainerWithMNIST(elasticdl.Training):
    class EasyAndHappy(tf.keras.Model):
        def __init__(self):
            super().__init__()
            create_some_parameters_here()

        def call(self, inputs):
            return do_something_here(inputs)

    def dataset(self):
        train, test = tf.keras.datasets.mnist.load_data()
        return train

    def cost(self, outputs, labels):
        return some_cost(outputs, labels)
```
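Training could then be launched with something like `elasticdl.train(EasyAndHappyTrainerWithMNIST)`; the exact call signature here is an assumption, but the point is that the whole job configuration travels as one object.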
@wangkuiyi Just a nitpick here: `dataset()` should probably be separated from the model.
Is it necessary to create a separate `Spec` for ElasticDL? The short answer is yes.
First, let's review the example in ElasticDL's model zoo that we mentioned during the last meeting: `model_zoo/deepfm_functional_api/deepfm_functional_api.py`. In that module, we need to create a global variable named `AUC_metric` in order to share information between our model and our metric function; the metric function uses `AUC_metric` to calculate AUC. Global variables are usually a sign of bad design.
`AUC_metric` is an instance of `tf.keras.metrics.AUC`, which has three member functions that manipulate its data:

- `update_state`: update its internal state (the variables that this metric creates),
- `reset_states`: reset its internal state, and
- `result`: get the final result of this metric.

After we create an instance of `tf.keras.metrics.AUC`, we continuously update its state and call `result()` to get the final result. So we have to reuse the same instance throughout a single evaluation (a complete pass over the evaluation dataset).
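For reference, a minimal sketch of how such a stateful metric accumulates across batches (the data here is made up):

```python
import tensorflow as tf

auc = tf.keras.metrics.AUC()

# The same instance must be reused across the whole evaluation pass.
for labels, predictions in [([0, 1, 1], [0.2, 0.8, 0.6]),
                            ([1, 0], [0.7, 0.3])]:
    auc.update_state(labels, predictions)

print(auc.result().numpy())  # AUC over all batches seen so far
auc.reset_states()           # start a fresh evaluation
```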
We create the metrics-computing function via something like `create_metrics_fn`. That function uses metrics such as `tf.keras.metrics.AUC`, and ElasticDL keeps calling it inside `evaluation_process` whenever a worker receives evaluation tasks. Two potential problems can happen in the `evaluation_process` function (see the sketch after this list):

- If we decorate `evaluation_process` with `tf.function`, it raises an error, because we cannot create variables in the same function across multiple calls (see the related TensorFlow issue).
- If we do not decorate `evaluation_process` with `tf.function`, each call creates new variables, which do not accumulate the data, so the final result is incorrect.
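A minimal sketch of the dilemma (the metric choice and function names are illustrative):

```python
import tensorflow as tf

@tf.function
def evaluation_step(labels, predictions):
    # Each call would create a fresh metric, i.e. new variables.
    # tf.function forbids creating variables on non-first calls, so this
    # typically fails with "tf.function-decorated function tried to
    # create variables on non-first call."
    auc = tf.keras.metrics.AUC()
    auc.update_state(labels, predictions)
    return auc.result()

def evaluation_step_eager(labels, predictions):
    # Without tf.function this runs, but a new metric instance is
    # created per call, so state never accumulates across batches.
    auc = tf.keras.metrics.AUC()
    auc.update_state(labels, predictions)
    return auc.result()
```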
Take product recommendation as another example. Users initialize a `weight` vector for different product types, and they also want to calculate a weighted loss according to this weight vector. That means we have to share `weight` between the `loss` and `create_model` functions. If we do not introduce a new `Spec` class, we again have to use a global variable to share this weight vector.
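For instance (a hypothetical sketch; the weight values and loss form are made up), without a shared class the weight vector has to live at module level:

```python
import tensorflow as tf

# Module-level global: the only way for free functions in the same
# module (create_model, loss, ...) to share the per-product-type weights.
weight = tf.constant([1.0, 2.0, 0.5])

def loss(labels, predictions):
    per_example = tf.keras.losses.sparse_categorical_crossentropy(
        labels, predictions)
    # Weight each example by its product type (the label index).
    return tf.reduce_mean(per_example * tf.gather(weight, labels))
```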
**Notice**
The current model evaluation process is not correct for metrics that cannot be averaged. For AUC, each worker calculates AUC based on the tasks it receives and reports it; the master then collects those per-worker values and averages them, which is not correct (the sketch below illustrates this). If we still use `tf.keras.metrics.AUC` to calculate the AUC metric, we have to update evaluation as follows:

- Calculate `tf.keras.metrics.AUC` once over all the data, so that we do not have to aggregate per-worker results.
- Create one `tf.keras.metrics.AUC` at the beginning and call `update_state` as predictions and labels arrive (it does not need all the predictions and labels to start the calculation). When we finish iterating over the evaluation data, we get the final result through `result()`.
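A small sketch of why averaging per-worker AUCs is wrong (the data is made up; Keras AUC is threshold-based, so values are approximate):

```python
import tensorflow as tf

labels = [0, 1, 0, 1]
preds = [0.1, 0.4, 0.5, 0.9]

# Global AUC over all data: approximately 0.75.
m = tf.keras.metrics.AUC()
m.update_state(labels, preds)
print(m.result().numpy())

# Two "workers", each seeing half the data: each local AUC is 1.0,
# so the averaged value is 1.0, which differs from the global 0.75.
for sl in (slice(0, 2), slice(2, 4)):
    m = tf.keras.metrics.AUC()
    m.update_state(labels[sl], preds[sl])
    print(m.result().numpy())
```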
If we agree on the previous update, then for the case of `tf.keras.metrics.AUC` we do not need `Spec`. But there are other situations in which introducing a `Spec` class makes it easy to share variables without using global variables.
Currently, in order to train models, an ElasticDL training worker requires the following command-line parameters:

- `model_def`: a `tf.keras.Model` subclass such as `mnist_module.MnistModel`, or a function that returns a `tf.keras.Model` instance such as `mnist_module.create_model`.
- `optimizer`: a function that returns a TensorFlow optimizer such as `tf.train.GradientDescentOptimizer`.
- `loss`: a function that returns a tensor which represents the loss.
- `eval_metrics_fn`: a function that returns a dict; each key is a metric's name and the value is the metric's corresponding tensor.
- `dataset_fn`: a function that returns a `tf.data.Dataset`, which is the input of training.

Those parameters are names of functions in the module where we define `model_def`. ElasticDL uses those functions in different stages of training or evaluation. Defining them as module functions has the limitations discussed above. As a result, we prefer a unified class to encapsulate them. Let's call it `TrainSpec`.
Users define `MnistTrainSpec`, which inherits from `TrainSpec`. We will have at least two benefits: the training settings are encapsulated in a single class rather than scattered across module functions, and a missing implementation surfaces early, as a `NotImplementedError` the first time we try to instantiate and use the above `MnistTrainSpec`.
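A minimal sketch of this first design (the base-class method set and the MNIST specifics are assumptions for illustration):

```python
import tensorflow as tf

class TrainSpec:
    # Each create_* method builds one piece of the training job; the
    # base class raises NotImplementedError for anything not overridden.
    def create_model(self):
        raise NotImplementedError

    def create_optimizer(self):
        raise NotImplementedError

    def create_loss_fn(self):
        raise NotImplementedError

    def create_dataset_fn(self):
        raise NotImplementedError

class MnistTrainSpec(TrainSpec):
    def create_model(self):
        return tf.keras.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(10),
        ])

    def create_optimizer(self):
        return tf.keras.optimizers.SGD(0.1)

    def create_loss_fn(self):
        return tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True)

    def create_dataset_fn(self):
        def dataset_fn():
            (x, y), _ = tf.keras.datasets.mnist.load_data()
            return tf.data.Dataset.from_tensor_slices(
                (x.astype("float32") / 255.0, y.astype("int64")))
        return dataset_fn
```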
**Another design of `TrainSpec`**

The above `TrainSpec` assumes that the different `create_*` functions are isolated. If we need to share information between those functions, we have to put it in the constructor. Another drawback is that multiple calls to the same `create_*` function may return different instances; for example, `create_model` will return two instances of `tf.keras.Model` if we call it twice.

Why do we run into this dilemma? Because we have done too much work for users. Users themselves have a better understanding of their model/loss/optimizer: they know when to create an optimizer, and what information to share between different functions such as the loss and the model.
What ElasticDL cares about is only the results: the model, the optimizer, the loss function, and the dataset function.
So we redefine `TrainSpec` as an abstract base class with only abstract properties. If we want to train an MNIST model, we can provide a class `MnistSpec` that implements all the abstract properties; users can use the functional API or subclassing to implement those properties and return the objects. Users can create the model/optimizer/loss_fn/dataset_fn anywhere; in the sketch below, the objects used by the properties are created directly in `MnistSpec`'s constructor. This new design gives us the same instance even if we access a property multiple times.
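A minimal sketch of this property-based design (the concrete MNIST model, optimizer, and loss are illustrative assumptions):

```python
import abc
import tensorflow as tf

class TrainSpec(abc.ABC):
    # ElasticDL only consumes these results; how and when they are
    # created is entirely up to the user.
    @property
    @abc.abstractmethod
    def model(self): ...

    @property
    @abc.abstractmethod
    def optimizer(self): ...

    @property
    @abc.abstractmethod
    def loss_fn(self): ...

    @property
    @abc.abstractmethod
    def dataset_fn(self): ...

class MnistSpec(TrainSpec):
    def __init__(self):
        # Create everything once; shared state (e.g. a weight vector
        # used by both the model and the loss) can live here as
        # ordinary attributes.
        self._model = tf.keras.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(10),
        ])
        self._optimizer = tf.keras.optimizers.SGD(0.1)
        self._loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True)

    @property
    def model(self):
        return self._model  # the same instance on every access

    @property
    def optimizer(self):
        return self._optimizer

    @property
    def loss_fn(self):
        return self._loss_fn

    @property
    def dataset_fn(self):
        def dataset_fn():
            (x, y), _ = tf.keras.datasets.mnist.load_data()
            return tf.data.Dataset.from_tensor_slices(
                (x.astype("float32") / 255.0, y.astype("int64")))
        return dataset_fn
```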