sql-machine-learning / elasticdl

Kubernetes-native Deep Learning Framework
https://elasticdl.org
MIT License

Move training and evaluation related functions from module functions to member functions #1006

Open chunyang-wen opened 5 years ago

chunyang-wen commented 5 years ago

Currently, to train a model, the ElasticDL training worker requires the following command-line parameters:

These parameters are the names of functions in the module where we define model_def. ElasticDL uses those functions in different stages of training or evaluation. Because they are defined as module-level functions, this approach has the following limitations:

As a result, we prefer a unified class to encapsulate them. Let's call it TrainSpec.

from abc import ABC, abstractmethod

# Mode is assumed to be ElasticDL's enum of job modes (e.g. Mode.TRAINING).
class TrainSpec(ABC):
    def __init__(self, context=None):
        """
        Args:
            context: a dict of contextual information from ElasticDL,
                such as worker_id, master ip.
        """
        self._context = context

    @abstractmethod
    def create_optimizer(self, lr=0.1):
        """Returns a TensorFlow optimizer such as `tf.train.GradientDescentOptimizer`."""

    @abstractmethod
    def create_loss(self, outputs, labels):
        """Returns a tensor that represents the loss computed from outputs and labels."""

    @abstractmethod
    def create_metrics(self,
                       mode=Mode.TRAINING,
                       outputs=None,
                       labels=None,
                       predictions=None):
        """Returns a dict of metrics."""

    @abstractmethod
    def create_model(self):
        """Returns the created model.

        Users can use the functional API or subclassing to create a customized Keras model.
        """

class MnistTrainSpec(TrainSpec):
    def create_optimizer(self, lr=0.1):
        return tf.train.GradientDescentOptimizer(lr)
    # Other methods are intentionally left undefined.

Users define MnistTrainSpec, which inherits from TrainSpec. This gives us at least two benefits:

Another design of TrainSpec

The above TrainSpec assumes that the different create_* functions are isolated from each other. If we need to share information between them, we have to put it in the constructor. Another drawback is that multiple calls to the same create_* function may return different instances. For example, create_model returns two separate instances of tf.keras.Model if we call it twice.

Why do we run into this dilemma? Because we have done too much work for users. Users understand their own model/loss/optimizer better than we do. They know when to create an optimizer and what information to share between different functions such as the loss and the model.

What ElasticDL cares about is only the results:

So we redefine TrainSpec as an abstract base class with only abstract properties. If we want to train an MNIST model, we can provide a class MnistSpec that implements all the abstract properties. Users can use the functional API or subclassing to build the objects behind those properties and return them.

import abc

class TrainSpec(abc.ABC):
    def __init__(self, context=None):
        """
        Args:
            context: A dict of contextual information from ElasticDL,
                such as worker_id, master ip.
        """
        self._context = context or {}

    @property
    @abc.abstractmethod
    def optimizer(self):
        """Returns an instance of a TensorFlow optimizer, such as `tf.train.GradientDescentOptimizer`."""

    @property
    @abc.abstractmethod
    def loss_fn(self):
        """Returns a loss function that takes `outputs` and `labels`.

        You can return a function like `loss_fn`:

            def loss_fn(outputs, labels):
                return tf.losses.mean_squared_error(labels, outputs)
        """

    @property
    @abc.abstractmethod
    def model(self):
        """Returns an instance of `tf.keras.Model` or of one of its subclasses."""

    @property
    @abc.abstractmethod
    def dataset_fn(self):
        """Returns a function that generates a tf.data.Dataset instance.

        You can return a dataset function like `dataset_fn`:

            def dataset_fn(files):
                # blabla
                return dataset
        """

    @property
    @abc.abstractmethod
    def metrics_fn(self):
        """Returns a metric function with the parameters shown below.

        You can return a function like `metrics_fn`
        (Mode is assumed to be ElasticDL's enum of job modes):

            def metrics_fn(mode=Mode.TRAINING,
                           outputs=None,
                           labels=None,
                           predictions=None):
                if mode == Mode.EVALUATION:
                    return {"accuracy": tf.metrics.accuracy(labels, predictions)}
                else:
                    return {}
        """

import tensorflow as tf

class MnistModel(tf.keras.Model):
    def call(self, inputs, training=False):
        pass

def _loss_fn(outputs, labels):
    return tf.losses.mean_squared_error(labels, outputs)

class MnistSpec(TrainSpec):
    def __init__(self, context=None):
        super(MnistSpec, self).__init__(context=context)
        # GradientDescentOptimizer requires a learning rate; 0.1 matches the earlier example.
        self._optimizer = tf.train.GradientDescentOptimizer(0.1)
        self._model = MnistModel()

    @property
    def optimizer(self):
        return self._optimizer

    @property
    def model(self):
        return self._model

    @property
    def loss_fn(self):
        return _loss_fn
    # Other properties are intentionally left undefined.

Users can create the model/optimizer/loss_fn/dataset_fn anywhere. The above code creates the objects returned by the properties directly in MnistSpec's constructor.

This new design will give us the same instance even if we access any property multiple times.
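The difference between the two designs can be sketched without TensorFlow. In the minimal example below, `FakeModel`, `FactorySpec`, and `PropertySpec` are hypothetical stand-ins introduced only for illustration; they are not ElasticDL classes. The sketch assumes the property-based spec builds its object once in the constructor, as MnistSpec does above.

```python
class FakeModel:
    """Hypothetical stand-in for a tf.keras.Model instance."""


class FactorySpec:
    """First design: every call to create_model builds a new object."""

    def create_model(self):
        return FakeModel()


class PropertySpec:
    """Second design: the object is built once and exposed as a property."""

    def __init__(self):
        self._model = FakeModel()

    @property
    def model(self):
        return self._model


factory_spec = FactorySpec()
property_spec = PropertySpec()

# Two calls to the factory method give two distinct objects.
factory_returns_same = factory_spec.create_model() is factory_spec.create_model()

# Two reads of the property give the same object.
property_returns_same = property_spec.model is property_spec.model

print(factory_returns_same, property_returns_same)
```

Any state attached to the model, such as its weights, therefore stays consistent across accesses in the second design.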

terrytangyuan commented 5 years ago

I agree that those arguments are neither ideal nor scalable. I remember these were suggested by the high-level API design here so that users can configure different models, inputs, optimizers, etc. cc @wangkuiyi @ywskycn @skydoorkai

wangkuiyi commented 5 years ago

I am trying to understand https://github.com/wangkuiyi/elasticdl/issues/1006#issue-478372383. My understanding follows.

Currently, the function elasticdl.train requires the following parameters:

The above bullets form the settings of a training job. This inspires us to give them a single name and we can pass it as a single parameter to elasticdl.train.

We can define them as classes and/or functions in a module and refer to the bullets as the module name. However, in some cases, we might need to define variables for information sharing or exchange between these bullets. If we use a module name, we'd have to define these variables as global variables in the module. However, global variables are often problematic.

An alternative is to define the bullets as nested classes and/or methods of a class, so we can define the variables as class members. We prefer this way.

To better describe the idea, let us look at an example. Suppose that we want to train a model defined as a tf.keras.Model-derived class EasyAndHappy using the MNIST dataset. We can define the following class:

class EasyAndHappyTrainerWithMNIST(elasticdl.Training):
    class EasyAndHappy(tf.keras.Model):
        def __init__(self):
            super().__init__()
            create_some_parameters_here()

        def call(self, ...):
            do_something_here()

    def dataset(self):
        train, test = tf.keras.datasets.mnist.load_data()
        return train

    def cost(self, ...):
        return_some_cost()

terrytangyuan commented 5 years ago

@wangkuiyi Just a nitpick here: dataset() should probably be separated from the model.

chunyang-wen commented 5 years ago

Is it necessary to create a separate Spec for ElasticDL?

The short answer is yes.

First, let's review the example in ElasticDL's model zoo that we mentioned during the last meeting: model_zoo/deepfm_functional_api/deepfm_functional_api.py.

In that module, we need to create a global variable named AUC_metric in order to share information between our model and our metric function. The metric function uses AUC_metric to calculate AUC. Global variables sometimes indicate a bad design.

AUC_metric is an instance of tf.keras.metrics.AUC. It has three member functions that manipulate its data:

When we create an instance of tf.keras.metrics.AUC, we continuously update its state and call result() to get the final result. So we have to reuse the same instance throughout a single evaluation (one complete pass over the evaluation dataset).

We create the function that calculates metrics in a function like create_metrics_fn. That function uses metrics such as tf.keras.metrics.AUC to do the calculation. ElasticDL then keeps calling it in evaluation_process whenever our worker receives evaluation tasks. Two potential problems can happen in the evaluation_process function:
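To illustrate why the same instance has to be reused, here is a minimal sketch of a stateful metric. `RunningAccuracy` is a hypothetical class written for this comment; it only loosely mirrors the update-then-read pattern of Keras metrics and is not the `tf.keras.metrics.AUC` implementation.

```python
class RunningAccuracy:
    """Hypothetical stateful metric that accumulates over batches."""

    def __init__(self):
        self.reset_states()

    def reset_states(self):
        # Forget everything seen so far.
        self._correct = 0
        self._total = 0

    def update_state(self, labels, predictions):
        # Fold one batch into the running counts.
        for label, prediction in zip(labels, predictions):
            self._correct += int(label == prediction)
            self._total += 1

    def result(self):
        return self._correct / self._total if self._total else 0.0


# Two evaluation batches of (labels, predictions).
batches = [([1, 0], [1, 1]), ([1, 1], [1, 1])]

# Reusing one instance accumulates over the whole evaluation pass.
shared = RunningAccuracy()
for labels, predictions in batches:
    shared.update_state(labels, predictions)
shared_result = shared.result()

# Creating a fresh instance per batch only reflects the last batch.
for labels, predictions in batches:
    fresh = RunningAccuracy()
    fresh.update_state(labels, predictions)
fresh_result = fresh.result()

print(shared_result, fresh_result)
```

The shared instance reports the accuracy over all four examples, while the per-batch instance silently drops the first batch.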

Take product recommendation as another example. Users initialize a weight vector for different product types, and they also want to calculate a weighted loss according to this weight vector. This means that we have to share the weight vector between the loss and create_model functions. If we do not introduce a new Spec class, we again have to use global variables to share this weight vector.
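A minimal sketch of this case, assuming the weights are a plain dict keyed by product type. `RecommendationSpec` and the `product_types` argument of the loss function are hypothetical names chosen for illustration, and the loss signature here is wider than the `(outputs, labels)` form discussed earlier.

```python
class RecommendationSpec:
    """Hypothetical spec that shares per-product-type weights."""

    def __init__(self, type_weights):
        # Shared state lives on the instance, not in a module-level global.
        self._type_weights = dict(type_weights)

    @property
    def type_weights(self):
        # The model-building code can read the same weights from here.
        return self._type_weights

    @property
    def loss_fn(self):
        weights = self._type_weights

        def weighted_squared_error(outputs, labels, product_types):
            # Weight each example's squared error by its product type.
            total = 0.0
            for output, label, product_type in zip(outputs, labels, product_types):
                total += weights[product_type] * (output - label) ** 2
            return total / len(outputs)

        return weighted_squared_error


spec = RecommendationSpec({"book": 1.0, "phone": 2.0})
loss = spec.loss_fn([0.5, 1.0], [1.0, 0.0], ["book", "phone"])
print(loss)
```

Both the loss and any model code built from the same spec instance see one copy of the weights, so no global variable is needed.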

Note: The current model evaluation process is not correct for metrics that cannot be averaged. For the AUC metric, each worker calculates AUC from the tasks it receives and reports it. The master collects all those values and averages them, which is not correct. If we still use tf.keras.metrics.AUC to calculate the AUC metric, we have to update the evaluation process:

If we agree on the previous update, then for the case of tf.keras.metrics.AUC we do not need a Spec. But there are other situations in which introducing a Spec class makes it easy to share variables without using global variables.
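The averaging problem noted above can be seen in a small worked example. The sketch below uses an exact pairwise AUC written in plain Python (`pairwise_auc` is a hypothetical helper, not the thresholded approximation that `tf.keras.metrics.AUC` computes), and the two worker datasets are made up for illustration.

```python
def pairwise_auc(labels, scores):
    """Exact AUC: share of (positive, negative) pairs ranked correctly."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    ranked = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                ranked += 1.0
            elif p == n:
                ranked += 0.5  # ties count as half a correct pair
    return ranked / (len(positives) * len(negatives))


# Each worker sees one positive and one negative example.
worker_1 = ([1, 0], [0.9, 0.8])
worker_2 = ([1, 0], [0.3, 0.2])

# What the master gets if it averages the per-worker AUC values.
averaged_auc = (pairwise_auc(*worker_1) + pairwise_auc(*worker_2)) / 2

# What the AUC is over the combined evaluation data.
all_labels = worker_1[0] + worker_2[0]
all_scores = worker_1[1] + worker_2[1]
global_auc = pairwise_auc(all_labels, all_scores)

print(averaged_auc, global_auc)
```

Each worker ranks its own pair perfectly, so the average is 1.0, yet worker 2's positive (0.3) scores below worker 1's negative (0.8), so the AUC over the combined data is 0.75.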