sql-machine-learning / elasticdl

Kubernetes-native Deep Learning Framework
https://elasticdl.org
MIT License

How to do optimization in pserver #1242

Open QiJune opened 5 years ago

QiJune commented 5 years ago

In TensorFlow 1.x, or similar computation-graph-based deep learning frameworks, we could serialize the model into a protobuf file. We could then split the graph into two parts, one for the worker and the other for the pserver: the worker runs the forward/backward part, and the pserver runs the optimization part.

However, in TensorFlow 2.0, there is no static computation graph and no serializable model; the model is just a Python program. So how could the pserver know how to do the optimization?

Besides, a trainable variable has not only its value but also many attributes, such as its initializer, regularizer, and constraint. These attributes are useful when doing optimization.

One solution is to send the tensor value and its attributes to the pserver over gRPC. Then we create a new tf.Variable on the fly on the pserver, so we can call optimizer.apply_gradients to update the value.
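A minimal sketch of this idea (names are hypothetical and the gRPC transport is elided; the message is represented as a plain dict):

```python
# Sketch: the worker packs a variable's value and gradient, the pserver
# rebuilds a tf.Variable on the fly and applies the optimizer to it.
import tensorflow as tf


def worker_pack(variable, gradient):
    # In practice this dict would be serialized into a gRPC message.
    return {
        "name": variable.name,
        "value": variable.numpy(),
        "grad": gradient.numpy(),
    }


class PServer:
    def __init__(self, optimizer):
        self.optimizer = optimizer
        self.variables = {}  # name -> tf.Variable, created on the fly

    def apply(self, msg):
        var = self.variables.get(msg["name"])
        if var is None:
            # Re-create the variable from the transferred value; attributes
            # such as initializer/regularizer/constraint are lost here.
            var = tf.Variable(msg["value"], name=msg["name"].split(":")[0])
            self.variables[msg["name"]] = var
        grad = tf.convert_to_tensor(msg["grad"])
        self.optimizer.apply_gradients([(grad, var)])
        return var.numpy()  # updated value sent back to the worker
```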

But this solution is expensive:

Let's discuss and find an efficient solution.

skydoorkai commented 5 years ago
QiJune commented 5 years ago

@skydoorkai

It seems that the problem mainly focuses on the embedding table.

shendiaomo commented 5 years ago

@skydoorkai

It seems that the problem mainly focuses on the embedding table.

  • initializer: should we initialize parameters lazily in the pserver? Currently, we first call lookup --> not found, then initialize in the worker --> then update to the pserver --> lookup again.
  • regularizer: do we support regularization of the embedding table? It seems hard.

The most common use case of embedding tables is in the ranking model or candidate-generation model of a large-scale recommendation system. In my experience, regularization is not so necessary in this scenario, because the large amount of training data in a recommender system makes it hard to overfit with SGD optimization.

On the other hand, there are some methods to implement L1 or L2 regularization for high-dimensional embedding tables; we can consider this later when necessary.

skydoorkai commented 5 years ago

@skydoorkai

It seems that the problem mainly focuses on the embedding table.

  • initializer: should we initialize parameters lazily in the pserver? Currently, we first call lookup --> not found, then initialize in the worker --> then update to the pserver --> lookup again.
  • regularizer: do we support regularization of the embedding table? It seems hard.

Currently, we do not support regularization for the elasticdl embedding layer. We could add regularization for the embedding vectors accessed in the minibatch, though we are not sure whether this would have a regularization effect similar to regularizing the whole embedding table.
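A rough sketch of this minibatch-only regularization (the helper is hypothetical, not the elasticdl API): the L2 penalty is computed only over the embedding rows gathered in the current batch, so the extra gradient stays sparse.

```python
import tensorflow as tf


def minibatch_l2_loss(embedding_table, ids, lam=1e-4):
    # Deduplicate ids so each accessed vector is penalized once per step.
    unique_ids, _ = tf.unique(tf.reshape(ids, [-1]))
    accessed = tf.nn.embedding_lookup(embedding_table, unique_ids)
    return lam * tf.reduce_sum(tf.square(accessed))


# Usage inside a training step (assumed setup):
#   loss = task_loss + minibatch_l2_loss(embedding_table, batch_ids)
# The gradient of the penalty only touches the rows in `unique_ids`.
```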

terrytangyuan commented 5 years ago

On the other hand, there are some methods to implement L1 or L2 regularization for high-dimensional embedding tables; we can consider this later when necessary.

@shendiaomo Do you have any references on this?

shendiaomo commented 5 years ago

On the other hand, there are some methods to implement L1 or L2 regularization for high-dimensional embedding tables; we can consider this later when necessary.

@shendiaomo Do you have any references on this?

The idea is quite intuitive: we record the latest step t0 at which the embedding w_f of a sparse feature f was updated. If f appears again at a later step t1, we can apply the accumulated regularization to the embedding in one shot (for L2, something like w_f <- w_f * (1 - eta * lambda)^(t1 - t0)). L1 can be done in the same way with a few more boundary-condition checks.

The method above can reduce the computation cost of regularizing large-scale embedding tables to an acceptable level because it does not need gradients. However, it may require the PS to participate in the computation.
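A sketch of this lazy-decay trick, under the assumption above that the L2 regularization amounts to a multiplicative decay per step (all names here are hypothetical, not the elasticdl API):

```python
import numpy as np


class LazyL2EmbeddingTable:
    def __init__(self, dim, lr=0.1, lam=1e-4):
        self.dim = dim
        self.decay = 1.0 - lr * lam  # per-step multiplicative L2 decay
        self.vectors = {}            # feature id -> np.ndarray
        self.last_step = {}          # feature id -> step of last update

    def lookup(self, fid, step):
        vec = self.vectors.setdefault(fid, np.zeros(self.dim, dtype=np.float32))
        skipped = step - self.last_step.get(fid, step)
        if skipped > 0:
            # Apply the decay for all the steps in which fid did not appear.
            vec *= self.decay ** skipped
        self.last_step[fid] = step
        return vec

    def update(self, fid, grad, lr=0.1):
        # Plain SGD update for the vector touched in this step.
        self.vectors[fid] -= lr * grad
```

Because the decay for the skipped steps is applied at lookup time, the per-step cost stays proportional to the number of features in the minibatch rather than to the size of the table.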

I saw this trick in an old paper whose title I can't remember.