joeyearsley / efficient_densenet_tensorflow

A memory efficient implementation of densenet.

Can this be applied to Keras #1

Closed stillwaterman closed 5 years ago

stillwaterman commented 5 years ago

Great work! I am using Keras to build my model and I want to reduce the memory usage of DenseNet. Can this project be used directly with Keras? Could you show some Keras examples or a Keras user guide?

joeyearsley commented 5 years ago

Are you using a tensorflow backend?

stillwaterman commented 5 years ago

@joeyearsley Yes, I use TensorFlow as the backend.

joeyearsley commented 5 years ago

So you should still be able to use this function in Keras, since tf.layers calls tf.keras.layers under the hood.

https://github.com/joeyearsley/efficient_densenet_tensorflow/blob/eef1190478450ef6df12ce3f9d630c03eb6333dc/models/densenet_creator.py#L106

But these lines are the most important and will still work in Keras: https://github.com/joeyearsley/efficient_densenet_tensorflow/blob/eef1190478450ef6df12ce3f9d630c03eb6333dc/models/densenet_creator.py#L136-L140
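
The pattern in those lines boils down to something like this (a minimal sketch assuming TF 1.x graph mode; the function and scope names here are illustrative rather than the repo's exact code):

import tensorflow as tf

def _conv_block(ip, n_filters, name):
    # The wrapped function is re-run on the backward pass, so its activations
    # don't have to be kept in memory after the forward pass.
    def _x(inner_ip):
        return tf.layers.conv2d(inner_ip, n_filters, 3, padding='same',
                                use_bias=False)

    _x = tf.contrib.layers.recompute_grad(_x)

    # recompute_grad requires ResourceVariables, hence use_resource=True;
    # each block needs its own scope name.
    with tf.variable_scope(name, use_resource=True):
        return _x(ip)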

stillwaterman commented 5 years ago

Thanks for the reminder. Actually, I don't know how to train my model. I saw your training code, which trains the model in TensorFlow style; however, I'm using fit_generator in Keras. Could you show some training code in fit_generator style? A sample would be fine.
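
For reference, what I mean by fit_generator style is roughly the following (just a minimal sketch of my setup; the tiny stand-in model and the CIFAR-10 augmentation settings are placeholders, not your repo's code):

from keras.datasets import cifar10
from keras.layers import Conv2D, Dense, GlobalAveragePooling2D
from keras.models import Sequential
from keras.preprocessing.image import ImageDataGenerator
from keras.utils import to_categorical

# Tiny stand-in model; in practice this would be the DenseNet.
model = Sequential([
    Conv2D(16, 3, padding='same', activation='relu', input_shape=(32, 32, 3)),
    GlobalAveragePooling2D(),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

datagen = ImageDataGenerator(horizontal_flip=True, width_shift_range=0.125,
                             height_shift_range=0.125)
model.fit_generator(datagen.flow(x_train, y_train, batch_size=64),
                    steps_per_epoch=len(x_train) // 64,
                    epochs=300,
                    validation_data=(x_test, y_test))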

stillwaterman commented 5 years ago

Sorry to bother you; I think I can implement this by myself easily. Another small question: do we really need Horovod? Perhaps you just used the Horovod framework to build your model. Am I missing some important detail about how Horovod saves memory?

joeyearsley commented 5 years ago

You are correct, Horovod isn't a crucial detail.

I included Horovod and mixed-precision training to show how to scale to the max, but in the majority of cases it won't be needed.

stillwaterman commented 5 years ago

Thanks for your reply, but I got some errors when trying to use _x = tf.contrib.layers.recompute_grad(_x). At first, I simply added it to my _conv_block, which gave rise to an error. Error message: TypeError: All variables used by a function wrapped with @custom_gradient must be ResourceVariables. Ensure that no variable_scope is created with use_resource=False.

Then I tried to fix this problem by adding with tf.variable_scope('backbone_denseblock_{}'.format(block_idx), use_resource=True), but I got another error. Error message: AttributeError: 'NoneType' object has no attribute '_inbound_nodes'.

At that point I thought it was a problem with the TensorFlow function, so I tried to wrap it with a Lambda: recompute_grad_cp = Lambda(lambda dx: tf.contrib.layers.recompute_grad(dx)); _x = recompute_grad_cp(_x). But the Lambda layer only accepts a Keras tensor as input. Could you give me some advice about using _x = tf.contrib.layers.recompute_grad(_x)?

stillwaterman commented 5 years ago

I saw a warning about this function in the TensorFlow documentation: Warning: Because the function will be called again on the backwards pass, the user should be careful to not use ops in their function that mutate state or have randomness (for example, batch normalization or dropout). If the function does have such operations, it is recommended that the function take the is_recomputing keyword argument which will be False on the forward pass and True on the backwards pass so that it can disable state changes when is_recomputing=True (for example, not updating the moving averages in batch normalization).

Maybe we shouldn't include dropout and BN layers in the wrapped function. Is that right?
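
If I understand the warning, the pattern would be something like this (just my untested sketch of the is_recomputing idea with tf.layers-style batch norm; the exact handling of the update ops depends on the training loop):

import tensorflow as tf

def block(x, is_recomputing=False):
    # recompute_grad passes is_recomputing=True when it re-runs this function
    # on the backward pass, so stateful ops can behave differently there.
    n_before = len(tf.get_collection(tf.GraphKeys.UPDATE_OPS))
    x = tf.layers.batch_normalization(x, training=True)
    x = tf.nn.relu(x)
    x = tf.layers.conv2d(x, 32, 3, padding='same', use_bias=False)
    if is_recomputing:
        # Drop the duplicate moving-average update ops created by the re-run,
        # so the batch-norm statistics are only updated once per step.
        updates = tf.get_collection_ref(tf.GraphKeys.UPDATE_OPS)
        del updates[n_before:]
    return x

block = tf.contrib.layers.recompute_grad(block)

inputs = tf.random_normal([4, 32, 32, 16])
with tf.variable_scope('checkpointed_block', use_resource=True):
    outputs = block(inputs)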

joeyearsley commented 5 years ago

Ahhhh, you need to wrap it in a Keras Layer so Keras can store the inbound nodes.

https://keras.io/layers/writing-your-own-keras-layers/

However, there's probably a bit more to it than just that page.

And yes, you are correct on the dropout front; I just pieced it together quickly and didn't really use dropout.

However, for batch norm I think it depends on your implementation of the update ops.
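
E.g. if you run the update ops through the usual collection pattern, something like this (a rough sketch, with a toy graph standing in for the real network and loss):

import tensorflow as tf

# Toy graph standing in for the real network and loss.
x = tf.random_normal([8, 32, 32, 16])
h = tf.layers.batch_normalization(x, training=True)
loss = tf.reduce_mean(tf.layers.conv2d(h, 1, 1))

# Standard TF 1.x pattern: make the train op depend on the batch-norm
# moving-average updates so they actually run each step.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.MomentumOptimizer(0.1, 0.9).minimize(loss)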

stillwaterman commented 5 years ago

Hi bro! Due to an Internet outage, I'm very sorry for not replying sooner these past few days. I tried wrapping it in a Keras Layer, and here is my code:

import tensorflow as tf
from keras.layers import Conv2D, Layer
from keras.regularizers import l2


class Back_Recompute(Layer):
    def __init__(self, filters, kernel_size, w_decay, **kwargs):
        self.n_filters = filters
        self.we_decay = w_decay
        self.ks = kernel_size
        super(Back_Recompute, self).__init__(**kwargs)

    def call(self, ip):
        def _x(inner_ip):
            x = Conv2D(self.n_filters, self.ks, kernel_initializer='he_normal', padding='same',
                       use_bias=False, kernel_regularizer=l2(self.we_decay))(inner_ip)
            return x

        # Recompute the conv activations on the backward pass instead of storing them.
        _x = tf.contrib.layers.recompute_grad(_x)

        return _x(ip)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[1], input_shape[2], self.n_filters)

The good news is that I no longer hit the 'NoneType' object has no attribute '_inbound_nodes' problem. The bad news is that tf.variable_scope('backbone_denseblock_{}'.format(block_idx), use_resource=True) doesn't seem to work anymore, and the error message (All variables used by a function wrapped with @custom_gradient must be ResourceVariables. Ensure that no variable_scope is created with use_resource=False.) came back again! Do you have a solution? This is not easy for me. :(

joeyearsley commented 5 years ago

Could you try placing the variable scope inside the call function?

stillwaterman commented 5 years ago

I tried it like this:

import tensorflow as tf
from keras.layers import Conv2D, Layer
from keras.regularizers import l2

# Global counter so each layer instance gets its own uniquely named scope.
brcount = 0


class Back_Recompute(Layer):
    def __init__(self, filters, kernel_size, w_decay, **kwargs):
        self.n_filters = filters
        self.we_decay = w_decay
        self.ks = kernel_size
        super(Back_Recompute, self).__init__(**kwargs)

    def call(self, ip):
        global brcount
        # use_resource=True makes the variables ResourceVariables, as required
        # by functions wrapped with recompute_grad / @custom_gradient.
        with tf.variable_scope('denseblock_{}'.format(brcount), use_resource=True):
            def _x(inner_ip):
                x = Conv2D(self.n_filters, self.ks, kernel_initializer='he_normal', padding='same',
                           use_bias=False, kernel_regularizer=l2(self.we_decay))(inner_ip)
                return x

            brcount = brcount + 1
            _x = tf.contrib.layers.recompute_grad(_x)

            return _x(ip)

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[1], input_shape[2], self.n_filters)

Fortunately, I didn't get any error messages; it looks promising and the model compiles. But the training program got stuck at fit_generator with no error message; it was just stuck.

joeyearsley commented 5 years ago

Unusual, does it work with any other fit method?

stillwaterman commented 5 years ago

I tried the fit method and it also got stuck, even without any data: fit(x=None, y=None, steps_per_epoch=1). If I don't use the Back_Recompute layer, the model starts training. This is very strange.

joeyearsley commented 5 years ago

Can you use some print statements or TF Prints to diagnose when it stops?
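
E.g. something like this (a rough sketch; tf.Print is an identity op that logs when the node actually executes, which helps separate a graph-build hang from a run-time hang):

import tensorflow as tf

x = tf.random_normal([2, 32, 32, 16])
# tf.Print passes x through unchanged and logs the message (plus the shape
# here) to stderr every time this node runs, so you can tell whether the
# graph is even being executed.
x = tf.Print(x, [tf.shape(x)], message='entering Back_Recompute: ')

with tf.Session() as sess:
    sess.run(x)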

stillwaterman commented 5 years ago

Hi, I think it may be because tf.variable_scope doesn't work with Keras layers, so I changed Conv2D to tf.layers.conv2d. Then I got an error message: AttributeError: 'Activation' object has no attribute 'outbound_nodes'. Any advice on this?

joeyearsley commented 5 years ago

Unfortunately not; could you raise this as an issue with TensorFlow?

I believe it may be due to some slight API inconsistency across layers, which is weird, as tf.layers calls tf.keras.layers under the hood.

Sirius083 commented 5 years ago

Thanks for the explanation of the Horovod package, since it cannot be installed on Windows.

joeyearsley commented 5 years ago

There are always numerous other ways to do this, like using TF's distributed Estimators or implementing your own parameter server.

I've not used Windows in years so I can't comment. Maybe take it up with the Horovod team?
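
For example, the distributed Estimator route looks roughly like this on TF 1.14 (just a sketch; the trivial model_fn and dummy input_fn stand in for the real DenseNet graph and CIFAR-10 pipeline):

import tensorflow as tf

def model_fn(features, labels, mode):
    # Trivial model_fn standing in for the real DenseNet graph.
    logits = tf.layers.dense(tf.layers.flatten(features), 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

def input_fn():
    # Dummy data; in practice this would be the real input pipeline.
    images = tf.random_normal([64, 32, 32, 3])
    labels = tf.random_uniform([64], maxval=10, dtype=tf.int32)
    return tf.data.Dataset.from_tensor_slices((images, labels)).repeat().batch(8)

# Mirror the model across all local GPUs instead of using Horovod.
strategy = tf.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
estimator.train(input_fn, max_steps=100)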

Sirius083 commented 5 years ago

Thanks. I ran the code on one GPU (GTX 1080 Ti); when batch_size was set to 3750 it reported a memory error, so I tried the original paper's parameters (batch_size=512, init_lr=0.1, decreased at epochs 150 and 225), and the result is around 88.09%. Can you tell us the final accuracy on CIFAR-10 at batch_size 3750? Thanks in advance.

RayDeeA commented 5 years ago

Any new insight on this issue? Is gradient checkpointing in tensorflow.keras 1.14 somehow possible?

ghost commented 5 years ago

I got training to work (see Stack Overflow 53568202), but I cannot load the trained model. When I do, I get the following error:

ValueError: The variables used on recompute were different than the variables originally
used. The function wrapped with @recompute_grad likley creates its own variable
scope with a default name and has been called twice in the same enclosing scope.
To fix, ensure each call to the function happens in its own unique variable
scope.

I think this is because TensorFlow is executing in eager mode when using recompute_grad, so the variable scopes aren't being saved. Maybe this can be overcome using EagerVariableStore.

Does anyone know how to load a model trained using recompute_grad?
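
The only workaround I can think of, following the error's own suggestion, is to give every call its own explicitly named scope so the variable names are reproducible when the graph is rebuilt at load time, roughly like this (untested sketch):

import tensorflow as tf

def conv_fn(x):
    return tf.layers.conv2d(x, 16, 3, padding='same', use_bias=False)

conv_fn = tf.contrib.layers.recompute_grad(conv_fn)

x = tf.random_normal([2, 32, 32, 16])
# Explicitly name each call's scope (rather than relying on auto-uniquified
# default names) so the same variables are found on recompute and on reload.
for i in range(3):
    with tf.variable_scope('recompute_block_{}'.format(i), use_resource=True):
        x = conv_fn(x)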

joeyearsley commented 5 years ago

@Sirius083 The aim of this repo isn't to reach a certain result; it's just to show how a larger batch size can fit into memory with nothing but engineering. It doesn't even report a validation accuracy or have a test script.

@RayDeeA This should be possible; however, further engineering might be needed, like they do in the RevBlock layer.

fangkuann commented 4 years ago

https://github.com/IndicoDataSolutions/finetune/blob/development/finetune/base_models/bert/modeling.py#L999 It seems that the recompute_grads implementation in this repo works. Has anyone else applied this code? I'd welcome any explanation of it.