tensorflow / addons

Useful extra functionality for TensorFlow 2.x maintained by SIG-addons
Apache License 2.0

Feature Request: Stochastic Depth/ResDrop #2032

Closed. MHStadler closed this issue 4 years ago.

MHStadler commented 4 years ago

Describe the feature and the current behavior/state. Stochastic depth is a regularization technique for training very deep residual networks (the authors, for example, train ResNets of up to 1202 layers). In particular, it trains networks that are, on average, shallower during training, yet retain their full depth at inference time.

Previous attempts (https://github.com/tensorflow/addons/issues/626) seem to have been unnecessarily convoluted. This implementation would be a simple layer attached to the end of a residual branch (as suggested here: https://github.com/tensorflow/tensorflow/issues/8817#issuecomment-290788333, which is also how it is described in the Shake-Drop paper: https://arxiv.org/abs/1802.02375).

Notably, this means that if momentum is being used, the dropped layers will still receive small updates due to their historical gradients (similar to Dropout).
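For reference, a minimal sketch of what such a layer might look like; the class name `StochasticDepth` and the `survival_p` argument are placeholders here, not a settled API. During training the whole branch is kept or dropped by a single Bernoulli draw, and at inference it is always kept but scaled by its survival probability.

```python
import tensorflow as tf


class StochasticDepth(tf.keras.layers.Layer):
    """Drops an entire residual branch with probability 1 - survival_p."""

    def __init__(self, survival_p=0.5, **kwargs):
        super().__init__(**kwargs)
        self.survival_p = survival_p

    def call(self, inputs, training=None):
        if training:
            # Single Bernoulli draw: keep or drop the whole branch at once.
            keep = tf.cast(tf.random.uniform([]) < self.survival_p, inputs.dtype)
            return keep * inputs
        # At inference the branch is always present, scaled by its survival
        # probability so the expected output matches training.
        return self.survival_p * inputs

    def get_config(self):
        return {**super().get_config(), "survival_p": self.survival_p}
```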

Relevant information

Which API type would this fall under (layer, metric, optimizer, etc.)?
Layer.

Who will benefit with this feature?
Anybody who wants to use stochastic depth to train deeper ResNets, or who wants to recreate the EfficientNet architecture (https://arxiv.org/abs/1905.11946). Anyone who wants to add Shake-Drop (https://arxiv.org/abs/1802.02375) to their network can use this as a base.

Any other info.
It is important to note that stochastic depth is not the same as Dropout with noise_shape=(1, 1, 1), as suggested in the TensorFlow EfficientNet implementation (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/applications/efficientnet.py#L511). With Dropout, the branch is simply kept as-is at inference time; with stochastic depth, the branch is re-scaled by its survival probability before being merged with the main network (see the sketch below).
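To make that distinction concrete, here is how the `StochasticDepth` layer sketched above could sit in a residual block; the block structure and names are illustrative, and the layer, rather than Dropout, handles the inference-time rescaling described above.

```python
def residual_block(x, filters, survival_p=0.8):
    """Basic residual block with stochastic depth on the residual branch.

    Assumes `filters` matches the channel count of `x` so the Add is valid.
    """
    shortcut = x
    y = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    # Drop the branch with probability 1 - survival_p during training,
    # scale it by survival_p at inference, then merge as usual.
    y = StochasticDepth(survival_p)(y)
    return tf.keras.layers.Activation("relu")(tf.keras.layers.Add()([shortcut, y]))
```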

WindQAQ commented 4 years ago

Sounds good to me. Feel free to open a PR and request my review 😄 Also, feel free to ping me if you have any problems with the testing suite or style conventions. Thank you.

MHStadler commented 4 years ago

@WindQAQ I have two working implementations. One is a wrapper that wraps the addition layer where the residual branch is merged with the main network.

The other is simply a layer that either replaces the Add itself, or is placed last in the residual branch, scaling the branch before the normal add.
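For illustration, the two call sites might differ roughly as below. `StochasticDepthWrapper` is a hypothetical name sketched here, not necessarily how the linked wrappers.py implements it, and `StochasticDepth` refers to the layer sketched earlier in the thread.

```python
class StochasticDepthWrapper(tf.keras.layers.Wrapper):
    """Illustrative wrapper variant: owns the merge layer (e.g. Add) and
    drops/rescales the residual input before merging."""

    def __init__(self, layer, survival_p=0.5, **kwargs):
        super().__init__(layer, **kwargs)
        self.survival_p = survival_p

    def call(self, inputs, training=None):
        shortcut, branch = inputs
        if training:
            keep = tf.cast(tf.random.uniform([]) < self.survival_p, branch.dtype)
            branch = keep * branch
        else:
            branch = self.survival_p * branch
        return self.layer([shortcut, branch])


# Placeholder tensors standing in for the main path and the residual branch.
shortcut = tf.keras.Input(shape=(32, 32, 16))
branch = tf.keras.layers.Conv2D(16, 3, padding="same")(shortcut)

# Wrapper variant: the stochastic-depth logic wraps the Add itself.
out_wrapper = StochasticDepthWrapper(tf.keras.layers.Add(), survival_p=0.8)(
    [shortcut, branch]
)

# Layer variant: scale the branch first, then use a normal Add.
out_layer = tf.keras.layers.Add()([shortcut, StochasticDepth(survival_p=0.8)(branch)])
```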

They seem equally performant, and both have the same regularizing effect (see this CIFAR-10 ResNet32 benchmark: https://github.com/MHStadler/tf_keras_playground/blob/master/Cifar10_StochDepth_Showcase.ipynb).

So I guess it just depends on which one fits better. I prefer the layer (it's more straightforward and also more flexible), but previous discussions centered around the wrapper idea, so I figured I'd ask.

The barebones layer implementation is available here: https://github.com/MHStadler/tf_keras_playground/blob/master/tf_keras_playground/layers/stochastic_depth.py And the wrapper one here: https://github.com/MHStadler/tf_keras_playground/blob/master/tf_keras_playground/layers/wrappers.py

For how they are used in the model, see the model code: https://github.com/MHStadler/tf_keras_playground/blob/master/tf_keras_playground/models/resnet.py#L64 (0 means constant depth, 1 means the wrapper, and 2 means the layer-based stochastic depth).

Let me know which one you think would be better suited. Note that both have the "issue" that dropped layers are still updated from historical gradients. My benchmark used NAG momentum like the original paper, and the regularizing effect remains; it just lacks the performance gain of completely skipping the layers in the forward pass (looking at the original PyTorch implementation, I don't think this is achievable in TensorFlow without writing a custom training loop). But this behaviour is in line with how the closely related Dropout behaves in TensorFlow, so I think it should be fine.

WindQAQ commented 4 years ago

Nice survey! Vote for Layer subclass.

MHStadler commented 4 years ago

Okay - I guess Layer it is

Shouldn't take too long to finish it up