Closed MHStadler closed 4 years ago
Sounds good to me. Feel free to open a PR and request my review 😄 Also, feel free to ping me if you have any problems with the testing suite or styles. Thank you.
@WindQAQ I have two implementations that work - one is a wrapper that wraps the addition layer, applied before the residual branch is merged with the main network
The other is simply a layer that either replaces the addition itself, or alternatively can be placed last in the residual branch, scaling the branch before the normal add
They seem to be equally performant, and both have the same regularizing effect (see this CIFAR10 ResNet32 benchmark https://github.com/MHStadler/tf_keras_playground/blob/master/Cifar10_StochDepth_Showcase.ipynb)
So I guess it just depends on which one fits better - I prefer the layer (since it's more straightforward, and also more flexible), but previous discussions were centered around the wrapper idea, so I figured I'd ask
The barebones layer implementation is available here: https://github.com/MHStadler/tf_keras_playground/blob/master/tf_keras_playground/layers/stochastic_depth.py
And the wrapper one here: https://github.com/MHStadler/tf_keras_playground/blob/master/tf_keras_playground/layers/wrappers.py
For how they are used in the model, see the model code: https://github.com/MHStadler/tf_keras_playground/blob/master/tf_keras_playground/models/resnet.py#L64 (0 means constant depth, 1 means wrapper, and 2 means layer stochastic depth)
Let me know which one you think would be better suited. Note that both have the "issue" that dropped layers are still updated with historical gradients when momentum is used. But my benchmark used NAG momentum like the original paper, and the regularizing effect remains, just without the performance gains of completely skipping the layers in the forward pass (looking at the original PyTorch implementation, I don't think this is achievable in TensorFlow without writing a custom training loop). This behaviour is in line with how the closely related Dropout behaves in TensorFlow, so I think it should be fine
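For reference, the core behaviour both variants implement can be sketched in plain NumPy (function name and signature are hypothetical here; the actual implementations linked above are Keras layers operating on tensors):

```python
import numpy as np

def stochastic_depth_branch(branch, survival_prob=0.8, training=True, rng=None):
    """Sketch of stochastic depth applied to a residual branch.

    During training, the branch is kept with probability `survival_prob`
    and zeroed out otherwise (one Bernoulli draw for the whole batch).
    At inference, the branch is deterministically scaled by
    `survival_prob`, the expected value of the training-time behaviour.
    """
    rng = rng or np.random.default_rng()
    if training:
        keep = rng.random() < survival_prob  # single coin flip per call
        return branch if keep else np.zeros_like(branch)
    return survival_prob * branch

# In a residual block this would be used as:
#   output = shortcut + stochastic_depth_branch(residual, 0.8, training)
```

The wrapper variant folds the add into the same object, while the layer variant leaves the add to a regular addition layer; the scaling/dropping logic is the same either way.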
Nice survey! Vote for Layer subclass.
Okay - I guess Layer it is
Shouldn't take too long to finish it up
Describe the feature and the current behavior/state. Stochastic depth is a regularization technique to train very deep residual networks (the authors, for example, train ResNets of up to 1202 layers). In particular, it allows the training of, on average, shallower networks that retain their full depth at inference time.
Previous attempts (https://github.com/tensorflow/addons/issues/626) seem to have been unnecessarily convoluted. This implementation would be a simple layer, attached to the end of a residual branch (as suggested here: https://github.com/tensorflow/tensorflow/issues/8817#issuecomment-290788333, which is also the way it is described in the Shake-Drop paper: https://arxiv.org/abs/1802.02375)
Notably, this means that, if momentum is being used, the dropped layers will still receive small updates due to their historical gradients (similar to Dropout).
Relevant information
Which API type would this fall under (layer, metric, optimizer, etc.)? Layer.
Who will benefit from this feature? Anybody who wants to use stochastic depth to train deeper ResNets, or who wants to recreate the EfficientNet architecture (https://arxiv.org/abs/1905.11946). Anyone who wants to add Shake-Drop (https://arxiv.org/abs/1802.02375) to their network can use this as a base.
Any other info. It is important to note that stochastic depth is not the same as Dropout with noise_shape=(1, 1, 1), as suggested in the TensorFlow EfficientNet implementation (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/applications/efficientnet.py#L511). With Dropout, the branch is simply kept as is at inference time; with stochastic depth, the branch is re-scaled based on its survival probability before being merged with the main network.
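To make the inference-time difference concrete, here is a small numeric sketch (values chosen for illustration; tf.keras Dropout uses inverted dropout, so its inference pass is the identity):

```python
import numpy as np

branch = np.array([2.0, 4.0])  # output of a residual branch
survival_prob = 0.8

# Dropout (inverted, as in tf.keras): at inference the branch
# passes through unchanged.
dropout_inference = branch

# Stochastic depth (original formulation): at inference the branch
# is rescaled by its survival probability before the residual add.
stoch_depth_inference = survival_prob * branch

print(dropout_inference)      # [2. 4.]
print(stoch_depth_inference)  # [1.6 3.2]
```

So a layer that only mimics Dropout with noise_shape=(1, 1, 1) would miss the rescaling step at inference.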