Closed rtfgithub closed 2 years ago

Thanks for your great contribution! In the implementation of PoolFormerBlock, there is a layer_scale after the token_mixer. What is the impact of this operation?

Hi @rtfgithub ,
Like stochastic depth, LayerScale helps with training deep models. For more details, please refer to the paper that proposes this operator.

Thanks for your reply! I'd like to ask another question. As mentioned in your paper, the training strategy of PoolFormer follows the DeiT method. Did you add a distillation token to the model and train with the hard-label distillation loss?

Hi @rtfgithub ,
We follow the training hyper-parameters of DeiT, but we do not use its distillation methods.

Thank you for your reply! Good luck!

You are welcome :)
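For readers landing here, the LayerScale idea discussed above can be sketched in a few lines: a learnable per-channel scale, initialized to a small constant, multiplies the residual branch's output so each block starts close to an identity mapping. This is a minimal NumPy illustration under those assumptions; the function and parameter names here are hypothetical, not the repo's actual API.

```python
import numpy as np

def make_layer_scale(dim, init_value=1e-5):
    # One learnable scalar per channel, initialized to a small value
    # (illustrative default; the actual init value is a hyper-parameter).
    return np.full(dim, init_value)

def block_with_layer_scale(x, token_mixer, layer_scale):
    # The residual branch is scaled channel-wise before being added back,
    # damping the branch's contribution early in training.
    return x + layer_scale * token_mixer(x)

# Toy usage: identity token mixer on a (tokens, channels) input.
x = np.ones((4, 8))
scale = make_layer_scale(8)
out = block_with_layer_scale(x, lambda t: t, scale)
# out stays close to x because the scale starts near zero.
```

In a real model the scale would be a trainable parameter (e.g. an `nn.Parameter` in PyTorch) so the network can learn how strongly each block's branch contributes.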