Open nankeqin opened 2 years ago
How should the initial value of mu in SMU be determined? Does it have to be 1000000?
It was set according to the original paper; it is easy to verify that SMU with mu = 1000000.0 and SMU-1 with mu = 4.352665993287951e-9 are basically the same.
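For readers following the thread, here is a minimal PyTorch sketch of both activations, following the smooth-maximum formulas given in the paper, plus a quick numerical check of the claim above. The class names and default values are illustrative, not this repository's actual API.

```python
# Minimal sketch of SMU and SMU-1 based on the formulas in the paper.
# Class names and defaults are illustrative, not this repository's actual API.
import torch
import torch.nn as nn


class SMU(nn.Module):
    # Smooth approximation of max(x, alpha*x), using |x| ~= x * erf(mu * x).
    def __init__(self, alpha=0.25, mu=1.0):
        super().__init__()
        self.alpha = alpha                        # fixed slope on the negative axis
        self.mu = nn.Parameter(torch.tensor(mu))  # trainable smoothing parameter

    def forward(self, x):
        return ((1 + self.alpha) * x
                + (1 - self.alpha) * x * torch.erf(self.mu * (1 - self.alpha) * x)) / 2


class SMU1(nn.Module):
    # Smooth approximation of max(x, alpha*x), using |x| ~= sqrt(x**2 + mu**2).
    def __init__(self, alpha=0.25, mu=4.352665993287951e-9):
        super().__init__()
        self.alpha = alpha
        self.mu = nn.Parameter(torch.tensor(mu))

    def forward(self, x):
        return ((1 + self.alpha) * x
                + torch.sqrt(((1 - self.alpha) * x) ** 2 + self.mu ** 2)) / 2


# Quick check of the claim above: both settings collapse to Leaky ReLU with slope 0.25.
x = torch.linspace(-3, 3, 7)
print(SMU(alpha=0.25, mu=1e6)(x))                    # ~[-0.75, -0.50, -0.25, 0., 1., 2., 3.]
print(SMU1(alpha=0.25, mu=4.352665993287951e-9)(x))  # ~[-0.75, -0.50, -0.25, 0., 1., 2., 3.]
```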
No, for SMU it does not have to be 1000000.0; you can initialize it at 1.0 as well, and it works remarkably well compared to other widely used activations. SMU-1 is a computationally cheaper activation function that approximates Leaky ReLU (or ReLU, depending on the value of alpha) from above, so part of the curve produces positive outputs on the negative axis. If you choose a large value of mu, this creates the same problem that Softplus has (see https://stats.stackexchange.com/questions/146057/what-are-the-benefits-of-using-relu-over-softplus-as-activation-functions). So the paper suggests a small initialization for SMU-1 so that it approximates Leaky ReLU as closely as possible; irrespective of the value of mu, SMU-1 is still a smooth (differentiable) function on the whole real line, and smoothness is important during backpropagation. The main idea of the paper is that ReLU and Leaky ReLU are not differentiable at the origin, so what is the effect if these non-differentiable functions are replaced by a curve that is differentiable on the whole real line?
From my experience, SMU is much better than SMU-1, although SMU has a slightly higher training time than SMU-1.
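To illustrate the Softplus-like issue mentioned above, here is a quick numerical check using the illustrative SMU1 class from the earlier sketch: with a large mu, SMU-1 sits well above Leaky ReLU on the negative axis, while a tiny mu keeps it essentially equal to Leaky ReLU.

```python
# Assumes the illustrative SMU1 class from the sketch earlier in this thread.
import torch

x = torch.tensor([-5.0, -1.0, 0.0, 1.0])
print(SMU1(alpha=0.25, mu=10.0)(x))                  # ~[ 2.21, 4.39, 5.00, 5.64] -> positive even for x < 0
print(SMU1(alpha=0.25, mu=4.352665993287951e-9)(x))  # ~[-1.25, -0.25, 0.00, 1.00] -> Leaky ReLU with slope 0.25
```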
Thanks a lot!
@koushik313 I tried initializing mu in SMU to 1000000.0 and used SMU to replace the activation function of the Conv module in yolov5, but the object loss becomes 'nan' from the first epoch of training. So I think what @koushik313 said is correct: in practical use, it is best to initialize mu in SMU to a smaller value (such as 1.0).
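For anyone wanting to reproduce this kind of experiment, one version-agnostic way to do the swap described above is to walk the built model and replace yolov5's default activation (SiLU in recent releases) with an SMU instance. This is a hypothetical sketch, not yolov5's official extension point; adjust the isinstance check if your version uses a different default activation.

```python
# Hypothetical sketch: swap every SiLU in a built yolov5 model for the illustrative
# SMU module from the sketch earlier in this thread.
import torch.nn as nn


def replace_activation(module, act_factory):
    # Recursively walk the module tree and replace SiLU children in place.
    for name, child in module.named_children():
        if isinstance(child, nn.SiLU):
            setattr(module, name, act_factory())
        else:
            replace_activation(child, act_factory)


# usage: replace_activation(model, lambda: SMU(alpha=0.25, mu=1.0))
```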
@Tears1997 Thanks for the information you shared. I would also recommend trying to initialize alpha at 0.01 and mu at 2.0 or 2.5 (with mu as a trainable parameter) for SMU and then running your experiments; from my experience, these initializations give better results. Also, please let me know whether you got nan values when you initialized mu at 1.0 for SMU.
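With the illustrative SMU module from the earlier sketch, the suggested setting looks like this: alpha kept as a fixed hyperparameter, mu registered as a trainable parameter so the optimizer updates it during training.

```python
# Suggested initialization from the comment above, using the illustrative SMU module
# from the sketch earlier in this thread: alpha fixed at 0.01, mu trainable at 2.5.
act = SMU(alpha=0.01, mu=2.5)
print(dict(act.named_parameters()))  # shows {'mu': ...}, i.e. mu is a registered parameter
# Because mu is an nn.Parameter, any optimizer built from model.parameters() will train it,
# while alpha stays fixed.
```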
@koushik313 Thank you for your reply and suggestion! When I initialize mu at 1.0 and alpha at 0.25 for SMU, training proceeds normally (no nan values), but compared with the original model the mAP did not change significantly in either direction (on my own dataset).
@Tears1997 Thank you for your reply. Yes, I agree: for classification problems the functions work well at alpha=0.25, but for object detection you need to choose alpha=0.01. In general, I observe that alpha=0.01 with mu at 2.5 works better than ReLU, Swish, and Mish in most deep learning problems. For object detection I tested on the SSD300 model but did not test on Yolo5. If you have time in the future, please check the Yolo5 model with these alpha and mu values and let me know the results. Thank you.