itayhubara / BinaryNet.tf

BNN implementation in tensorflow

Use of hyperbolic tangent for activation function #11

Open younghwanoh opened 5 years ago

younghwanoh commented 5 years ago

Hi,

Thank you for open-sourcing this great idea! I'm exploring your code and running some experiments with variants. The first thing I looked at is the activation function. As written in BNN_cifar10.py, you use the HardTanh function as the activation. I don't see any description of this in your paper (though I'm not 100% sure), but I found that it significantly affects accuracy: keeping ReLU in the BNN, as in the full-precision counterpart, drops top-1 accuracy by about 10%.
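For reference, here is a rough sketch of the two variants I am comparing, written with plain TensorFlow ops rather than your repository's own binarization helpers; the straight-through trick (x + stop_gradient(...)) is just how I approximate the gradient through the non-differentiable sign, not a quote of BNN_cifar10.py:

```python
import tensorflow as tf

def hard_tanh_sign(x):
    # Scheme in BNN_cifar10.py as I understand it:
    # clip to [-1, 1], then binarize with sign.
    clipped = tf.clip_by_value(x, -1.0, 1.0)
    # Straight-through estimator: forward pass uses sign(clipped),
    # backward pass uses the gradient of the clipped value.
    return clipped + tf.stop_gradient(tf.sign(clipped) - clipped)

def relu_sign(x):
    # My variant: keep ReLU as in the full-precision network,
    # then binarize with sign (positive values become 1, zeros stay 0).
    activated = tf.nn.relu(x)
    return activated + tf.stop_gradient(tf.sign(activated) - activated)
```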

[result plot attached]

Do you have any insight into this? I've heard that a hyperbolic tangent activation can be a bad idea when stacking very deep networks, and I'm a bit concerned about vanishing gradients and so on. If you could share some of your experience, e.g. why you chose this particular hyperbolic-tangent-style function, that would be very nice.

Thanks in advance, OYH

itayhubara commented 5 years ago

HardTanh simply clips the values to be between -1 and 1: everything above 1 is set to 1 and everything below -1 to -1, which helps the initial training phase. Since I used BN to normalize the input, I know that most of the input data falls in that range. After that I used the sign function, which actually binarizes the input. If you use ReLU, you simply assign everything above 0 to 1 and the rest becomes zero. You would probably get good results if you clamp the ReLU values above 1 (same idea as ReLU6, only with 1 instead of 6) and use a round function instead of sign.
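Something along these lines (an untested sketch with plain TensorFlow ops, not code from this repo; the straight-through trick is one common way to handle the non-differentiable round):

```python
import tensorflow as tf

def binary_relu1_round(x):
    # Clamp like ReLU6 but to [0, 1], then binarize with round
    # instead of sign, so activations land in {0, 1}.
    clipped = tf.clip_by_value(x, 0.0, 1.0)
    # Straight-through estimator: round in the forward pass,
    # pass the clipped gradient in the backward pass.
    return clipped + tf.stop_gradient(tf.round(clipped) - clipped)
```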
All the best, Itay