huggingface / candle

Minimalist ML framework for Rust
Apache License 2.0

ReLU implementation #1394

Open viktorlott opened 11 months ago

viktorlott commented 11 months ago

Hey, totally random but kind of interesting. Hopefully not something people already know.

One interesting property of the IEEE 754 floating-point specification is that x/0 where x!=0 is equal to infinity (with the sign of x), and x/infinity is equal to zero for finite x. Since 0^x is 0 for x>0, 1 for x=0, and infinity for x<0, one could define a function like x/(1 + 0^x), which basically equals ReLU(x)=max(0, x).
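A minimal sketch of the idea, assuming Rust's `f32::powf` follows the usual IEEE 754 / C99 special cases for `pow(0, x)`. The names `pow_relu` and `max_relu` are hypothetical and not part of candle.

```rust
/// ReLU expressed through IEEE 754 special values: x / (1 + 0^x).
/// `pow_relu` is a hypothetical name used only for this illustration.
pub fn pow_relu(x: f32) -> f32 {
    // 0^x is 0 for x > 0, 1 for x == 0, and +inf for x < 0.
    let gate = 0.0_f32.powf(x);
    // x > 0:  x / 1   = x
    // x == 0: 0 / 2   = 0
    // x < 0:  x / inf = -0.0 (negative zero, which compares equal to 0.0)
    x / (1.0 + gate)
}

/// Reference ReLU using the standard library's max.
pub fn max_relu(x: f32) -> f32 {
    x.max(0.0)
}
```

Two edge cases differ from `max`: negative inputs produce `-0.0` rather than `+0.0`, and `pow_relu(f32::NEG_INFINITY)` evaluates `-inf / inf`, which is NaN, whereas `max_relu` returns 0.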

The interesting part is what happens when you differentiate these two. For ReLU, we end up with a piecewise function (most people will know this): the derivative is 1 for x>0, 0 for x<0, and undefined at x=0.

Differentiating x/(1 + 0^x), on the other hand, looks undefined at first glance, but one could intuit that it should be 1/(1 + 0^x) based on the ReLU derivative. The difference between the two derivatives is that 1/(1 + 0^x) is defined at x=0, where it equals 0.5, which is kind of weird. I wonder how L1 regularization would work in that case, if I'm thinking about it correctly. (Just realized that x/(x + 0^x) makes x=0 = 1).
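To make the comparison concrete, here is a small sketch of the intuited derivative 1/(1 + 0^x) next to a conventional step-function gradient. Both function names are hypothetical. Returning 0 at x=0 in `step_grad` is an assumption made for illustration, one common convention rather than the only one.

```rust
/// Intuited derivative from the post: 1 / (1 + 0^x).
/// Hypothetical name; evaluates to 1 for x > 0, 0.5 at x == 0, 0 for x < 0.
pub fn pow_relu_grad(x: f32) -> f32 {
    1.0 / (1.0 + 0.0_f32.powf(x))
}

/// Conventional ReLU gradient, choosing 0 at x == 0 (an assumed convention).
pub fn step_grad(x: f32) -> f32 {
    if x > 0.0 { 1.0 } else { 0.0 }
}
```

The two agree everywhere except at exactly x=0, where the first yields 0.5 and the second yields whichever subgradient the implementation picks.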

My question is if this technique could be "faster" and more energy efficient (on certain architectures) than running regular SIMD instructions on the CPU, given that this technique would be SIMD executed on efficient logical FPUs?

viktorlott commented 11 months ago

I'm guessing the answer will be no, given that there are a lot of smart people working on those acceleration problems, so someone would probably have made a note of it already.

robertknight commented 11 months ago

> My question is if this technique could be "faster" and more energy efficient (on certain architectures) than running regular SIMD instructions on the CPU, given that this technique would be SIMD executed on efficient logical FPUs?

The architectures that I'm aware of (e.g. Intel, ARM) have fast SIMD instructions for evaluating max(x, y) on floats, and these are much faster than division. Compare the latency and throughput of _mm256_max_ps vs _mm256_div_ps in the Intel Intrinsics Guide, for example.
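As a rough illustration of the max-based approach, the sketch below applies ReLU to eight floats with a single `_mm256_max_ps` when AVX is available, and falls back to scalar code otherwise. The function names (`relu8`, `relu8_avx`, `relu8_scalar`) are hypothetical, and this is not how candle implements ReLU; it only shows that the whole operation maps onto one SIMD instruction per vector.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// ReLU over 8 lanes using one AVX max instruction (hypothetical helper).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn relu8_avx(input: &[f32; 8]) -> [f32; 8] {
    // Unaligned load of 8 floats, lane-wise max against zero, unaligned store.
    let x = _mm256_loadu_ps(input.as_ptr());
    let y = _mm256_max_ps(x, _mm256_setzero_ps());
    let mut out = [0.0f32; 8];
    _mm256_storeu_ps(out.as_mut_ptr(), y);
    out
}

/// Portable scalar fallback.
pub fn relu8_scalar(input: &[f32; 8]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    for (o, &x) in out.iter_mut().zip(input.iter()) {
        *o = x.max(0.0);
    }
    out
}

/// Picks the AVX path at runtime when the CPU supports it.
pub fn relu8(input: &[f32; 8]) -> [f32; 8] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // Safe to call: AVX support was just checked at runtime.
            return unsafe { relu8_avx(input) };
        }
    }
    relu8_scalar(input)
}
```

The division-based formulation would need at least a power evaluation, an add, and a divide per element, so the instruction count alone already favors the max version before latency is considered.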