benjaminsavage closed this pull request 3 weeks ago
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 91.84%. Comparing base (62da72d) to head (efc609a).
My understanding is that it's most critical to closely approximate the sigmoid near x=0. The gradient is maximized there, and this has the biggest impact on the decision boundary / output of the neural network.
It's easy to approximate that region well since it's basically just a line at that point.
I've seen papers that used a far less precise approximation: just three linear segments. A flat one on the left (y=0), a sloped middle segment (y=x+0.5), and a flat one on the right (y=1). They were even able to do training with this.
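For reference, that three-segment approximation (sometimes called a "hard sigmoid") can be sketched in plaintext Python. The breakpoints at x = ±0.5 follow from the y = x + 0.5 middle segment; the function name is mine, not from any of those papers:

```python
import math

def three_segment_sigmoid(x: float) -> float:
    """Three-segment linear sigmoid approximation:
    y = 0 for x < -0.5, y = x + 0.5 in between, y = 1 for x > 0.5."""
    if x < -0.5:
        return 0.0
    if x > 0.5:
        return 1.0
    return x + 0.5

# Compare against the true sigmoid near x = 0, where the fit matters most.
for x in (-1.0, -0.25, 0.0, 0.25, 1.0):
    true = 1.0 / (1.0 + math.exp(-x))
    print(f"x={x:+.2f}  approx={three_segment_sigmoid(x):.3f}  true={true:.3f}")
```

Note the approximation is exact at x = 0 (both give 0.5) and the error grows toward the saturating tails, which is exactly the trade-off described above.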
I'm totally open to exploring a variety of functions here. For now I would like to start by getting just some proof of concept up and running. Then we can add alternative protocols and compare the cost / output of the neural net across them. How does that sound?
Sounds great! I am very keen to explore the impact of the approximate sigmoid on the model accuracy.
Awesome. Let's land this just to get a baseline (some cost, some precision), then we can pretty easily compare it to other approaches, since this is just a standalone function that we can easily swap out with alternatives. If either @danielmasny or @akoshelev wants to approve this PR, let's merge it.
For 8-bit signed values assumed to represent x = -8 to x = 8 (that is, each consecutive 8-bit value represents an x value 1/16th larger than the previous one), one can very closely approximate the sigmoid function $1/(1 + e^{-x})$ with just a piecewise linear function. Simply split the range into 16 segments of 16 consecutive values each (e.g. [0, 16), [16, 32), ...) and approximate each with a line segment.
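Here's a plaintext Python sketch of that idea. The per-segment coefficients used in the actual protocol are the ones developed in the spreadsheet linked below; this sketch just uses the chord of the true sigmoid over each unit-width segment, so individual rounded values may differ slightly from the protocol's truth table:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def piecewise_sigmoid_8bit(i: int) -> int:
    """Plaintext model of a 16-segment piecewise linear sigmoid.

    `i` is a signed 8-bit input representing x = i / 16 on [-8, 8).
    The output is an unsigned 8-bit numerator over a denominator of 256.
    Each segment here is the chord of the true sigmoid over [lo, lo + 1).
    """
    assert -128 <= i < 128
    x = i / 16.0
    lo = math.floor(x)                 # segment start: one of -8, -7, ..., 7
    y0, y1 = sigmoid(lo), sigmoid(lo + 1)
    y = y0 + (x - lo) * (y1 - y0)      # linear interpolation within the segment
    return min(255, round(y * 256))

# Spot-check: sigmoid(0) = 0.5 should map to 128/256.
print(piecewise_sigmoid_8bit(0))  # -> 128
```

With unit-width segments the chord error of the sigmoid stays small everywhere (well under 2% even before quantization), which matches the "quite close" red-vs-blue graph described below.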
Best of all, this function requires only 31 bits of communication per helper, less than a single multiplication in our Fp32BitField. The round depth is only 3, since the computation is mostly parallelised. It's also vectorised, so we can pack many parallel computations together.
Here's a graph showing the true sigmoid function in blue, and this approximation in red. As you can see, they are quite close.
The 8-bit output should be interpreted as one of 256 values lying on the interval [0, 1): treat it as an unsigned integer giving the numerator of a fraction whose denominator is 256.
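Concretely, decoding the output is just a division by 256 (the helper name here is hypothetical, not part of the PR):

```python
def decode_output(raw: int) -> float:
    """Map the protocol's unsigned 8-bit output to a value in [0, 1)."""
    assert 0 <= raw < 256
    return raw / 256.0

print(decode_output(128))  # -> 0.5
```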
Here's a spreadsheet where I developed this algorithm, and it includes the full truth table checking the values generated: https://docs.google.com/spreadsheets/d/1udT5PyEcFQcbBrokwmJj8plTMOGZWg2ae-ei8GIWa7Y/edit?usp=sharing