benjaminsavage closed this pull request 3 weeks ago
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 91.84%. Comparing base (62da72d) to head (efc609a).
My understanding is that it's most critical to closely approximate the sigmoid near x=0. The gradient is maximized there, and this has the biggest impact on the decision boundary / output of the neural network.
It's easy to approximate that region well since it's basically just a line at that point.
I've seen papers that used a far less precise approximation: just three linear segments. A flat one on the left (y=0), a sloped middle segment (y=x+0.5), and a flat one on the right (y=1). They were even able to do training with this.
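For reference, that three-segment approximation (sometimes called a "hard sigmoid") can be sketched in plaintext Python. The breakpoints at x = ±0.5 follow from the y = x + 0.5 middle segment; the function name is mine, not from any of those papers:

```python
import math

def three_segment_sigmoid(x: float) -> float:
    """Three-segment linear sigmoid approximation:
    y = 0 for x < -0.5, y = x + 0.5 in between, y = 1 for x > 0.5."""
    if x < -0.5:
        return 0.0
    if x > 0.5:
        return 1.0
    return x + 0.5

# Compare against the true sigmoid near x = 0, where the fit matters most.
for x in (-1.0, -0.25, 0.0, 0.25, 1.0):
    true = 1.0 / (1.0 + math.exp(-x))
    print(f"x={x:+.2f}  approx={three_segment_sigmoid(x):.3f}  true={true:.3f}")
```

Note the approximation is exact at x = 0 (both give 0.5) and the error grows toward the saturating tails, which is exactly the trade-off described above.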
I'm totally open to exploring a variety of functions here. For now I would like to start by getting just some proof of concept up and running. Then we can add alternative protocols and compare the cost / output of the neural net across them. How does that sound?
Sounds great! I am very keen to explore the impact of the approximate sigmoid on the model accuracy.
Awesome. Let's land this just to get a baseline (some cost, some precision), then we can pretty easily compare it to other approaches, since this is just a standalone function that we can easily swap out with alternatives. If either @danielmasny or @akoshelev wants to approve this PR, let's merge it.
For 8-bit signed values assumed to represent x = -8 to x = 8 (that is, each consecutive 8-bit value represents an x value 1/16th larger than the previous one), one can very closely approximate the sigmoid function $1/(1 + e^{-x})$ with just a piecewise linear function. Simply split the range into 16 segments of 16 consecutive values each (e.g. [0, 16), [16, 32), ...) and approximate each with a line segment.
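Here's a plaintext Python sketch of that idea. The per-segment coefficients used in the actual protocol are the ones developed in the spreadsheet linked below; this sketch just uses the chord of the true sigmoid over each unit-width segment, so individual rounded values may differ slightly from the protocol's truth table:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def piecewise_sigmoid_8bit(i: int) -> int:
    """Plaintext model of a 16-segment piecewise linear sigmoid.

    `i` is a signed 8-bit input representing x = i / 16 on [-8, 8).
    The output is an unsigned 8-bit numerator over a denominator of 256.
    Each segment here is the chord of the true sigmoid over [lo, lo + 1).
    """
    assert -128 <= i < 128
    x = i / 16.0
    lo = math.floor(x)                 # segment start: one of -8, -7, ..., 7
    y0, y1 = sigmoid(lo), sigmoid(lo + 1)
    y = y0 + (x - lo) * (y1 - y0)      # linear interpolation within the segment
    return min(255, round(y * 256))

# Spot-check: sigmoid(0) = 0.5 should map to 128/256.
print(piecewise_sigmoid_8bit(0))  # -> 128
```

With unit-width segments the chord error of the sigmoid stays small everywhere (well under 2% even before quantization), which matches the "quite close" red-vs-blue graph described below.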
Best of all, this function requires only 31 bits of communication per helper, less than a single multiplication in our Fp32BitField. The round depth is only 3, since the computation is mostly parallelised. It's also vectorised, so we can pack many parallel computations together.
Here's a graph showing the true sigmoid function in blue, and this approximation in red. As you can see, they are quite close.
The 8-bit output should be interpreted as one of 256 values lying on the interval [0, 1): treat it as an unsigned integer giving the numerator of a fraction whose denominator is 256.
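Concretely, decoding the output is just a division by 256 (the helper name here is hypothetical, not part of the PR):

```python
def decode_output(raw: int) -> float:
    """Map the protocol's unsigned 8-bit output to a value in [0, 1)."""
    assert 0 <= raw < 256
    return raw / 256.0

print(decode_output(128))  # -> 0.5
```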
Here's a spreadsheet where I developed this algorithm, and it includes the full truth table checking the values generated: https://docs.google.com/spreadsheets/d/1udT5PyEcFQcbBrokwmJj8plTMOGZWg2ae-ei8GIWa7Y/edit?usp=sharing