Simplify the reverse mode by having just one DiffMode for it

PetroZarytskyi commented 3 days ago

Currently, on master, we have two reverse diff modes: DiffMode::reverse for gradients and DiffMode::experimental_pullback for pullbacks. In this PR, they are essentially merged into DiffMode::reverse. This is achieved by having the gradient overload call a pullback instead of a gradient function. Let's consider an example:

// Original code:
double f(double a, double b) {
    ...
}

int main() {
    auto df = clad::gradient(f);
    ...
}

On master:

// Generated derivative code:
void f_grad(double a, double b, double *_d_a, double *_d_b) {
    ...
}

// This overload is placed in the clad::gradient call.
// It only casts the type-erased adjoints and forwards to the gradient.
void f_grad(double a, double b, void *_temp_d_a, void *_temp_d_b) {
    double *_d_a = (double *)_temp_d_a;
    double *_d_b = (double *)_temp_d_b;
    f_grad(a, b, _d_a, _d_b);
}

In this PR:

// Generated derivative code:
void f_pullback(double a, double b, double _d_y, double *_d_a, double *_d_b) {
    ...
}

// This is placed in the clad::gradient call.
// It seeds the pullback with _d_y = 1, which yields the gradient.
void f_grad(double a, double b, void *_temp_d_a, void *_temp_d_b) {
    double *_d_a = (double *)_temp_d_a;
    double *_d_b = (double *)_temp_d_b;
    f_pullback(a, b, 1, _d_a, _d_b);
}
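
To make the scheme above concrete, here is a hand-written sketch of what the pair could look like for one specific function. The body of f (a * b + a) and the helper check_f_grad are assumptions chosen for illustration; the derivative code is written by hand and is not actual clad output.

```cpp
#include <cassert>

// Hypothetical example function; the PR elides the real body of f.
double f(double a, double b) { return a * b + a; }

// Hand-written stand-in for the generated pullback.
// _d_y is the seed, i.e. the adjoint of the return value.
void f_pullback(double a, double b, double _d_y, double *_d_a, double *_d_b) {
    *_d_a += _d_y * (b + 1.0);  // df/da = b + 1
    *_d_b += _d_y * a;          // df/db = a
}

// Gradient wrapper with type-erased adjoints: cast, then seed with 1.
void f_grad(double a, double b, void *_temp_d_a, void *_temp_d_b) {
    double *_d_a = static_cast<double *>(_temp_d_a);
    double *_d_b = static_cast<double *>(_temp_d_b);
    f_pullback(a, b, 1.0, _d_a, _d_b);
}

// Hypothetical usage helper: adjoints must be zero-initialized by the caller.
void check_f_grad() {
    double da = 0.0, db = 0.0;
    f_grad(3.0, 4.0, &da, &db);
    assert(da == 5.0);  // b + 1
    assert(db == 3.0);  // a
}
```

Seeding with 1 means the pullback computes exactly the partial derivatives of f, so no separate gradient body is needed.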

Note: To make this system work with error estimation, I had to enable overloads there. To do that, I had to change the type of _final_error parameters from double& to double*.
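
The reason for the pointer type can be sketched as follows: a void* slot in the overload can be cast back to double*, whereas a double& parameter cannot be type-erased the same way. The names f_pullback_err, f_grad_err, check_f_grad_err and the toy error model below are hypothetical and only illustrate the shape of the signature; they are not clad's error estimation code.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical example function.
double f(double a, double b) { return a * b; }

// Hand-written pullback that also accumulates a toy error estimate.
// _final_error is a pointer so it can pass through a void* overload.
void f_pullback_err(double a, double b, double _d_y, double *_d_a,
                    double *_d_b, double *_final_error) {
    *_d_a += _d_y * b;
    *_d_b += _d_y * a;
    const double eps = std::numeric_limits<double>::epsilon();
    // Toy model (an assumption): |adjoint * value * eps| per input.
    *_final_error += std::fabs(*_d_a * a * eps) + std::fabs(*_d_b * b * eps);
}

// Overload with type-erased parameters, including the error accumulator.
void f_grad_err(double a, double b, void *_temp_d_a, void *_temp_d_b,
                void *_temp_final_error) {
    double *_d_a = static_cast<double *>(_temp_d_a);
    double *_d_b = static_cast<double *>(_temp_d_b);
    double *_final_error = static_cast<double *>(_temp_final_error);
    f_pullback_err(a, b, 1.0, _d_a, _d_b, _final_error);
}

// Hypothetical usage helper.
void check_f_grad_err() {
    double da = 0.0, db = 0.0, err = 0.0;
    f_grad_err(3.0, 4.0, &da, &db, &err);
    assert(da == 4.0 && db == 3.0);
    assert(err > 0.0);
}
```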

Advantages:

1) On master, we have 11 DiffModes, many of which use the same visitors. Having a unified reverse DiffMode makes the system easier to understand.
2) In RMV (ReverseModeVisitor), Derive and DerivePullback do almost the same job. This PR removes DerivePullback completely.
3) With this PR, clad no longer uses overloads for the reverse mode: there is just one gradient function and one pullback function. This is a great step towards supporting C, which does not have overloads.
4) Differentiating recursive functions used to generate both the gradient and the pullback. Now only the pullback is generated.

Disadvantages:

1) Gradient forward declarations are now only supported with void* adjoint parameter types. For example, for a function double f(double a, double b), it no longer makes sense to forward declare void f_grad(double a, double b, double *_d_a, double *_d_b). The options void f_grad(double a, double b, void *_d_a, void *_d_b) and void f_pullback(double a, double b, double _d_y, void *_d_a, void *_d_b) still work. However, forward declarations don't seem to be widely used: when we changed all array_ref adjoint types to pointers in the gradient signature, this didn't break a single ROOT test. The main way to execute derivatives (with CladFunction) works as before.
2) All differentiated functions now have the pullback _d_y parameter. This may make the derivative code harder to understand. Moreover, whenever the function has a parameter named y, the pullback parameter is renamed to _d_y0 to avoid a name collision with the adjoint of y, which could make the code even more confusing. We can fix this last problem by giving the pullback parameter a different name.
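
The name collision can be illustrated with a small hand-written sketch. The function g, its body and the helper check_g_pullback are assumptions for illustration; only the _d_y0 renaming follows the description in this PR.

```cpp
#include <cassert>

// Hypothetical function that happens to have a parameter named y.
double g(double x, double y) { return x * y; }

// Hand-written stand-in for the generated pullback.
// _d_y is already taken by the adjoint of the parameter y,
// so the seed for the return value is named _d_y0 instead.
void g_pullback(double x, double y, double _d_y0, double *_d_x, double *_d_y) {
    *_d_x += _d_y0 * y;  // dg/dx = y
    *_d_y += _d_y0 * x;  // dg/dy = x
}

// Hypothetical usage helper.
void check_g_pullback() {
    double dx = 0.0, dy = 0.0;
    g_pullback(2.0, 5.0, 1.0, &dx, &dy);
    assert(dx == 5.0);
    assert(dy == 2.0);
}
```

A reader has to notice that _d_y0 is a value (the seed) while _d_y is a pointer (an output), which is the readability concern raised above.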

PetroZarytskyi commented 3 days ago

This is a big change, so it would be great to have different opinions on the PR. We discussed the idea with @vgvassilev. @vaithak I'd love to know your thoughts.

vaithak commented 2 days ago

This looks really good 👍🏼 Thanks, @PetroZarytskyi, for improving this.

vgvassilev / clad

Simplify the reverse mode by having just one DiffMode for it #964