pytorch / opacus

Training PyTorch models with differential privacy
https://opacus.ai
Apache License 2.0

functorch-based per sample gradients don't match with vanilla #511

Open ffuuugor opened 1 year ago

ffuuugor commented 1 year ago

See #510

One of the new tests introduced in PR #510 fails: when running a module with two different custom implementations of a "linear-like" layer, the per sample gradients computed by functorch-based hooks don't match the per sample gradients obtained by microbatching.
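For context, here is a minimal, self-contained sketch (not the actual test from PR #510) of the two approaches being compared: functorch-style per sample gradients via `torch.func` (`vmap` + `grad` + `functional_call`) versus a microbatching loop, for an einsum-based "linear-like" layer. The names `LinearLike`, `per_sample_grads_functorch`, and `per_sample_grads_microbatch` are illustrative, not Opacus code; in this plain setup the two results should agree, which is what the failing test expects of the functorch-based hooks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call, grad, vmap


class LinearLike(nn.Module):
    """A custom "linear-like" layer implemented with einsum instead of F.linear."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        return torch.einsum("ni,oi->no", x, self.weight) + self.bias


def per_sample_grads_functorch(module, x, y):
    # torch.func route: differentiate a single-sample loss, then vmap over the batch.
    params = {k: v.detach() for k, v in module.named_parameters()}

    def compute_loss(params, sample, target):
        out = functional_call(module, params, (sample.unsqueeze(0),))
        return F.mse_loss(out, target.unsqueeze(0))

    return vmap(grad(compute_loss), in_dims=(None, 0, 0))(params, x, y)


def per_sample_grads_microbatch(module, x, y):
    # Reference implementation: one backward pass per sample (batch size 1).
    grads = {name: [] for name, _ in module.named_parameters()}
    for i in range(x.shape[0]):
        module.zero_grad()
        F.mse_loss(module(x[i : i + 1]), y[i : i + 1]).backward()
        for name, p in module.named_parameters():
            grads[name].append(p.grad.detach().clone())
    return {name: torch.stack(g) for name, g in grads.items()}


if __name__ == "__main__":
    torch.manual_seed(0)
    module = LinearLike(4, 2)
    x, y = torch.randn(8, 4), torch.randn(8, 2)
    ft = per_sample_grads_functorch(module, x, y)
    mb = per_sample_grads_microbatch(module, x, y)
    for name in mb:
        print(name, torch.allclose(ft[name], mb[name], atol=1e-6))
```

The issue is about the analogous comparison inside Opacus's grad sample machinery, where the functorch-based hooks produce mismatching values for such custom layers.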

Interesting observations:

I've verified the test itself and believe it is working correctly, so the mismatch is likely a genuine problem.