Open Ebanflo42 opened 5 months ago
Also would make sense to add gradient clipping as a part of the solution to this issue.
Scratch that: gradient clipping should be an extra feature in the core autodiff engine. I don't know why I was thinking these things are related.
We need to carefully design the API such that the user has access to both an executable that returns the desired `diff` calls (for training) and an executable that returns everything except that (for testing).

This is part of a larger array of issues that will emerge from the need to embed contexts in other contexts (for example, separating the optimizer step, or designing recurrent architectures). In this case it might make sense to allow the user to design a forward pass context which doesn't take labels or output gradients, then allow them to clone that context and recover all desired node identifiers in order to create another context that takes both labels and inputs and outputs predictions, loss, and gradients. Then both executables can be used separately.
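
To make the clone-and-extend idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the toy `Context` class and its `parameter`, `mul`, `sub`, `clone`, `diff`, and `compile` methods, as well as the `names` lookup, are stand-ins invented for illustration and are not the library's actual API. It only shows the shape of the workflow: build a forward-only context, clone it, recover node identifiers by name, then add labels, a loss, and a gradient to the clone.

```python
import copy


class Context:
    """Toy scalar computation graph (hypothetical, for illustration only)."""

    def __init__(self):
        self.nodes = []   # node id -> (op, args); args always refer to earlier ids
        self.names = {}   # name -> node id, so ids can be recovered after cloning

    def _add(self, op, *args, name=None):
        self.nodes.append((op, args))
        node_id = len(self.nodes) - 1
        if name is not None:
            self.names[name] = node_id
        return node_id

    def parameter(self, name):
        return self._add("param", name, name=name)

    def const(self, value):
        return self._add("const", value)

    def add(self, a, b, name=None):
        return self._add("add", a, b, name=name)

    def sub(self, a, b, name=None):
        return self._add("sub", a, b, name=name)

    def mul(self, a, b, name=None):
        return self._add("mul", a, b, name=name)

    def clone(self):
        # Deep copy, so extending the clone never touches the original graph.
        return copy.deepcopy(self)

    def diff(self, out, wrt):
        """Append nodes computing d(out)/d(wrt) and return the new node id."""
        if out == wrt:
            return self.const(1.0)
        op, args = self.nodes[out]
        if op in ("param", "const"):
            return self.const(0.0)
        a, b = args
        da, db = self.diff(a, wrt), self.diff(b, wrt)
        if op == "add":
            return self.add(da, db)
        if op == "sub":
            return self.sub(da, db)
        # op == "mul": product rule
        return self.add(self.mul(da, b), self.mul(a, db))

    def compile(self, outputs):
        """Return an 'executable' that evaluates the requested output nodes."""
        binary = {"add": lambda p, q: p + q,
                  "sub": lambda p, q: p - q,
                  "mul": lambda p, q: p * q}

        def run(**inputs):
            vals = []
            for op, args in self.nodes:
                if op == "param":
                    vals.append(inputs[args[0]])
                elif op == "const":
                    vals.append(args[0])
                else:
                    vals.append(binary[op](vals[args[0]], vals[args[1]]))
            return tuple(vals[o] for o in outputs)

        return run


# Forward-pass context: takes no labels and outputs no gradients.
fwd = Context()
x = fwd.parameter("x")
w = fwd.parameter("w")
fwd.mul(w, x, name="pred")
predict = fwd.compile([fwd.names["pred"]])

# Training context: clone, recover node ids by name, add label, loss, gradient.
train = fwd.clone()
pred_t, w_t = train.names["pred"], train.names["w"]
y = train.parameter("y")
err = train.sub(pred_t, y)
loss = train.mul(err, err, name="loss")
grad_w = train.diff(loss, w_t)
train_step = train.compile([pred_t, loss, grad_w])

print(predict(x=2.0, w=3.0))             # (6.0,)
print(train_step(x=2.0, w=3.0, y=5.0))   # (6.0, 1.0, 4.0)
```

In this sketch the clone is a deep copy, so the forward context stays untouched when the training context grows. Whether a real implementation should copy the graph or share nodes between contexts is a separate design question that this example does not settle.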