@hwijeen @sangkeun00 I'll take care of this!
@YoungseogChung Another option you can consider is deleting `param.grad` as soon as it gets populated in (probably) a backward hook, especially if setting `requires_grad=False` changes the hook behavior.
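A minimal sketch of this option, assuming PyTorch >= 2.1: instead of a module-level backward hook, a per-tensor `register_post_accumulate_grad_hook` fires right after each parameter's `.grad` is accumulated, which is a convenient place to drop the buffer (after whatever logging has consumed it).

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 3)

def free_grad(param: torch.Tensor) -> None:
    # Called right after autograd accumulates into param.grad;
    # dropping the reference lets the buffer be freed immediately.
    param.grad = None

for p in model.parameters():
    p.register_post_accumulate_grad_hook(free_grad)

x = torch.randn(8, 4)
model(x).sum().backward()
assert all(p.grad is None for p in model.parameters())
```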
I actually tried the `del module.weight.grad` / `del module.bias.grad` idea in a recent commit.
Code: https://github.com/sangkeun00/analog/blob/main/analog/logging.py#L134-L136
However, it doesn't seem to reduce the GPU memory usage. If @YoungseogChung can debug the issue, that would be great! Also, feel free to propose a different strategy to reduce the GPU memory usage.
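For debugging, one way to check whether freeing `.grad` actually returns memory is to compare `torch.cuda.memory_allocated()` before and after freeing (a sketch assuming a CUDA device; note that PyTorch's caching allocator means `nvidia-smi` won't show the drop even when the allocator has reclaimed the buffers):

```python
import torch
import torch.nn as nn

device = "cuda"
model = nn.Linear(4096, 4096, device=device)
x = torch.randn(64, 4096, device=device)

model(x).sum().backward()
torch.cuda.synchronize()
print("allocated after backward:", torch.cuda.memory_allocated())

for p in model.parameters():
    p.grad = None  # same effect as `del p.grad` for releasing the buffer
torch.cuda.synchronize()
print("allocated after freeing grads:", torch.cuda.memory_allocated())
```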
Got it, thanks for the heads up, and too bad it's not that simple lol. Yes, I'll tinker with it and search for solutions.
This issue is resolved in commit d2e2fea0becc0632cb03a008fac9454f36b957d7.
In AnaLog, we are mostly interested in per-sample gradients rather than mini-batch gradients. Therefore, we don't necessarily need to populate the gradient of each parameter into `param.grad`. By setting `requires_grad=False` for all parameters, we can potentially save a lot of GPU memory.

**Proposed design**
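A minimal sketch of the idea (hypothetical, not AnaLog's actual implementation): freeze every parameter, then recover per-sample gradients from forward/backward hooks. For `nn.Linear`, the per-sample weight gradient is the outer product of the output gradient and the input:

```python
import torch
import torch.nn as nn

linear = nn.Linear(4, 3)
for p in linear.parameters():
    p.requires_grad_(False)  # .grad buffers are never allocated

saved = {}

def forward_hook(module, inputs, output):
    saved["input"] = inputs[0].detach()

def backward_hook(module, grad_input, grad_output):
    # Per-sample weight gradient: outer product per batch element.
    saved["per_sample_grad"] = torch.einsum(
        "bi,bj->bij", grad_output[0], saved["input"]
    )

linear.register_forward_hook(forward_hook)
linear.register_full_backward_hook(backward_hook)

# The input must require grad so backward still reaches the module's
# hooks even though all parameters are frozen.
x = torch.randn(8, 4, requires_grad=True)
linear(x).sum().backward()

print(saved["per_sample_grad"].shape)  # torch.Size([8, 3, 4])
assert linear.weight.grad is None      # param.grad never populated
```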
> **Caution:** We should make sure that setting `requires_grad=False` doesn't change the behavior of the forward/backward/grad hooks of the module these parameters belong to.
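A quick sanity check of this concern (a sketch, assuming plain PyTorch hook semantics): with all parameters frozen, a full backward hook still fires as long as the module's input requires grad, because the hook triggers when gradients with respect to the inputs are computed. If neither the parameters nor the input require grad, `.backward()` raises instead of silently skipping the hook.

```python
import torch
import torch.nn as nn

m = nn.Linear(2, 2)
for p in m.parameters():
    p.requires_grad_(False)

fired = []
m.register_full_backward_hook(lambda mod, gin, gout: fired.append(True))

x = torch.randn(1, 2, requires_grad=True)
m(x).sum().backward()
print(fired)  # [True]: backward reached the module via the input
```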