ahmed-tabib opened this issue 3 months ago
Hello. When reading the implementation, I noticed that in the forward-backward pass you compute a dot product before running the backward pass, specifically on this line: https://github.com/luyug/GradCache/blob/0c33638cb27c2519ad09c476824d550589a8ec38/src/grad_cache/grad_cache.py#L241 I don't understand this: from the paper, I expected the cached gradients to be used directly, something like:
reps.backward(gradient=gradients)
How exactly does the "surrogate" utilise the cached gradients, and why wouldn't the "standard" way of doing it work? Thanks.
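To make the question concrete, here is a minimal sketch of the two strategies being compared. The "surrogate" here is my reading of the dot-product trick on the linked line (an assumption about its intent, not the repo's exact code); the toy shapes and names are mine.

```python
import torch

# Toy setup: a single linear layer standing in for an encoder.
x = torch.randn(4, 3)
w = torch.randn(3, 2, requires_grad=True)

# Pretend these d(loss)/d(reps) gradients were cached earlier.
cached = torch.randn(4, 2)

# "Standard" way: feed the cached gradients straight into backward().
reps = x @ w
reps.backward(gradient=cached)
grad_standard = w.grad.clone()
w.grad = None

# "Surrogate" way: dot the representations with the cached gradients
# and call backward() on the resulting scalar.
reps = x @ w
surrogate = torch.dot(reps.flatten(), cached.flatten())
surrogate.backward()
grad_surrogate = w.grad.clone()

print(torch.allclose(grad_standard, grad_surrogate))  # → True
```

Both paths produce identical parameter gradients in this toy case, which is exactly why I'm unsure what the surrogate buys over the standard call.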