luyug / GradCache

Run Effective Large Batch Contrastive Learning Beyond GPU/TPU Memory Constraint
Apache License 2.0

Role of dot product operation in forward-backward pass #29

Open ahmed-tabib opened 3 months ago

ahmed-tabib commented 3 months ago

Hello,

When reading the implementation, I noticed that in the forward-backward pass you compute a dot product before running the backward pass, specifically in the following line: https://github.com/luyug/GradCache/blob/0c33638cb27c2519ad09c476824d550589a8ec38/src/grad_cache/grad_cache.py#L241

I don't understand this. When reading the paper, I imagined that you would use the cached gradients directly, something like:

reps.backward(gradient=gradients)
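
To make the comparison concrete, here is a minimal, self-contained sketch of the two variants as I understand them. This is not the repository's code; the "encoder", shapes, and cached gradients are made up purely for illustration:

```python
import torch

# Stand-ins for the real pipeline: `x` plays the role of encoder inputs/parameters,
# `reps` the representations recomputed with grad enabled, and `cached_grads` the
# gradient of the contrastive loss w.r.t. the representations, cached earlier.
x = torch.randn(4, 3, requires_grad=True)
reps = x * 2.0                       # stand-in for an encoder forward pass
cached_grads = torch.randn(4, 3)     # stand-in for the cached d(loss)/d(reps)

# (a) "Standard" way: feed the cached gradient directly into backward().
reps.backward(gradient=cached_grads, retain_graph=True)
grad_a = x.grad.clone()
x.grad = None

# (b) Dot-product surrogate: build a scalar whose gradient w.r.t. reps equals
# the cached gradient, then call .backward() on it. Since cached_grads is a
# constant, d/d(reps) [dot(reps, cached_grads)] = cached_grads, so autograd
# propagates exactly the same vector back through the encoder.
surrogate = torch.dot(reps.flatten(), cached_grads.flatten())
surrogate.backward()
grad_b = x.grad.clone()

# Both paths leave the same gradient on the encoder side (here, on x).
print(torch.allclose(grad_a, grad_b))  # True
```

As far as I can tell, both paths propagate the same gradient, which is why the extra dot product confuses me.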

How exactly does the "surrogate" work to utilise the cached gradient, and why wouldn't the "standard" way of doing it work? Thanks.