learnables / learn2learn

A PyTorch Library for Meta-learning Research
http://learn2learn.net
MIT License

Does accumulating the adaptation loss then calling backward have the same effect as accumulating gradients? #388

Closed NookLook2014 closed 1 year ago

NookLook2014 commented 1 year ago

With the following code, I tried to replace the example's pattern of first accumulating gradients and then averaging them with first accumulating the loss and then computing gradients once. It also works and is much faster, but I'm not very familiar with meta-learning, so I'm not sure my way has the same effect as the typical way in the example code.

```python
for iteration in range(1, num_iterations + 1):
    opt.zero_grad()
    meta_train_error = 0.0
    meta_train_accuracy = 0.0

    batch_loss = None  # changed: accumulate the loss instead of per-task gradients
    for task in range(meta_batch_size):
        learner = maml.clone()
        batch = train_tasks.sample()
        evaluation_error, evaluation_accuracy = fast_adapt(batch,
                                                           learner, ...,
                                                           adaptation_steps,
                                                           device)
        # changed: accumulate the loss instead of calling backward() per task
        if batch_loss is None:
            batch_loss = evaluation_error
        else:
            batch_loss += evaluation_error.item()
        # evaluation_error.backward()
        meta_train_error += evaluation_error.item()
        meta_train_accuracy += evaluation_accuracy

    # changed: average the accumulated loss and optimize
    batch_loss /= meta_batch_size
    batch_loss.backward()
    # Average the accumulated gradients and optimize
    # for p in maml.parameters():
    #     p.grad.data.mul_(1.0 / meta_batch_size)
    opt.step()
```
seba-1511 commented 1 year ago

Hello @NookLook2014,

Yes, the two are the same, but accumulating gradients is much cheaper than accumulating the loss, because you can free each task's activations right after its backward pass.

One issue in your code: you accumulate evaluation_error.item(), but you should accumulate evaluation_error.
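For reference, a minimal sketch of the corrected loss accumulation (reusing the names from the snippet in the question; the elided fast_adapt arguments are kept as in the original). The key change is keeping evaluation_error as a tensor in both branches, so the averaged loss still carries every task's computation graph:

```python
batch_loss = None
for task in range(meta_batch_size):
    learner = maml.clone()
    batch = train_tasks.sample()
    evaluation_error, evaluation_accuracy = fast_adapt(batch,
                                                       learner, ...,
                                                       adaptation_steps,
                                                       device)
    if batch_loss is None:
        batch_loss = evaluation_error
    else:
        batch_loss = batch_loss + evaluation_error  # tensor, not .item()

batch_loss = batch_loss / meta_batch_size
batch_loss.backward()  # one backward pass through all tasks' graphs
opt.step()
```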

NookLook2014 commented 1 year ago

> Hello @NookLook2014,
>
> Yes, the two are the same, but accumulating gradients is much cheaper than accumulating the loss, because you can free each task's activations right after its backward pass.
>
> One issue in your code: you accumulate evaluation_error.item(), but you should accumulate evaluation_error.

Thanks for the confirmation. As to the issue, when I accumulate evaluation_error itself, I run into CUDA out-of-memory errors even when meta_batch_size is quite small.
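The out-of-memory error is the expected cost of this approach: accumulating evaluation_error as a tensor keeps every task's computation graph, and therefore its activations, alive until the single backward() call, so memory grows with meta_batch_size. Calling backward() once per task, as in the library's original example, lets PyTorch free each task's graph immediately. A minimal sketch of that pattern, again reusing the names from the snippet above:

```python
opt.zero_grad()
for task in range(meta_batch_size):
    learner = maml.clone()
    batch = train_tasks.sample()
    evaluation_error, evaluation_accuracy = fast_adapt(batch,
                                                       learner, ...,
                                                       adaptation_steps,
                                                       device)
    # backward() per task: gradients accumulate into p.grad, and this
    # task's activation graph is freed as soon as the call returns.
    evaluation_error.backward()

# Average the accumulated gradients across tasks, then step.
for p in maml.parameters():
    p.grad.data.mul_(1.0 / meta_batch_size)
opt.step()
```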