huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Run text-classification example with AdaHessian optimizer #19383

Closed iTsingalis closed 2 years ago

iTsingalis commented 2 years ago

System Info

torch 1.12.1+cu113
transformers 4.23.0.dev0

Reproduction

Hi, I want to use the AdaHessian optimizer in the text-classification example run_glue_no_trainer.py. To do so, I have modified the part of the code where the optimizer is selected. That is, instead of this

# Optimizer
# Split weights in two groups, one with weight decay and the other not.
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": args.weight_decay,
    },
    {
        "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=args.learning_rate) 

and this,

for epoch in range(starting_epoch, args.num_train_epochs):
    model.train()
    if args.with_tracking:
        total_loss = 0
    for step, batch in enumerate(train_dataloader):
        # We need to skip steps until we reach the resumed step
        if args.resume_from_checkpoint and epoch == starting_epoch:
            if resume_step is not None and step < resume_step:
                completed_steps += 1
                continue
        outputs = model(**batch)
        loss = outputs.loss
        # We keep track of the loss at each epoch
        if args.with_tracking:
            total_loss += loss.detach().float()
        loss = loss / args.gradient_accumulation_steps # Do we need this? backwards does this calculation...
        accelerator.backward(loss)
        if step % args.gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1:
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)
            completed_steps += 1

I am using this

optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": args.weight_decay,
    },
    {
        "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
    },
]
if args.optimizer == 'AdamW':
    optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=args.learning_rate)
elif args.optimizer == 'AdaHessian':
    optimizer = AdaHessian(optimizer_grouped_parameters, lr=args.learning_rate)

and this

for epoch in range(starting_epoch, args.num_train_epochs):
    model.train()
    if args.with_tracking:
        total_loss = 0
    for step, batch in enumerate(train_dataloader):
        # We need to skip steps until we reach the resumed step
        if args.resume_from_checkpoint and epoch == starting_epoch:
            if resume_step is not None and step < resume_step:
                completed_steps += 1
                continue

        # batch = Variable(**batch, requires_grad=True)
        def closure(backward=True):
            if backward:
                optimizer.zero_grad()

            outputs = model(**batch)
            loss = outputs.loss

            if backward:
                # loss = Variable(loss, requires_grad=True) # Didn't help
                # create_graph=True is necessary for Hessian calculation
                accelerator.backward(loss, create_graph=True)
            return loss

        loss = closure(backward=False)

        # We keep track of the loss at each epoch
        if args.with_tracking:
            total_loss += loss.detach().float()

        if step % args.gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1:
            optimizer.step(closure=closure)
            lr_scheduler.step()
            progress_bar.update(1)
            completed_steps += 1

respectively. The AdaHessian implementation I am using is given here.
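
For reference, here is a minimal standalone sanity check I put together (toy data, no accelerate). The import path is hypothetical and the exact constructor/step contract depends on the particular AdaHessian implementation, so treat it as a sketch of the pattern the optimizer seems to expect: gradients created with create_graph=True before step() is called.

import torch
from AdaHessian import AdaHessian  # hypothetical import path; adjust to the linked implementation

# Tiny synthetic classification problem, only meant to exercise the optimizer
# outside of accelerate and verify the create_graph plumbing.
torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(10, 16), torch.nn.ReLU(), torch.nn.Linear(16, 2))
x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
criterion = torch.nn.CrossEntropyLoss()
optimizer = AdaHessian(model.parameters(), lr=0.1)

for _ in range(50):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    # create_graph=True keeps a grad_fn on the gradients so the optimizer
    # can differentiate them a second time for its Hessian-trace estimate.
    loss.backward(create_graph=True)
    optimizer.step()

print("final loss:", float(loss))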

Expected behavior

Normally, training should just proceed, but

 RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn  

is returned by

  h_zs = torch.autograd.grad(grads, params, grad_outputs=zs, only_inputs=True, retain_graph=i < self.n_samples - 1)

in the optimizer's function

@torch.no_grad()
def set_hessian(self):
    """
    Computes the Hutchinson approximation of the hessian trace and accumulates it for each trainable parameter.
    """

    params = []
    for p in filter(lambda p: p.grad is not None, self.get_params()):
        if self.state[p]["hessian step"] % self.update_each == 0:  # compute the trace only each `update_each` step
            params.append(p)
        self.state[p]["hessian step"] += 1

    if len(params) == 0:
        return

    if self.generator.device != params[0].device:  # hackish way of casting the generator to the right device
        self.generator = torch.Generator(params[0].device).manual_seed(2147483647)

    grads = [p.grad for p in params]

    for i in range(self.n_samples):
        zs = [torch.randint(0, 2, p.size(), generator=self.generator, device=p.device) * 2.0 - 1.0 for p in params]  # Rademacher distribution {-1.0, 1.0}
        h_zs = torch.autograd.grad(grads, params, grad_outputs=zs, only_inputs=True, retain_graph=i < self.n_samples - 1)
        for h_z, z, p in zip(h_zs, zs, params):
            p.hess += h_z * z / self.n_samples  # approximate the expected values of z*(H@z)

The error is raised because the grads built from the params list do not have a grad_fn. I suspect that the problem is related to the input of the optimizer (e.g. the loss in the backward call). Following this post, I tried, for example,

   loss = Variable(loss, requires_grad=True) 

before the backward call in the closure, which makes the script start running, but the accuracy stays around 45% and does not improve. Could you please take a look at the problem and suggest a way to overcome it?
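
To illustrate why I think the Variable workaround only hides the problem, here is a small toy sketch in plain PyTorch (names are illustrative): wrapping the loss in a fresh leaf tensor is effectively a detach, so backward() never reaches the model parameters, whereas the Hutchinson estimator above needs gradients built with create_graph=True that can be differentiated a second time.

import torch

model = torch.nn.Linear(4, 1)
x = torch.randn(8, 4)
params = list(model.parameters())

# (1) The Variable workaround: a fresh leaf tensor has no graph behind it,
# so backward() only assigns a gradient to itself and the parameters never
# receive any signal -- training silently does nothing useful.
loss = model(x).pow(2).mean()
detached = loss.detach().requires_grad_(True)
detached.backward()
print(params[0].grad)  # None: no gradient reached the weights

# (2) What set_hessian needs: gradients created with create_graph=True carry
# a grad_fn and can therefore be differentiated again.
loss = model(x).pow(2).mean()
grads = torch.autograd.grad(loss, params, create_graph=True)
zs = [torch.randint(0, 2, g.shape).float() * 2.0 - 1.0 for g in grads]  # Rademacher probes
h_zs = torch.autograd.grad(grads, params, grad_outputs=zs)
print([h.shape for h in h_zs])  # Hessian-vector products, one per parameter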

EDIT:

I just noticed in the traceback that lr_scheduler appears before the error in torch.autograd.grad.

Traceback (most recent call last):
    File "some root/run_glue.py", line 730, in <module>
      main()
    File "some root/run_glue.py", line 621, in main
      optimizer.step(closure=closure)
    File "some root/anaconda3/envs/AdaCubic/lib/python3.7/site-packages/accelerate/optimizer.py", line 140, in step
      self.optimizer.step(closure)
    File "some rootanaconda3/envs/AdaCubic/lib/python3.7/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
      return wrapped(*args, **kwargs)
    File "some rootanaconda3/envs/AdaCubic/lib/python3.7/site-packages/torch/optim/optimizer.py", line 113, in wrapper
      return func(*args, **kwargs)
    File "some root/anaconda3/envs/AdaCubic/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
      return func(*args, **kwargs)
    File "some root/AdaHessian.py", line 105, in step
      self.set_hessian()
    File some root/anaconda3/envs/AdaCubic/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
      return func(*args, **kwargs)
    File "some rootcubicReg/Code/Optimizers/AdaHessian.py", line 87, in set_hessian
      retain_graph=i < self.n_samples - 1)
    File "some root/anaconda3/envs/AdaCubic/lib/python3.7/site-packages/torch/autograd/__init__.py", line 278, in grad
      allow_unused, accumulate_grad=False)  # Calls into the C++ engine to run the backward pass
  RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
    0%|          | 0/6315 [00:02<?, ?it/s]

I suspected that something related to grad_fn was happening inside accelerator. Thus, commenting out

    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
    )

makes the optimization procedure start running, which indicates that grad_fn is somehow disabled inside accelerator. Could someone please suggest a way to overcome this problem?
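
To narrow this down, here is a small diagnostic sketch I would run on a single batch, using the model, batch and accelerator objects already defined in the script (purely illustrative). If loss.grad_fn is already None after the forward pass of the prepared model, the graph is lost before AdaHessian ever runs; it might also be worth repeating the test with mixed precision disabled, since a GradScaler changes what accelerator.backward does under the hood.

# Diagnostic check on one batch of the prepared model.
outputs = model(**batch)
loss = outputs.loss
print("loss.grad_fn:", loss.grad_fn)  # None would mean the forward pass itself
                                      # ran without a graph (e.g. a no-grad
                                      # context or frozen parameters)
print("trainable params:", sum(p.requires_grad for p in model.parameters()))

# The training loop already passes create_graph=True through accelerator.backward;
# after it, set_hessian expects every p.grad it collects to carry a grad_fn.
accelerator.backward(loss, create_graph=True)
some_grad = next(p.grad for p in model.parameters() if p.grad is not None)
print("grad has grad_fn:", some_grad.grad_fn is not None)  # must be True for
                                                           # torch.autograd.grad(grads, params, ...)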

LysandreJik commented 2 years ago

Hello, thanks for opening an issue! We try to keep the github issues for bugs/feature requests. Could you ask your question on the forum instead?

Thanks!

cc @sgugger

iTsingalis commented 2 years ago

Sorry for my misplaced post. I think the problem is solved. To be honest, I just re-applied the modifications to the original code more carefully and now it seems to be working. You can delete my post or move it to the forum if you find that more appropriate. Sorry again for the inconvenience.