kozistr / pytorch_optimizer

optimizer & lr scheduler & loss function collections in PyTorch
https://pytorch-optimizers.readthedocs.io/en/latest/
Apache License 2.0

sophiah in https://github.com/booydar/LM-RMT #194

Open robotzheng opened 1 year ago

robotzheng commented 1 year ago

params = 151111638

non emb params = 41066400

| epoch 1 step 50 | 50 batches | lr 0.06 | ms/batch 1378.43 | loss 7.85 | ppl 2570.784
| epoch 1 step 100 | 100 batches | lr 0.06 | ms/batch 968.61 | loss 7.49 | ppl 1787.593
| epoch 1 step 150 | 150 batches | lr 0.06 | ms/batch 971.58 | loss 7.48 | ppl 1769.387
| epoch 1 step 200 | 200 batches | lr 0.06 | ms/batch 969.84 | loss 7.47 | ppl 1760.055
| epoch 1 step 250 | 250 batches | lr 0.06 | ms/batch 973.37 | loss 7.46 | ppl 1738.300
| epoch 1 step 300 | 300 batches | lr 0.06 | ms/batch 970.12 | loss 7.48 | ppl 1772.002
| epoch 1 step 350 | 350 batches | lr 0.06 | ms/batch 970.52 | loss 7.47 | ppl 1751.793
| epoch 1 step 400 | 400 batches | lr 0.06 | ms/batch 973.12 | loss 7.47 | ppl 1755.161
| epoch 1 step 450 | 450 batches | lr 0.06 | ms/batch 970.79 | loss 7.46 | ppl 1736.315
| epoch 1 step 500 | 500 batches | lr 0.06 | ms/batch 974.13 | loss 7.48 | ppl 1765.010
| epoch 1 step 550 | 550 batches | lr 0.06 | ms/batch 973.86 | loss 7.48 | ppl 1778.569

Traceback (most recent call last):
  File "/home/notebook/code/personal/80306170/AGI/LM-RMT/pytorch/train.py", line 620, in <module>
    train()
  File "/home/notebook/code/personal/80306170/AGI/LM-RMT/pytorch/train.py", line 540, in train
    optimizer.step()
  File "/opt/conda/envs/dsd/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/envs/dsd/lib/python3.10/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/opt/conda/envs/dsd/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/dsd/lib/python3.10/site-packages/pytorch_optimizer/optimizer/sophia.py", line 92, in step
    self.compute_hutchinson_hessian(
  File "/opt/conda/envs/dsd/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/dsd/lib/python3.10/site-packages/pytorch_optimizer/base/optimizer.py", line 100, in compute_hutchinson_hessian
    h_zs = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=i < num_samples - 1)
  File "/opt/conda/envs/dsd/lib/python3.10/site-packages/torch/autograd/__init__.py", line 303, in grad
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: res[i].defined() INTERNAL ASSERT FAILED at "../torch/csrc/autograd/functions/tensor.cpp":142, please report a bug to PyTorch.

kozistr commented 1 year ago

hmm, I guess it's not an optimizer problem, but rather a PyTorch autograd internal issue or a problem in the training code (e.g. model, loss, etc.).

I just found that a similar error occurs when the loss is computed on the CPU (i.e. on a different device than the model).

maybe some modules are not on the same device, or there are unreachable graphs (parts of the model the loss can't backprop through).
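
If it helps to narrow it down, here's a minimal sanity check for both failure modes (just a sketch; `model`, `inputs`, `targets`, and `loss_fn` stand in for your own objects):

# sketch: check device placement and graph reachability
device = next(model.parameters()).device

# 1) is everything on the same device?
assert all(p.device == device for p in model.parameters())
assert inputs.device == device and targets.device == device

# 2) is every trainable parameter reachable from the loss?
loss = loss_fn(model(inputs), targets)
loss.backward()
unreachable = [n for n, p in model.named_parameters() if p.requires_grad and p.grad is None]
print('parameters unreachable from the loss:', unreachable)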

i404788 commented 1 year ago

Strange that it only triggers after so many steps; that makes it seem like a PyTorch/sync issue.

Just wanted to say: if you are using cross-entropy loss (as for LM training), the SophiaG variant is more efficient, since it just squares the gradient (see https://github.com/Liuhong99/Sophia/blob/19f45d30723bbffcce3d18e4e858d95b0f36dbb6/sophia.py#L56). You can use it like so (not tested):

hessian = list(map(lambda p: p.grad * p.grad, model.parameters()))
opt.step(hessian=hessian)

This also skips the 2nd order gradient calculation, so it could resolve your issue.

EDIT: you also need to filter out the non-trainable & sparse parameters, so it would be more like:

hessian = [p.grad*p.grad for p in model.parameters() if p.requires_grad and p.grad is not None and not p.grad.is_sparse]
opt.step(hessian=hessian)
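
For context, a complete training step with that estimate might look like this (still untested; `loss_fn`, `model`, `x`, `y`, and `opt` are placeholders):

loss = loss_fn(model(x), y)
loss.backward()  # plain backward; no create_graph needed for the squared-gradient estimate
hessian = [p.grad * p.grad for p in model.parameters()
           if p.requires_grad and p.grad is not None and not p.grad.is_sparse]
opt.step(hessian=hessian)  # assumes step() accepts the hessian kwarg, as above
opt.zero_grad()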

robotzheng commented 1 year ago

SophiaG worked, but the performance is no better than Adam's, maybe because of the bias in its hessian estimate. So I want to try SophiaH, which doesn't have that bias.

i404788 commented 1 year ago

Some last things to check:

If all of this checks out, then it pretty much has to be a bug in PyTorch (or in the training code).

Vectorrent commented 11 months ago

I have been running into a similar error message. I've been trying to use SophiaH with Lightning AI's automatic_optimization feature, but it always fails:

Traceback (most recent call last):
  File "/src/trainer.py", line 403, in <module>
    ai.train(
  File "/usr/local/lib/python3.10/dist-packages/aitextgen/aitextgen.py", line 804, in train
    trainer.fit(train_model)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 980, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 1023, in _run_stage
    self.fit_loop.run()
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 355, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/training_epoch_loop.py", line 133, in run
    self.advance(data_fetcher)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/training_epoch_loop.py", line 219, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 188, in run
    self._optimizer_step(kwargs.get("batch_idx", 0), closure)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 266, in _optimizer_step
    call._call_lightning_module_hook(
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 146, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/core/module.py", line 1276, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/core/optimizer.py", line 161, in step
    step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/strategy.py", line 231, in optimizer_step
    return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/plugins/precision/precision_plugin.py", line 116, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_optimizer/optimizer/sophia.py", line 92, in step
    self.compute_hutchinson_hessian(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_optimizer/base/optimizer.py", line 100, in compute_hutchinson_hessian
    h_zs = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=i < num_samples - 1)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 303, in grad
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

If I iterate through every parameter and set requires_grad to True, then I go OOM immediately at the "update_period" step:

for n, p in self.model.named_parameters():
    p.requires_grad = True

If I set requires_grad to False, then training progresses, but the model never learns anything.

If requires_grad is unset for ANY parameter, I get the original error message.

I am unsure how to proceed at this point, but I would greatly appreciate any advice you have to offer.

kozistr commented 11 months ago

Hello!

The SophiaH optimizer needs create_graph=True when calling backward(), because its Hutchinson hessian estimate differentiates the gradients a second time. That means automatic_optimization should be set to False!

here's an example.

import os

from torch import nn, utils
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
import lightning.pytorch as pl

from pytorch_optimizer import SophiaH

# define any number of nn.Modules (or use your current ones)
encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

class LitAutoEncoder(pl.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

        self.automatic_optimization = False

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        opt.zero_grad()

        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)

        loss = nn.functional.mse_loss(x_hat, x)

        # important: create_graph=True keeps the autograd graph alive so
        # SophiaH can differentiate the gradients a second time in step()
        self.manual_backward(loss, create_graph=True)
        opt.step()

        self.log("train_loss", loss)

    def configure_optimizers(self):
        return SophiaH(self.parameters())

dataset = MNIST(os.getcwd(), download=True, transform=ToTensor())
train_loader = utils.data.DataLoader(dataset)

autoencoder = LitAutoEncoder(encoder, decoder)

trainer = pl.Trainer(limit_train_batches=100, max_epochs=1)
trainer.fit(model=autoencoder, train_dataloaders=train_loader)

Vectorrent commented 11 months ago

Thank you for the quick response! I have applied your example to my own code (to the best of my ability), and while we're making progress, training bombs with a new error after reaching the first update_period:

Traceback (most recent call last):
  File "/src/trainer.py", line 403, in <module>
    ai.train(
  File "/usr/local/lib/python3.10/dist-packages/aitextgen/aitextgen.py", line 804, in train
    trainer.fit(train_model)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 980, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 1023, in _run_stage
    self.fit_loop.run()
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 355, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/training_epoch_loop.py", line 133, in run
    self.advance(data_fetcher)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/training_epoch_loop.py", line 221, in advance
    batch_output = self.manual_optimization.run(kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/optimization/manual.py", line 91, in run
    self.advance(kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/optimization/manual.py", line 111, in advance
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 294, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/strategy.py", line 380, in training_step
    return self.model.training_step(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/aitextgen/train.py", line 59, in training_step
    opt.step()
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/core/optimizer.py", line 161, in step
    step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/strategy.py", line 231, in optimizer_step
    return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/plugins/precision/precision_plugin.py", line 116, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_optimizer/optimizer/sophia.py", line 92, in step
    self.compute_hutchinson_hessian(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_optimizer/base/optimizer.py", line 100, in compute_hutchinson_hessian
    h_zs = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=i < num_samples - 1)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 303, in grad
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: element 4 of tensors does not require grad and does not have a grad_fn

I don't suspect this is the cause, but there is a warning at the beginning of training:

/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py:200: UserWarning: Using backward() with create_graph=True will create a reference cycle between the parameter and its gradient which can cause a memory leak. We recommend using autograd.grad when creating the graph to avoid this. If you have to use this function, make sure to reset the .grad fields of your parameters to None after use to break the cycle and avoid the leak. (Triggered internally at ../torch/csrc/autograd/engine.cpp:1151.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
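
Per that warning, the reference cycle could presumably be broken by resetting gradients after each step, e.g. (a sketch, using the standard torch.optim API):

# per the warning: drop gradient references after stepping, to break the
# parameter <-> gradient cycle that create_graph=True creates
opt.step()
opt.zero_grad(set_to_none=True)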

It may be relevant to know that I am using the Hugging Face PEFT library for LoRA training. I don't suspect that is the issue either, since all it really does is add some extra adapter layers to the model and freeze all the other layers.
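
That said, since the error complains about a tensor that "does not require grad", one untested idea would be to hand the optimizer only the trainable parameters, so the frozen base weights never reach the hessian computation (a sketch; `model` is the PEFT-wrapped model):

# untested sketch: exclude the PEFT-frozen base weights from the optimizer,
# so compute_hutchinson_hessian only sees parameters attached to the graph
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = SophiaH(trainable_params)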

I will troubleshoot some more when I get the chance. It's been a long day already, and I need to take a break. Thank you for the help thus far, and for maintaining such a useful library!

Vectorrent commented 11 months ago

Alright, well I was able to test your example MNIST code, and it does work. So I know this isn't an environment issue.

I removed PEFT as well, and tried standard fine-tuning. I also tried a couple of different models (GPT-2 and GPT-Neo) from the Hugging Face Transformers library. All ran into the same "tensors does not require grad and does not have a grad_fn" problem.

I'm sure the issue has to do with my training code. I'm carrying some legacy baggage, and I don't really have the skill set to do manual optimization properly (which is why I've relied on automatic_optimization until now). I haven't given up, but I'm probably going to move on for now. I appreciate your help.