pytorch / ao

PyTorch native quantization and sparsity for training and inference
BSD 3-Clause "New" or "Revised" License

CPUOffloadOptimizer issues #1209

Open felipemello1 opened 4 weeks ago

felipemello1 commented 4 weeks ago

Hi all, I was giving the CPUOffloadOptimizer a try and found two issues when using it with QLoRA single device in torchtune (a minimal repro sketch follows at the end of this comment):

  1. When using an LR scheduler I got the traceback below. Maybe there is a way to inherit the optimizer class?

    File "/data/users/felipemello/torchtune/torchtune/training/lr_schedulers.py", line 58, in get_cosine_schedule_with_warmup
    return LambdaLR(optimizer, lr_lambda, last_epoch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 336, in __init__
    super().__init__(optimizer, last_epoch, verbose)
    File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 99, in __init__
    raise TypeError(f"{type(optimizer).__name__} is not an Optimizer")
    TypeError: CPUOffloadOptimizer is not an Optimizer
  2. When passing model.parameters() I got the error below. I imagine that a simple fix is to keep only params that require grad, like the AdamW implementation does.

    File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torchao/prototype/low_bit_optim/cpu_offload.py", line 76, in __init__
    p_cuda.register_post_accumulate_grad_hook(backward_hook)
    File "/home/felipemello/.conda/envs/torchtune/lib/python3.11/site-packages/torch/_tensor.py", line 678, in register_post_accumulate_grad_hook
    raise RuntimeError(
    RuntimeError: cannot register a hook on a tensor that doesn't require gradient

cc: @gau-nernst
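
For reference, a hypothetical minimal snippet along these lines hits both failures (assuming the torchao prototype import path and a `(params, optimizer_class, **kwargs)` constructor; illustrative, not my exact torchtune setup):

```python
import torch
from torch.optim.lr_scheduler import LambdaLR
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

model = torch.nn.Linear(16, 16, device="cuda")

# Issue 1: the LR scheduler rejects the wrapper because it is not a
# torch.optim.Optimizer subclass.
optim = CPUOffloadOptimizer(model.parameters(), torch.optim.AdamW, lr=1e-4)
scheduler = LambdaLR(optim, lr_lambda=lambda step: 1.0)
# -> TypeError: CPUOffloadOptimizer is not an Optimizer

# Issue 2: with frozen params (as in QLoRA, where base weights are frozen),
# construction itself fails while registering the offload hook.
model.bias.requires_grad_(False)
optim = CPUOffloadOptimizer(model.parameters(), torch.optim.AdamW, lr=1e-4)
# -> RuntimeError: cannot register a hook on a tensor that doesn't require gradient
```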

gau-nernst commented 4 weeks ago

1 is a known issue. You can see my view here: https://github.com/pytorch/ao/issues/959#issuecomment-2378225308. I will look into the torch.optim.Optimizer base class to see what could go wrong if I make CPUOffloadOptimizer inherit from it. For example, off the top of my head, CPUOffloadOptimizer will not have self.state.

In the meantime, CPUOffloadOptimizer requires setting the LR manually, as in the sketch below: https://github.com/pytorch/ao/pull/584#issuecomment-2364915318
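
Something like this untested sketch (assuming CPUOffloadOptimizer exposes param_groups like a regular optimizer; the warmup lambda is just an example schedule, and model/optim/num_steps/batch are your own training objects):

```python
base_lr = 1e-4
warmup_steps = 100

def lr_lambda(step: int) -> float:
    # Example linear warmup; substitute whatever schedule you need.
    return min(1.0, (step + 1) / warmup_steps)

for step in range(num_steps):
    # Set the LR on every param group by hand, since LambdaLR cannot wrap
    # CPUOffloadOptimizer yet.
    for group in optim.param_groups:
        group["lr"] = base_lr * lr_lambda(step)

    loss = model(batch).sum()  # placeholder for the real loss computation
    loss.backward()
    optim.step()
    optim.zero_grad()
```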

For 2, it's an oversight on my part. We can simply add a requires-grad check there (sketched below). Will push a fix: https://github.com/pytorch/ao/blob/27619174ed5a372a1ce96a0615089c5a08c88566/torchao/prototype/low_bit_optim/cpu_offload.py#L68-L77
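
Roughly, the check would look like this (paraphrased around the linked lines, not the actual patch):

```python
# Inside CPUOffloadOptimizer.__init__, while iterating the parameters that get
# the offload hook: skip anything that doesn't require gradients (e.g. frozen
# base weights in QLoRA), since post-accumulate-grad hooks can only be
# registered on tensors that require grad.
for p_cuda in params:
    if not p_cuda.requires_grad:
        continue
    # ... existing setup: pinned CPU copy, CPU optimizer state, etc. ...
    p_cuda.register_post_accumulate_grad_hook(backward_hook)
```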

fzyzcjy commented 1 week ago

Hi, are there any updates? Thanks! It would be great if it could be plugged directly into Hugging Face transformers, but right now it errors out because of the scheduler issue above:

[10:19:58.912]:     self.trainer.inner.train()
[10:19:58.912]:   File "/opt/conda/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 434, in train
[10:19:58.912]:     output = super().train(*args, **kwargs)
[10:19:58.912]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[10:19:58.912]:   File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2123, in train
[10:19:58.912]:     return inner_training_loop(
[10:19:58.912]:            ^^^^^^^^^^^^^^^^^^^^
[10:19:58.912]:   File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2224, in _inner_training_loop
[10:19:58.912]:     self.create_optimizer_and_scheduler(num_training_steps=max_steps)
[10:19:58.912]:   File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 1130, in create_optimizer_and_scheduler
[10:19:58.912]:     self.create_scheduler(num_training_steps=num_training_steps, optimizer=optimizer)
[10:19:58.912]:   File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 1632, in create_scheduler
[10:19:58.912]:     self.lr_scheduler = get_scheduler(
[10:19:58.912]:                         ^^^^^^^^^^^^^^
[10:19:58.912]:   File "/opt/conda/lib/python3.11/site-packages/transformers/optimization.py", line 550, in get_scheduler
[10:19:58.913]:     return schedule_func(
[10:19:58.913]:            ^^^^^^^^^^^^^^
[10:19:58.913]:   File "/opt/conda/lib/python3.11/site-packages/transformers/optimization.py", line 132, in get_linear_schedule_with_warmup
[10:19:58.913]:     return LambdaLR(optimizer, lr_lambda, last_epoch)
[10:19:58.913]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[10:19:58.913]:   File "/opt/conda/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 336, in __init__
[10:19:58.913]:     super().__init__(optimizer, last_epoch, verbose)
[10:19:58.913]:   File "/opt/conda/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 99, in __init__
[10:19:58.913]:     raise TypeError(f"{type(optimizer).__name__} is not an Optimizer")
[10:19:58.913]: TypeError: CPUOffloadOptimizer is not an Optimizer

gau-nernst commented 1 week ago

@fzyzcjy To unblock your case, you can try making CPUOffloadOptimizer a subclass of torch.optim.Optimizer, i.e. change the following line

https://github.com/pytorch/ao/blob/aeff75bb42f1190f582b23b4c19c892d05f678ba/torchao/prototype/low_bit_optim/cpu_offload.py#L9

to class CPUOffloadOptimizer(Optimizer):. Make sure not to call super().__init__(); this is just a workaround to pass the class check done by the PyTorch LR scheduler (a sketch of the local patch is below). I will investigate whether this causes other issues before merging the fix.
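
A minimal sketch of that local patch (the constructor arguments shown are illustrative, not the exact torchao signature):

```python
import torch
from torch.optim import Optimizer

# Workaround sketch: subclass Optimizer purely to satisfy the isinstance check
# in torch.optim.lr_scheduler.
class CPUOffloadOptimizer(Optimizer):  # was: class CPUOffloadOptimizer:
    def __init__(self, params, optimizer_class=torch.optim.AdamW, **kwargs):
        # Intentionally do NOT call super().__init__(): the base class would
        # set up its own param_groups/state machinery, which this wrapper
        # manages differently.
        ...
```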

IMO, since Python relies on duck typing, the PyTorch LR scheduler should not explicitly check for the optimizer class.

fzyzcjy commented 1 week ago

Thank you!