felipemello1 opened 4 weeks ago
1 is a known issue. You can see my view here: https://github.com/pytorch/ao/issues/959#issuecomment-2378225308. I will look into the torch.optim.Optimizer base class to see what could go wrong if I make CPUOffloadOptimizer inherit from it. For example, off the top of my head, CPUOffloadOptimizer will not have self.state.

In the meantime, CPUOffloadOptimizer requires setting the LR manually: https://github.com/pytorch/ao/pull/584#issuecomment-2364915318
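For reference, a minimal sketch of what "setting the LR manually" could look like, assuming CPUOffloadOptimizer exposes param_groups like a regular optimizer (the warmup rule and sizes here are made up for illustration):

```python
import torch
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

model = torch.nn.Linear(16, 16).cuda()
optim = CPUOffloadOptimizer(model.parameters(), torch.optim.AdamW, lr=1e-4)

base_lr, warmup_steps = 1e-4, 100
for step in range(1000):
    # Manual LR update in place of an LR scheduler (toy linear warmup)
    lr = base_lr * min(1.0, (step + 1) / warmup_steps)
    for group in optim.param_groups:
        group["lr"] = lr

    loss = model(torch.randn(4, 16, device="cuda")).sum()
    loss.backward()
    optim.step()
    optim.zero_grad()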
For 2, it's an oversight on my part. We can simply add a requires_grad check here; I will push a fix: https://github.com/pytorch/ao/blob/27619174ed5a372a1ce96a0615089c5a08c88566/torchao/prototype/low_bit_optim/cpu_offload.py#L68-L77
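Not the actual patch, just a sketch of where such a check could go; the surrounding loop structure is assumed from the linked lines:

```python
# Sketch (not the merged fix): skip params that don't require grad when the
# offload optimizer iterates over the param groups in __init__.
for param_group in param_groups:
    for p_cuda in param_group["params"]:
        if not p_cuda.requires_grad:
            continue  # e.g. frozen base weights when training QLoRA adapters
        # ... existing logic: create pinned CPU copy, register post-grad hook, etc.
```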
Hi, are there any updates? Thanks! It would be great if it could be plugged directly into Hugging Face transformers, but right now it fails because of the scheduler issue above:
[10:19:58.912]: self.trainer.inner.train()
[10:19:58.912]: File "/opt/conda/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 434, in train
[10:19:58.912]: output = super().train(*args, **kwargs)
[10:19:58.912]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[10:19:58.912]: File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2123, in train
[10:19:58.912]: return inner_training_loop(
[10:19:58.912]: ^^^^^^^^^^^^^^^^^^^^
[10:19:58.912]: File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 2224, in _inner_training_loop
[10:19:58.912]: self.create_optimizer_and_scheduler(num_training_steps=max_steps)
[10:19:58.912]: File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 1130, in create_optimizer_and_scheduler
[10:19:58.912]: self.create_scheduler(num_training_steps=num_training_steps, optimizer=optimizer)
[10:19:58.912]: File "/opt/conda/lib/python3.11/site-packages/transformers/trainer.py", line 1632, in create_scheduler
[10:19:58.912]: self.lr_scheduler = get_scheduler(
[10:19:58.912]: ^^^^^^^^^^^^^^
[10:19:58.912]: File "/opt/conda/lib/python3.11/site-packages/transformers/optimization.py", line 550, in get_scheduler
[10:19:58.913]: return schedule_func(
[10:19:58.913]: ^^^^^^^^^^^^^^
[10:19:58.913]: File "/opt/conda/lib/python3.11/site-packages/transformers/optimization.py", line 132, in get_linear_schedule_with_warmup
[10:19:58.913]: return LambdaLR(optimizer, lr_lambda, last_epoch)
[10:19:58.913]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[10:19:58.913]: File "/opt/conda/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 336, in __init__
[10:19:58.913]: super().__init__(optimizer, last_epoch, verbose)
[10:19:58.913]: File "/opt/conda/lib/python3.11/site-packages/torch/optim/lr_scheduler.py", line 99, in __init__
[10:19:58.913]: raise TypeError(f"{type(optimizer).__name__} is not an Optimizer")
[10:19:58.913]: TypeError: CPUOffloadOptimizer is not an Optimizer
@fzyzcjy To unblock your case, you can try making CPUOffloadOptimizer a subclass of torch.optim.Optimizer, i.e. change the class definition to class CPUOffloadOptimizer(Optimizer). Make sure not to call super().__init__(), as this is just a workaround to pass the class check done by the PyTorch LR scheduler. I will investigate whether this causes other issues before merging the fix.
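In case it helps, a sketch of what that workaround edit could look like (simplified; the real constructor arguments and body in cpu_offload.py are elided):

```python
import torch
from torch.optim import Optimizer

class CPUOffloadOptimizer(Optimizer):  # subclass only to pass the scheduler's isinstance() check
    def __init__(self, params, optimizer_class=torch.optim.AdamW, **kwargs):
        # Intentionally NOT calling super().__init__(): the base constructor
        # sets up defaults/state that this wrapper manages on its own.
        ...  # existing CPU-offload setup elided
```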
IMO, since Python uses duck typing, the PyTorch LR scheduler should not explicitly check for the optimizer class.
Thank you!
Hi all, I was giving the CPUOffloadOptimizer a try and found two issues when using it with single-device QLoRA in torchtune:
1. When using an LR scheduler I got an error (a minimal repro sketch is included below). Maybe there is a way to inherit from the optimizer class?
2. When passing model.params() I got the error below. I imagine a simple fix is to keep only params that require grad, like the AdamW implementation does.
cc: @gau-nernst
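For completeness, a minimal repro sketch of issue 1 (model and sizes are arbitrary; a CUDA device is assumed since CPUOffloadOptimizer works with CUDA params):

```python
import torch
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

model = torch.nn.Linear(8, 8).cuda()
optim = CPUOffloadOptimizer(model.parameters(), torch.optim.AdamW, lr=1e-4)

# Any torch LR scheduler fails the class check on the wrapper:
sched = torch.optim.lr_scheduler.LambdaLR(optim, lambda step: 1.0)
# -> TypeError: CPUOffloadOptimizer is not an Optimizer
```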