Closed · WL0118 closed this issue 1 year ago
Hi @WL0118, what dataset size are you using? For your case it is probably better to use a single A100 with LoRA; it will reduce costs and be much faster. We will also look into the GPU issue, but the most probable reason is that our configuration is set up for data-parallel finetuning, not for splitting the model across GPUs.
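A minimal sketch of that single-GPU LoRA route might look like the following; the `llama2_lora` model key and the dataset path are assumptions to check against the installed xTuring version:

```python
# Hypothetical single-A100 LoRA fine-tune with xTuring (keys and paths are placeholders).
from xturing.datasets import InstructionDataset
from xturing.models import BaseModel

# Load an instruction dataset from a local folder (replace with your own data).
dataset = InstructionDataset("./my_dataset")

# "llama2_lora" is assumed to select LLaMA 2 7B wrapped with LoRA adapters.
model = BaseModel.create("llama2_lora")

# Only the LoRA adapter weights are trained, which fits comfortably on one A100.
model.finetune(dataset=dataset)

# Persist the adapted weights for later loading.
model.save("./llama2_lora_finetuned")
```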
Hi, thank you for the comment.
I used a small text file (<20 MB).
Since I wanted to update the model weights directly, I decided not to use LoRA.
I solved this problem by using DeepSpeed directly instead of xTuring's trainer.
I am going to close this issue since it looks like I am the only one who needs this feature.
Thank you.
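For reference, a rough sketch of that "use DeepSpeed directly" approach is below. It is not the actual train.py from this issue; the model name, batching, and hyperparameters are placeholders, and the relevant part is the ZeRO config with `"offload_optimizer": {"device": "cpu"}`, which is what the assertion in the log below asks for.

```python
# Illustrative full-parameter fine-tune with DeepSpeed ZeRO-3 + CPU optimizer offload.
# Launch with the DeepSpeed launcher, e.g.: deepspeed --num_gpus=8 train_ds.py
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                              # shard params, grads and optimizer states
        "offload_optimizer": {"device": "cpu"},  # keeps optimizer states on CPU (CPUAdam)
        "offload_param": {"device": "cpu"},
    },
    "optimizer": {"type": "AdamW", "params": {"lr": 5e-5, "weight_decay": 0.0}},
}

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# DeepSpeed builds its CPU Adam optimizer itself, because the optimizer is defined
# in ds_config and offload_optimizer is set to "cpu".
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

def train_step(batch):
    # batch: dict with input_ids / attention_mask / labels tensors on engine.device
    loss = engine(**batch).loss
    engine.backward(loss)
    engine.step()
    return loss
```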
I got the following error when I tried to train the LLaMA 2 7B model without LoRA (the base model). Please help me with this problem.
SYSTEM ENV: AWS SageMaker ml.p4de.24xlarge instance (8x A100), pytorch_p310 image
ERROR:
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda-11.8/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 46.40223288536072 seconds
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Loading extension module cpu_adam...
Time to load cpu_adam op: 46.353471755981445 seconds
Time to load cpu_adam op: 46.35285019874573 seconds
Time to load cpu_adam op: 46.35269808769226 seconds
Time to load cpu_adam op: 46.353540897369385 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 46.11113476753235 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 46.442068576812744 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 46.448280572891235 seconds
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000050, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
(the two lines above are printed once per rank, 8 times in total)
Finding best initial lr: 0%| | 0/100 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/train.py", line 24, in <module>
model.finetune(dataset=dataset)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/xturing/models/causal.py", line 119, in finetune
trainer.fit()
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/xturing/trainers/lightning_trainer.py", line 187, in fit
self.trainer.fit(self.lightning_model)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 545, in fit
call._call_and_handle_interrupt(
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
return function(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 581, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 970, in _run
call._call_callback_hooks(self, "on_fit_start")
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 208, in _call_callback_hooks
fn(trainer, trainer.lightning_module, *args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/callbacks/lr_finder.py", line 126, in on_fit_start
self.lr_find(trainer, pl_module)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/callbacks/lr_finder.py", line 110, in lr_find
self.optimal_lr = _lr_find(
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/tuner/lr_finder.py", line 275, in _lr_find
_try_loop_run(trainer, params)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/tuner/lr_finder.py", line 515, in _try_loop_run
loop.run()
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
self.advance()
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
self.epoch_loop.run(self._data_fetcher)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 136, in run
self.advance(data_fetcher)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 240, in advance
batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 187, in run
self._optimizer_step(batch_idx, closure)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 265, in _optimizer_step
call._call_lightning_module_hook(
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1282, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 151, in step
step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 263, in optimizer_step
optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 230, in optimizer_step
return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 117, in optimizer_step
return optimizer.step(closure=closure, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
return wrapped(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/optim/optimizer.py", line 280, in wrapper
out = func(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/deepspeed/ops/adam/cpu_adam.py", line 150, in step
assert p.device == device, f"CPUAdam param is on {p.device} and must be 'cpu', make " \
AssertionError: CPUAdam param is on cuda:0 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config.
(the same traceback is printed by each of the remaining seven DDP ranks; only the device named in the final assertion differs)
AssertionError: CPUAdam param is on cuda:1 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config.
AssertionError: CPUAdam param is on cuda:2 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config.
AssertionError: CPUAdam param is on cuda:3 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config.
AssertionError: CPUAdam param is on cuda:4 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config.
AssertionError: CPUAdam param is on cuda:5 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config.
AssertionError: CPUAdam param is on cuda:6 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config.
AssertionError: CPUAdam param is on cuda:7 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config.
Finding best initial lr: 0%| | 0/100 [00:23<?, ?it/s]
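Note on the assertion itself: DeepSpeed's CPUAdam expects the optimizer states (and the parameters it steps) to live on the CPU, but here the stack goes through Lightning's plain DDP strategy, so every parameter is still on its GPU. A minimal hand-rolled Lightning sketch of the combination the error message asks for is below; this is illustrative only, not xTuring's internal trainer, and the LightningModule, model name, and dataloader are placeholders.

```python
# Minimal illustration: DeepSpeedCPUAdam must be paired with a DeepSpeed strategy
# that offloads optimizer states to CPU, otherwise the CPUAdam assertion fires.
import pytorch_lightning as pl
from deepspeed.ops.adam import DeepSpeedCPUAdam
from pytorch_lightning.strategies import DeepSpeedStrategy
from transformers import AutoModelForCausalLM

class CausalLMModule(pl.LightningModule):
    def __init__(self, model_name: str = "meta-llama/Llama-2-7b-hf"):
        super().__init__()
        self.model = AutoModelForCausalLM.from_pretrained(model_name)

    def training_step(self, batch, batch_idx):
        # batch is expected to contain input_ids / attention_mask / labels
        return self.model(**batch).loss

    def configure_optimizers(self):
        # Keeps optimizer states in host memory; valid only with offload_optimizer=True.
        return DeepSpeedCPUAdam(self.parameters(), lr=5e-5)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    precision="bf16-mixed",
    strategy=DeepSpeedStrategy(
        stage=3,                 # ZeRO-3: shard params, grads and optimizer states
        offload_optimizer=True,  # the 'offload_optimizer': 'cpu' the assertion refers to
        offload_parameters=True,
    ),
)
# trainer.fit(CausalLMModule(), train_dataloaders=my_dataloader)  # my_dataloader is a placeholder
```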