stochasticai / xTuring

Build, customize and control your own LLMs. From data pre-processing to fine-tuning, xTuring provides an easy way to personalize open-source LLMs. Join our Discord community: https://discord.gg/TgHXuSJEk6
https://xturing.stochastic.ai
Apache License 2.0

Fine-tuning the llama2-7b base model (without LoRA) on 8×A100 GPUs (AWS ml.p4de.24xlarge) fails. #267

Closed WL0118 closed 1 year ago

WL0118 commented 1 year ago

I got the following error when I tried to train the llama2-7b base model (without LoRA). Please help me with this problem.

SYSTEM ENV: AWS SageMaker ml.p4de.24xlarge instance, pytorch_p310 image
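For reference, a minimal sketch of the failing training script (a hypothetical reconstruction from the traceback below — the dataset path is a placeholder, and the API calls follow the xTuring README examples):

```python
# Hypothetical sketch of train.py: load the llama2-7b base model
# (no LoRA variant) and fine-tune it on an instruction dataset.
from xturing.datasets.instruction_dataset import InstructionDataset
from xturing.models import BaseModel

dataset = InstructionDataset("./my_instruction_data")  # placeholder path
model = BaseModel.create("llama2")   # base model key, not "llama2_lora"
model.finetune(dataset=dataset)      # raises the CPUAdam assertion
```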

ERROR:

```
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -lcurand -L/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda-11.8/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 46.40223288536072 seconds
[... "Loading extension module cpu_adam..." / "Time to load cpu_adam op: ~46 seconds" repeats once per rank, 8 times in total ...]
Adam Optimizer #0 is created with AVX512 arithmetic capability.
Config: alpha=0.000050, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[... repeats once per rank, 8 times in total ...]
Finding best initial lr:   0%|          | 0/100 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/train.py", line 24, in <module>
    model.finetune(dataset=dataset)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/xturing/models/causal.py", line 119, in finetune
    trainer.fit()
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/xturing/trainers/lightning_trainer.py", line 187, in fit
    self.trainer.fit(self.lightning_model)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 545, in fit
    call._call_and_handle_interrupt(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 581, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 970, in _run
    call._call_callback_hooks(self, "on_fit_start")
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 208, in _call_callback_hooks
    fn(trainer, trainer.lightning_module, *args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/callbacks/lr_finder.py", line 126, in on_fit_start
    self.lr_find(trainer, pl_module)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/callbacks/lr_finder.py", line 110, in lr_find
    self.optimal_lr = _lr_find(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/tuner/lr_finder.py", line 275, in _lr_find
    _try_loop_run(trainer, params)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/tuner/lr_finder.py", line 515, in _try_loop_run
    loop.run()
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 136, in run
    self.advance(data_fetcher)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 240, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 187, in run
    self._optimizer_step(batch_idx, closure)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 265, in _optimizer_step
    call._call_lightning_module_hook(
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1282, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 151, in step
    step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 263, in optimizer_step
    optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 230, in optimizer_step
    return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 117, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/deepspeed/ops/adam/cpu_adam.py", line 150, in step
    assert p.device == device, f"CPUAdam param is on {p.device} and must be 'cpu', make " \
AssertionError: CPUAdam param is on cuda:0 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config.
```

Each of the 8 ranks raises this same `AssertionError`, differing only in the device index reported (`cuda:0`, `cuda:1`, `cuda:2`, `cuda:3`, `cuda:6`, ... in the captured log); the duplicate per-rank tracebacks are omitted above, and the log ends truncated in the middle of the last one.
closure, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 263, in optimizer_step optimizer_output = super().optimizer_step(optimizer, closure, model, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 230, in optimizer_step return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 117, in optimizer_step return optimizer.step(closure=closure, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper return wrapped(*args, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/optim/optimizer.py", line 280, in wrapper out = func(*args, *kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(args, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/deepspeed/ops/adam/cpu_adam.py", line 150, in step assert p.device == device, f"CPUAdam param is on {p.device} and must be 'cpu', make " \ AssertionError: CPUAdam param is on cuda:5 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config. 
Traceback (most recent call last): File "/home/ec2-user/SageMaker/train.py", line 24, in model.finetune(dataset=dataset) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/xturing/models/causal.py", line 119, in finetune trainer.fit() File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/xturing/trainers/lightning_trainer.py", line 187, in fit self.trainer.fit(self.lightning_model) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 545, in fit call._call_and_handle_interrupt( File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch return function(*args, *kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 581, in _fit_impl self._run(model, ckpt_path=ckpt_path) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 970, in _run call._call_callback_hooks(self, "on_fit_start") File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 208, in _call_callback_hooks fn(trainer, trainer.lightning_module, args, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/callbacks/lr_finder.py", line 126, in on_fit_start Finding best initial lr: 0%| | 0/100 [00:22<?, ?it/s] self.lr_find(trainer, pl_module) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/callbacks/lr_finder.py", line 110, in lr_find self.optimal_lr = _lr_find( 
File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/tuner/lr_finder.py", line 275, in _lr_find _try_loop_run(trainer, params) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/tuner/lr_finder.py", line 515, in _try_loop_run loop.run() File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run self.advance() File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance self.epoch_loop.run(self._data_fetcher) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 136, in run self.advance(data_fetcher) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 240, in advance batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 187, in run self._optimizer_step(batch_idx, closure) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 265, in _optimizer_step call._call_lightning_module_hook( File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook output = fn(*args, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1282, in optimizer_step optimizer.step(closure=optimizer_closure) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 151, in step step_output = self._strategy.optimizer_step(self._optimizer, 
closure, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 263, in optimizer_step optimizer_output = super().optimizer_step(optimizer, closure, model, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 230, in optimizer_step return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 117, in optimizer_step return optimizer.step(closure=closure, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper return wrapped(*args, *kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/optim/optimizer.py", line 280, in wrapper out = func(args, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/deepspeed/ops/adam/cpu_adam.py", line 150, in step assert p.device == device, f"CPUAdam param is on {p.device} and must be 'cpu', make " \ AssertionError: CPUAdam param is on cuda:7 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config. 
Traceback (most recent call last): File "/home/ec2-user/SageMaker/train.py", line 24, in model.finetune(dataset=dataset) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/xturing/models/causal.py", line 119, in finetune trainer.fit() File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/xturing/trainers/lightning_trainer.py", line 187, in fit self.trainer.fit(self.lightning_model) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 545, in fit call._call_and_handle_interrupt( File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, *kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch return function(args, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 581, in _fit_impl self._run(model, ckpt_path=ckpt_path) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 970, in _run call._call_callback_hooks(self, "on_fit_start") File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 208, in _call_callback_hooks fn(trainer, trainer.lightning_module, *args, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/callbacks/lr_finder.py", line 126, in on_fit_start self.lr_find(trainer, pl_module) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/callbacks/lr_finder.py", line 110, in lr_find self.optimal_lr = _lr_find( File 
"/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/tuner/lr_finder.py", line 275, in _lr_find _try_loop_run(trainer, params) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/tuner/lr_finder.py", line 515, in _try_loop_run loop.run() File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run self.advance() File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance self.epoch_loop.run(self._data_fetcher) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 136, in run self.advance(data_fetcher) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 240, in advance batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 187, in run self._optimizer_step(batch_idx, closure) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 265, in _optimizer_step call._call_lightning_module_hook( File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook output = fn(*args, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1282, in optimizer_step optimizer.step(closure=optimizer_closure) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 151, in step step_output = self._strategy.optimizer_step(self._optimizer, 
closure, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 263, in optimizer_step optimizer_output = super().optimizer_step(optimizer, closure, model, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 230, in optimizer_step return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 117, in optimizer_step return optimizer.step(closure=closure, *kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper return wrapped(args, kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/optim/optimizer.py", line 280, in wrapper out = func(*args, *kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(args, **kwargs) File "/home/ec2-user/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/deepspeed/ops/adam/cpu_adam.py", line 150, in step assert p.device == device, f"CPUAdam param is on {p.device} and must be 'cpu', make " \ AssertionError: CPUAdam param is on cuda:4 and must be 'cpu', make sure you enabled 'offload_optimizer': 'cpu' in your ZeRO config. Finding best initial lr: 0%| | 0/100 [00:22<?, ?it/s] Finding best initial lr: 0%| | 0/100 [00:22<?, ?it/s] Finding best initial lr: 0%| | 0/100 [00:23<?, ?it/s] Finding best initial lr: 0%| | 0/100 [00:23<?, ?it/s] Finding best initial lr: 0%| | 0/100 [00:23<?, ?it/s] Finding best initial lr: 0%| | 0/100 [00:23<?, ?it/s]

StochasticRomanAgeev commented 1 year ago

Hi @WL0118, what dataset size are you using? For your case it is probably better to use a single A100 with LoRA; it will reduce costs and be much faster. We will also look into the multi-GPU issue, but the most likely cause is that our configuration is set up for data-parallel fine-tuning, not for splitting the model across GPUs.

WL0118 commented 1 year ago

Hi @WL0118, what dataset size are you using? For your case it is probably better to use a single A100 with LoRA; it will reduce costs and be much faster. We will also look into the multi-GPU issue, but the most likely cause is that our configuration is set up for data-parallel fine-tuning, not for splitting the model across GPUs.

Hi, thank you for the comment.

I used a small text file (<20 MB).

Since I wanted to update the model weights directly, I decided not to use LoRA.

I solved this problem by using DeepSpeed directly instead of xTuring's training loop.
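For anyone else hitting the same assertion: DeepSpeed's CPUAdam optimizer requires the parameters it steps to live on the host, which is exactly what the failing check enforces, so the ZeRO config has to enable CPU optimizer offload. A minimal config sketch is below; the stage, batch size, precision, and learning rate are placeholders I chose for illustration, not values from this issue, so adjust them for your setup:

```json
{
  "train_batch_size": 64,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  },
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": 5e-5 }
  }
}
```

With `offload_optimizer.device` set to `cpu`, the optimizer state stays on the host while ZeRO partitions the model across the GPUs, so the `CPUAdam param is on cuda:X` assertion no longer fires.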

I am going to close this issue since it looks like I am the only one who needs this feature.

Thank you.