josStorer / RWKV-Runner

An RWKV management and startup tool: fully automated, only 8 MB, with an interface compatible with the OpenAI API. RWKV is a large language model that is fully open source and available for commercial use.
https://www.rwkv.com
MIT License
5.31k stars 502 forks

LoRA fine-tuning training problem #347

Open zigui123340 opened 5 months ago

zigui123340 commented 5 months ago

I have two GPUs, a P40 and a 1070, both Pascal architecture. Inference works without problems, but training fails with the following:

Traceback (most recent call last):
  File "/mnt/e/rwkv/./finetune/lora/v6/train.py", line 540, in <module>
    trainer.fit(model, data_loader)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 88, in launch
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1049, in _run
    self.setup_profiler()
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1509, in setup_profiler
    self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/trainer.py", line 1826, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 314, in broadcast
    torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2411, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1910, in broadcast
    work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
nvmlDeviceGetHandleByIndex(0) failed: Unknown Erro

I have searched online many times without finding a solution, and I hope someone can help.
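The NCCL error text itself suggests rerunning with NCCL_DEBUG=INFO for details. A minimal sketch of preparing such an environment from Python follows; it is only a diagnostic aid, not a confirmed fix. The helper name build_debug_env is hypothetical and not part of RWKV-Runner, and restricting the visible GPUs through CUDA_VISIBLE_DEVICES is an optional assumption for narrowing down which card triggers the failure.

```python
import os


def build_debug_env(base_env=None, visible_devices=None):
    """Return a copy of the environment with NCCL debug logging enabled.

    Hypothetical helper for illustration; not part of RWKV-Runner.
    """
    env = dict(os.environ if base_env is None else base_env)
    # The NCCL error message asks for this setting to get more detail.
    env["NCCL_DEBUG"] = "INFO"
    if visible_devices is not None:
        # Optional: expose only the listed GPU indices to the process.
        env["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in visible_devices)
    return env


if __name__ == "__main__":
    env = build_debug_env(visible_devices=[0])
    print(env["NCCL_DEBUG"], env["CUDA_VISIBLE_DEVICES"])
```

The returned dictionary can be passed as the env argument of subprocess.run when relaunching the same training command, so the extra NCCL log lines appear next to the original traceback.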

zigui123340 commented 5 months ago

Training parameters (image)