shuxueslpi / chatGLM-6B-QLoRA

Uses the peft library to implement efficient 4-bit QLoRA fine-tuning of chatGLM-6B/chatGLM2-6B, and to merge the LoRA model with the base model and quantize the result to 4 bits.

BrokenPipeError: [Errno 32] Broken pipe #40

Open plutoda588 opened 11 months ago

plutoda588 commented 11 months ago

Thank you, and please excuse my English. When I run train_qlora.py with the command below, I get the following error:

(base) root@461jc47ml0du4-0:/T106/chatGLM-6B-QLoRA-main/chatGLM-6B-QLoRA-main# CUDA_VISIBLE_DEVICES=1 python3 train_qlora.py --train_args_json chatGLM_6B_QLoRA.json --model_name_or_path /T106/chatGLM-6B-QLoRA-main/chatGLM-6B-QLoRA-main/remote_scripts/chatglm-6b/ --train_data_path data/train.jsonl --eval_data_path data/dev.jsonl --lora_rank 4 --lora_dropout 0.05 --compute_dtype fp32
Loading checkpoint shards: 100%|██████████| 8/8 [00:11<00:00, 1.38s/it]
trainable params: 1,835,008 || all params: 6,175,121,408 || trainable%: 0.029716144489446126
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-7be044e55537389d/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|██████████| 1/1 [00:00<00:00, 425.26it/s]
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-7be044e55537389d/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-b1ad1cf49d010a09.arrow
Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/json/default-7be044e55537389d/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-7f22050519838b48.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-7be044e55537389d/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-82c52662d9060a3a.arrow
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-15aa3e3de12fc81f/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|██████████| 1/1 [00:00<00:00, 1132.07it/s]
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-15aa3e3de12fc81f/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-d76e708a4953afce.arrow
Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/json/default-15aa3e3de12fc81f/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-466cf4ff38bae650.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-15aa3e3de12fc81f/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-74e21cf8bae50992.arrow
wandb: Currently logged in as: 2315553823 (fky_hbj). Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.15.10
wandb: Run data is saved locally in /T106/chatGLM-6B-QLoRA-main/chatGLM-6B-QLoRA-main/wandb/run-20230920_070236-anybjcsm
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run kind-sunset-17
wandb: ⭐️ View project at https://wandb.ai/fky_hbj/huggingface
wandb: 🚀 View run at https://wandb.ai/fky_hbj/huggingface/runs/anybjcsm
0%| | 0/3581 [00:00<?, ?it/s]
use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
0%| | 2/3581 [00:25<12:28:26, 12.55s/it]

Exception in thread NetStatThr:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 267, in check_network_status
    self._loop_check_status(
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 223, in _loop_check_status
    local_handle = request()
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 735, in deliver_network_status
    return self._deliver_network_status(status)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 466, in _deliver_network_status
    return self._deliver_record(record)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 425, in _deliver_record
    handle = mailbox._deliver_record(record, interface=self)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
    interface._publish(record)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe

Exception in thread IntMsgThr:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 299, in check_internal_messages
    self._loop_check_status(
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 223, in _loop_check_status
    local_handle = request()
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 743, in deliver_internal_messages
    return self._deliver_internal_messages(internal_message)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 472, in _deliver_internal_messages
    return self._deliver_record(record)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 425, in _deliver_record
    handle = mailbox._deliver_record(record, interface=self)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
    interface._publish(record)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe

3%|███▍ | 100/3581 [18:37<10:43:05, 11.08s/it]
Traceback (most recent call last):
  File "train_qlora.py", line 206, in <module>
    train(args)
  File "train_qlora.py", line 200, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1553, in train
    return inner_training_loop(
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1927, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2240, in _maybe_log_save_evaluate
    self.log(logs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2595, in log
    self.control = self.callback_handler.on_log(self.args, self.state, self.control, logs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer_callback.py", line 399, in on_log
    return self.call_event("on_log", args, state, control, logs=logs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/trainer_callback.py", line 406, in call_event
    result = getattr(callback, event)(
  File "/opt/conda/lib/python3.8/site-packages/transformers/integrations/integration_utils.py", line 803, in on_log
    self._wandb.log({**logs, "train/global_step": state.global_step})
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 419, in wrapper
    return func(self, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 370, in wrapper_fn
    return func(self, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 360, in wrapper
    return func(self, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1792, in log
    self._log(data=data, step=step, commit=commit)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1567, in _log
    self._partial_history_callback(data, step, commit)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1439, in _partial_history_callback
    self._backend.interface.publish_partial_history(
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 546, in publish_partial_history
    self._publish_partial_history(partial_history)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 89, in _publish_partial_history
    self._publish(rec)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
wandb: While tearing down the service manager. The following error has occurred: [Errno 32] Broken pipe

I wonder whether this is a problem with wandb itself.

plutoda588 commented 11 months ago

@shuxueslpi

shuxueslpi commented 11 months ago

Run pip uninstall wandb if you don't use it.
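If uninstalling is not convenient, wandb reporting can probably also be switched off from Python. The sketch below is an assumption, not something confirmed in this thread: it relies on wandb honoring the WANDB_MODE environment variable and on the script building a transformers.TrainingArguments that accepts report_to. The helper name disable_wandb_reporting is hypothetical, and how train_qlora.py constructs its arguments is not shown here.

```python
import os


def disable_wandb_reporting() -> dict:
    """Turn off wandb logging without uninstalling the package.

    Hypothetical helper: call it before the Trainer is created.
    """
    # Assumption: wandb reads WANDB_MODE and treats "disabled" as a no-op mode.
    os.environ["WANDB_MODE"] = "disabled"
    # Assumption: these kwargs are merged into transformers.TrainingArguments,
    # where report_to="none" disables all logging integrations.
    return {"report_to": "none"}


extra_training_kwargs = disable_wandb_reporting()
print(extra_training_kwargs)
```

The returned dictionary would be passed along as TrainingArguments(..., **extra_training_kwargs) wherever the script creates its training arguments.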

plutoda588 commented 11 months ago

I need to use it, but I still get 'BrokenPipeError: [Errno 32] Broken pipe'. I set num_workers to 0 or 1 and the problem persists. My environment is fully set up. @shuxueslpi
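In the traceback above, the BrokenPipeError raised inside wandb's log call propagates through the Trainer's on_log callback and ends training. One possible way to keep wandb installed while tolerating such a failure is to guard the logging call. This is only a generic sketch: safe_log is a hypothetical helper, and wiring it into a Trainer would need a custom callback that is not shown here.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)


def safe_log(log_fn: Callable[[Dict[str, Any]], None], metrics: Dict[str, Any]) -> bool:
    """Call log_fn(metrics); return False instead of raising on a broken pipe."""
    try:
        log_fn(metrics)
        return True
    except BrokenPipeError as exc:
        # The metrics for this step are lost, but the caller can keep training.
        logger.warning("Metric logging failed, continuing without it: %s", exc)
        return False


def broken_logger(metrics: Dict[str, Any]) -> None:
    """Stand-in for a logging backend whose socket has gone away."""
    raise BrokenPipeError(32, "Broken pipe")


print(safe_log(broken_logger, {"loss": 1.23}))
print(safe_log(lambda metrics: None, {"loss": 1.23}))
```

This hides the symptom rather than fixing the cause, so the metrics logged after the pipe breaks would still be missing from the run.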

plutoda588 commented 11 months ago

@shuxueslpi Thank you, I have solved the issue. I would also like to ask about your environment, training speed, and dataset size. In the environment specified in the documentation my training was quite slow, at 40s/it; it has now improved to about 15s/it, but I don't know whether that is normal.

lizhaoliu-Lec commented 11 months ago

I wonder how you solved it. The wandb BrokenPipeError has been torturing me all day. Please help if you can.

lizhaoliu-Lec commented 11 months ago

Solved the issue by downgrading wandb to 0.13.1.
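To see which wandb version an environment actually has before and after downgrading, a small check like the following may help. It is a sketch using only the standard library (Python 3.8+). The two version constants come from this thread (0.15.10 from the log in the first post, 0.13.1 from this comment), and the helper names are hypothetical.

```python
from importlib import metadata
from typing import Optional

REPORTED_FAILING = "0.15.10"  # version shown in the log of the first post
REPORTED_WORKING = "0.13.1"   # version reported to work in this comment


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def describe(version: Optional[str]) -> str:
    """Relate a wandb version string to the reports in this thread."""
    if version is None:
        return "wandb is not installed"
    if version == REPORTED_WORKING:
        return "matches the version reported to work"
    if version == REPORTED_FAILING:
        return "matches the version that showed the BrokenPipeError"
    return "not covered by the reports in this thread"


print(describe(installed_version("wandb")))
```

Only those two versions are mentioned in the thread, so any other result says nothing about whether the error will occur.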

plutoda588 commented 10 months ago

Thank you. I solved the problem by uninstalling wandb.