Open plutoda588 opened 11 months ago
@shuxueslpi
pip uninstall wandb if you don't use it
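If uninstalling is not an option, wandb can also be switched off per run through its WANDB_MODE environment variable ("disabled" or "offline"). Below is a minimal sketch of that idea; the helper name wandb_off_env is hypothetical, and whether it avoids the BrokenPipeError in this particular setup has not been confirmed in this thread.

```python
import os


def wandb_off_env(base_env=None, mode="disabled"):
    """Return a copy of the environment with WANDB_MODE set.

    Hypothetical helper: 'disabled' turns wandb into a no-op,
    'offline' keeps logging locally without syncing.
    """
    if mode not in ("disabled", "offline"):
        raise ValueError("mode must be 'disabled' or 'offline'")
    # Copy so the caller's mapping (or os.environ) is not mutated.
    env = dict(os.environ if base_env is None else base_env)
    env["WANDB_MODE"] = mode
    return env


if __name__ == "__main__":
    env = wandb_off_env()
    print("WANDB_MODE =", env["WANDB_MODE"])
```

The returned mapping can be passed as env= to subprocess.run when launching train_qlora.py, or the same variable can be exported in the shell before running the command.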
I need to use it, but I hit the error 'BrokenPipeError: [Errno 32] Broken pipe'. I set num_workers to 0 or 1, but the problem still exists. My environment is fully set up. @shuxueslpi
@shuxueslpi Thank you, I have solved the issue. I would also like to ask about your environment, training speed, and data size. In the environment specified in the documentation my training was quite slow at 40 s/it; it has since improved to 15 s/it, but I don't know whether that is normal.
I wonder how you solved it. The wandb BrokenPipeError has been torturing me the whole day. Please help me if you can.
Solved the issue by downgrading wandb to 0.13.1.
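To check whether an environment is on a wandb release newer than the 0.13.1 mentioned above (the failing log in this thread reports 0.15.10), a small sketch like the following may help. The function names are hypothetical, and the version parsing assumes plain dotted integers, so it would not handle pre-release tags.

```python
from importlib import metadata


def parse_version(text):
    """Turn '0.15.10' into (0, 15, 10); assumes plain dotted integers."""
    return tuple(int(part) for part in text.split("."))


def is_newer_than_pin(installed, pinned="0.13.1"):
    """True if the installed version is strictly newer than the pin."""
    return parse_version(installed) > parse_version(pinned)


def installed_wandb_version():
    """Return the installed wandb version string, or None if absent."""
    try:
        return metadata.version("wandb")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_wandb_version()
    if found is None:
        print("wandb is not installed")
    elif is_newer_than_pin(found):
        print(f"wandb {found} is newer than 0.13.1; "
              "'pip install wandb==0.13.1' would downgrade it")
    else:
        print(f"wandb {found} is at or below 0.13.1")
```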
Thank you, I solved the problem by uninstalling wandb.
Thank you. Please excuse my English. When I run

CUDA_VISIBLE_DEVICES=1 python3 train_qlora.py --train_args_json chatGLM_6B_QLoRA.json --model_name_or_path /T106/chatGLM-6B-QLoRA-main/chatGLM-6B-QLoRA-main/remote_scripts/chatglm-6b/ --train_data_path data/train.jsonl --eval_data_path data/dev.jsonl --lora_rank 4 --lora_dropout 0.05 --compute_dtype fp32

I get this error:

(base) root@461jc47ml0du4-0:/T106/chatGLM-6B-QLoRA-main/chatGLM-6B-QLoRA-main# CUDA_VISIBLE_DEVICES=1 python3 train_qlora.py --train_args_json chatGLM_6B_QLoRA.json --model_name_or_path /T106/chatGLM-6B-QLoRA-main/chatGLM-6B-QLoRA-main/remote_scripts/chatglm-6b/ --train_data_path data/train.jsonl --eval_data_path data/dev.jsonl --lora_rank 4 --lora_dropout 0.05 --compute_dtype fp32
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:11<00:00, 1.38s/it]
trainable params: 1,835,008 || all params: 6,175,121,408 || trainable%: 0.029716144489446126
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-7be044e55537389d/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 425.26it/s]
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-7be044e55537389d/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-b1ad1cf49d010a09.arrow
Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/json/default-7be044e55537389d/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-7f22050519838b48.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-7be044e55537389d/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-82c52662d9060a3a.arrow
Found cached dataset json (/root/.cache/huggingface/datasets/json/default-15aa3e3de12fc81f/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1132.07it/s]
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-15aa3e3de12fc81f/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-d76e708a4953afce.arrow
Loading cached shuffled indices for dataset at /root/.cache/huggingface/datasets/json/default-15aa3e3de12fc81f/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-466cf4ff38bae650.arrow
Loading cached processed dataset at /root/.cache/huggingface/datasets/json/default-15aa3e3de12fc81f/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4/cache-74e21cf8bae50992.arrow
wandb: Currently logged in as: 2315553823 (fky_hbj). Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.15.10
wandb: Run data is saved locally in /T106/chatGLM-6B-QLoRA-main/chatGLM-6B-QLoRA-main/wandb/run-20230920_070236-anybjcsm
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run kind-sunset-17
wandb: ⭐️ View project at https://wandb.ai/fky_hbj/huggingface
wandb: 🚀 View run at https://wandb.ai/fky_hbj/huggingface/runs/anybjcsm
0%| | 0/3581 [00:00<?, ?it/s]use_cache=True is incompatible with gradient checkpointing. Setting use_cache=False...
0%| | 2/3581 [00:25<12:28:26, 12.55s/it]
Exception in thread NetStatThr:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 267, in check_network_status
self._loop_check_status(
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 223, in _loop_check_status
Exception in thread IntMsgThr:
local_handle = request()
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 735, in deliver_network_status
File "/opt/conda/lib/python3.8/threading.py", line 932, in _bootstrap_inner
return self._deliver_network_status(status)
self.run()
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 466, in _deliver_network_status
File "/opt/conda/lib/python3.8/threading.py", line 870, in run
return self._deliver_record(record)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 425, in _deliver_record
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 299, in check_internal_messages
handle = mailbox._deliver_record(record, interface=self)
self._loop_check_status(
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 223, in _loop_check_status
local_handle = request()
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 743, in deliver_internal_messages
return self._deliver_internal_messages(internal_message)
interface._publish(record)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 472, in _deliver_internal_messages
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
self._sock_client.send_record_publish(record)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
return self._deliver_record(record)
self.send_server_request(server_req)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 425, in _deliver_record
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
handle = mailbox._deliver_record(record, interface=self)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
self._send_message(msg)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
interface._publish(record)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
self._sock_client.send_record_publish(record)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
self.send_server_request(server_req)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
self._send_message(msg)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
3%|███▍ | 100/3581 [18:37<10:43:05, 11.08s/it]
Traceback (most recent call last):
File "train_qlora.py", line 206, in <module>
train(args)
File "train_qlora.py", line 200, in train
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1553, in train
return inner_training_loop(
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 1927, in _inner_training_loop
self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2240, in _maybe_log_save_evaluate
self.log(logs)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer.py", line 2595, in log
self.control = self.callback_handler.on_log(self.args, self.state, self.control, logs)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer_callback.py", line 399, in on_log
return self.call_event("on_log", args, state, control, logs=logs)
File "/opt/conda/lib/python3.8/site-packages/transformers/trainer_callback.py", line 406, in call_event
result = getattr(callback, event)(
File "/opt/conda/lib/python3.8/site-packages/transformers/integrations/integration_utils.py", line 803, in on_log
self._wandb.log({**logs, "train/global_step": state.global_step})
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 419, in wrapper
return func(self, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 370, in wrapper_fn
return func(self, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 360, in wrapper
return func(self, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1792, in log
self._log(data=data, step=step, commit=commit)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1567, in _log
self._partial_history_callback(data, step, commit)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 1439, in _partial_history_callback
self._backend.interface.publish_partial_history(
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 546, in publish_partial_history
self._publish_partial_history(partial_history)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 89, in _publish_partial_history
self._publish(rec)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
self._sock_client.send_record_publish(record)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
self.send_server_request(server_req)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
self._send_message(msg)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
File "/opt/conda/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
wandb: While tearing down the service manager. The following error has occurred: [Errno 32] Broken pipe

I wonder if this is a problem with wandb.
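The final traceback shows the BrokenPipeError propagating out of the logging call and ending the training run. One possible mitigation, sketched here under the assumption that losing a metric is preferable to aborting training, is to guard the logging call. safe_log and flaky_logger are hypothetical names and are not part of wandb or transformers.

```python
def safe_log(log_fn, payload):
    """Call log_fn(payload); return False instead of raising on a broken pipe."""
    try:
        log_fn(payload)
        return True
    except BrokenPipeError as exc:
        # The logging backend went away; skip this metric and carry on.
        print(f"metric logging skipped: {exc}")
        return False


def flaky_logger(payload):
    """Stand-in for a logger whose backing socket has closed."""
    raise BrokenPipeError(32, "Broken pipe")


if __name__ == "__main__":
    ok = safe_log(flaky_logger, {"loss": 1.0})
    print("logged" if ok else "training would continue without this metric")
```

With a real run, the function passed as log_fn would be the actual logging call. Whether later logging calls recover after the first failure is not something this thread establishes.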