128Ghe980 opened this issue 9 months ago
I'm trying to use the DeepSpeed-Chat stage 2 scripts to do RLHF with the Qwen-1.8B-Chat model. I changed some parts of dschat and main.py to load my model; the main difference is:
I load my model with "AutoModelForCausalLM" instead of "AutoModel" in model_utils.py, but it still has some problems.
I load Qwen like this: create_hf_model(model_class=AutoModelForCausalLM, model_name_or_path=actor_model_name_or_path, tokenizer=self.tokenizer, ds_config=ds_config, dropout=self.args.actor_dropout) without any error. Maybe my transformers or deepspeed version is not right.
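Roughly, the loading path looks like this (a simplified, self-contained sketch of my change, not the exact DeepSpeed-Chat code; the hard-coded path, trust_remote_code, and the dummy ds_config/dropout values below are just for illustration):

```python
# Minimal sketch of the modified loading in model_utils.py / main.py.
# create_hf_model is DeepSpeed-Chat's helper (dschat/utils/model/model_utils.py);
# the path, trust_remote_code, and placeholder values are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from dschat.utils.model.model_utils import create_hf_model

model_path = "/home/tione/notebook/model/Qwen-1_8B-Chat/"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model = create_hf_model(
    model_class=AutoModelForCausalLM,   # changed from AutoModel
    model_name_or_path=model_path,
    tokenizer=tokenizer,
    ds_config=None,                     # the real script passes the DeepSpeed config dict
    dropout=0.0)
```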
The error:
Running training
Evaluating reward, Epoch 0/1
Beginning of Epoch 1/1, Total Micro Batches 8479
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:391: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants. warnings.warn(
Traceback (most recent call last):
File "/home/tione/notebook/code/RLHF/main.py", line 441, in <module>
main()
File "/home/tione/notebook/code/RLHF/main.py", line 390, in main
outputs = rm_model(**batch, use_cache=False)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1769, in forward
loss = self.module(*inputs, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1548, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/tione/notebook/code/RLHF/dschat/utils/model/reward_model.py", line 57, in forward
transformer_outputs = self.rwtransformer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1548, in _call_impl
result = forward_call(*args, **kwargs)
File "/root/.cache/huggingface/modules/transformers_modules/modeling_qwen.py", line 1060, in forward
lm_logits = self.lm_head(hidden_states)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1537, in _call_impl
result = hook(self, args)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 382, in _pre_forward_module_hook
self.pre_sub_module_forward_function(module)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 494, in pre_sub_module_forward_function
param_coordinator.fetch_sub_module(sub_module, forward=True)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 303, in fetch_sub_module
assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 196, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0, 2048), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {297}, 'ds_tensor.shape': torch.Size([0])}
[2024-02-22 16:15:14,956] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 12019
[2024-02-22 16:15:14,956] [ERROR] [launch.py:321:sigkill_handler] ['/usr/bin/python', '-u', 'main.py', '--local_rank=0', '--data_path', 'local/jsonfile', '--data_split', '2,4,4', '--model_name_or_path', '/home/tione/notebook/model/Qwen-1_8B-Chat/', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '8', '--max_seq_len', '512', '--learning_rate', '9.65e-6', '--weight_decay', '0.1', '--num_padding_at_beginning', '0', '--num_train_epochs', '1', '--gradient_accumulation_steps', '4', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--gradient_checkpointing', '--zero_stage', '3', '--deepspeed', '--output_dir', './output'] exits with return code = 1
Something seems to go wrong when it reaches: outputs = rm_model(**batch, use_cache=False)
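For context, I did not change the data collator, so (assuming the usual step-2 batch keys from DeepSpeed-Chat's DataCollatorReward) that call is roughly equivalent to:

```python
# What rm_model(**batch, use_cache=False) expands to, assuming the standard
# step-2 batch (chosen and rejected sequences stacked along the batch dimension).
outputs = rm_model(input_ids=batch["input_ids"],
                   attention_mask=batch["attention_mask"],
                   use_cache=False)
```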
Can anyone help?
My main.py:
My .sh file: