Closed: ZizoAdam closed this issue 1 year ago
Hi
While waiting for @pacman100's comment, maybe you can check what the shape of self.wte is. It would also be a good idea to double-check whether the issue happens without DeepSpeed.
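For example, something along these lines (a minimal sketch; the checkpoint name and attribute path assume a GPTJForCausalLM as in the traceback):

```python
from transformers import AutoModelForCausalLM

# Illustrative checkpoint; use whatever model you are fine-tuning.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6b")

# The token embedding weight should be 2-D: (vocab_size, hidden_size).
# Under ZeRO-3 the parameter can show up partitioned (0-/1-D) when the model
# was not set up through a proper distributed launch, which triggers the error.
print(model.transformer.wte.weight.shape)
```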
File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/transformers/models/gptj/modeling_gptj.py", line 634, in forward
inputs_embeds = self.wte(input_ids)
The issue does not happen without deepspeed; however, we are unable to train without deepspeed because we do not have much in the way of system resources.
Which DeepSpeed version are you using, and how are you launching the script?
Deepspeed 0.9.5, just launching it with python3 script.py
Thought so. Please use a distributed launcher such as torchrun, deepspeed, or accelerate when using DeepSpeed/DDP/FSDP, or any time you are doing distributed training.
Please refer to the documentation; that should resolve the issue.
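For example (hypothetical commands, assuming 2 GPUs and a script called script.py; adjust GPU counts and ports for your setup):

```bash
torchrun --nproc_per_node 2 script.py
# or
deepspeed --num_gpus 2 script.py
# or
accelerate launch --num_processes 2 script.py
```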
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I also have the same problem, also DeepSpeed stage 3 with the Trainer. @ZizoAdam did you solve the problem?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@yuxyang88 if the solution did not work for you, feel free to open a new issue with a reproducer (as small as possible), making sure you are using the latest version of transformers.
Therefore, please use a distributed launcher such as torchrun, deepspeed, or accelerate when using DeepSpeed/DDP/FSDP or whenever you are doing distributed training. Please refer:
My program reported the same error (RuntimeError: 'weight' must be 2-D), but I did start the distributed training with deepspeed. I do not understand your answer; why do you think it would solve the problem?
Hi @nomadlx
Please open a new issue with a reproducer (as small as possible but complete).
Also make sure you are using the latest version of transformers / accelerate too.
Thanks.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I also ran into the same problem; it was solved after I downgraded transformers from 4.35.0 to 4.31.0.
I got the same issue even after downgrading transformers from 4.35.0 to 4.31.0 as Hagtaril commented, with deepspeed. Has anyone resolved the issue? My deepspeed version is 0.10.0. It worked well without deepspeed.
I got the same issue and worked it out after a day. I hit it when training DPO and PPO with the Hugging Face trl library. The root cause of these errors is incorrect DeepSpeed initialization of your model. To solve the issue, double-check the following:
1) Make sure you call deepspeed correctly (e.g. deepspeed --num_gpus <> --master_port=<> xxx.py) when launching the training job. This should cover most cases if you are just training a single model.
2) For trickier scenarios (training DPO or PPO), make sure ALL models are correctly initialized with DeepSpeed. Hugging Face's TRL library has some bugs in initializing DeepSpeed for the reference model, reward model, etc. So it is safest to initialize each model with from_pretrained before passing it to the Hugging Face trainer classes (see the sketch after this list). In contrast, initializing reference models with TRL or copy.deepcopy() yields incorrect DeepSpeed initializations. You may see errors like this:
Tensors must be 2-D
AssertionError: {'id': 291, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {456}, 'ds_tensor.shape': torch.Size([0])} : {'id': 291, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0,), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_m
3) The errors above cannot be solved by downgrading to 4.31.0. Also, I personally do not think downgrading is a good solution, as we will depend on new architectures and features (e.g. MistralForCausalLM) in future versions.
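A minimal sketch of point 2 (assuming trl's DPOTrainer; the checkpoint name, training_args, and train_dataset are placeholders for your own setup):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer

model_name = "your-org/your-base-model"  # placeholder checkpoint

# Load the policy and the reference model separately via from_pretrained
# instead of deep-copying the policy, so DeepSpeed can initialize both properly.
model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=training_args,           # your TrainingArguments pointing at the deepspeed config
    train_dataset=train_dataset,  # your preference dataset
    tokenizer=tokenizer,
)
trainer.train()
```

Launched with the deepspeed launcher from point 1, both models then get a proper ZeRO setup.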
I got the "weight" must be 2-D"
issue using zero 3 with the TRL library to do DPO. I was also using the PEFT library to add two LoRA adapters to the model (one for the reference and one for the trained model).
Solution: I removed the embedding layer as a target module in the LoRA configs and it worked. I'm not sure why, but since the stack trace had
File "/home/augustus/miniconda3/envs/adamTraining/lib/python3.10/site-packages/torch/nn/functional.py", line 2210, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
I just tried removing it.
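For reference, roughly what the change looked like (a hypothetical LoraConfig; the module names here match GPT-J-style attention layers and may differ for your model):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    # "wte" (the embedding layer) used to be listed here and triggered the
    # "'weight' must be 2-D" error under ZeRO-3; dropping it fixed the run.
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
```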
How to fix the bug "Tensors must be 2-D"?
Initialize each model (reference and policy) with from_pretrained before passing them to the Hugging Face trainer classes.
System Info
transformers version: 4.30.2
Who can help?
@pacman100 @sgugger
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
The dataset being used is my own: just a few hundred strings in a CSV file produced with pandas.
Running the following code
using the following config file
Causes an error at trainer.train()
Expected behavior
I would expect training to begin, or a more verbose error to help fix the issue (if it is possible to do so from my side).