princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
MIT License
3.36k stars 507 forks

Error when I run unsupervised: RuntimeError: Input tensor at index 1 has invalid shape [32, 32], but expected [32, 33] #156

Closed xing-ye closed 2 years ago

xing-ye commented 2 years ago

File "train.py", line 591, in main() File "train.py", line 555, in main train_result = trainer.train(model_path=model_path) File "/mnt/data/data/home/zhanghaoran/learn_project/SimCSE-main/simcse/trainers.py", line 464, in train tr_loss += self.training_step(model, inputs) File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/transformers/trainer.py", line 1248, in training_step loss = self.compute_loss(model, inputs) File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/transformers/trainer.py", line 1277, in compute_loss outputs = model(*inputs) File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl return forward_call(input, *kwargs) File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 169, in forward return self.gather(outputs, self.output_device) File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 181, in gather return gather(outputs, output_device, dim=self.dim) File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 78, in gather res = gather_map(outputs) File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 69, in gather_map return type(out)((k, gather_map([d[k] for d in outputs])) File "", line 7, in init File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/transformers/file_utils.py", line 1383, in __post_init__ for element in iterator: File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 69, in return type(out)((k, gather_map([d[k] for d in outputs])) File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map return Gather.apply(target_device, dim, outputs) File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 75, in forward return comm.gather(inputs, ctx.dim, ctx.target_device) File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 235, in gather return torch._C._gather(tensors, dim, destination) RuntimeError: Input tensor at index 1 has invalid shape [32, 32], but expected [32, 33]

gaotianyu1350 commented 2 years ago

Hi,

Can you provide more details? For example, what is the script?

xing-ye commented 2 years ago

Thank you very much for your reply. I just ran run_unsup_example.sh without modifying any of the parameter settings in it. There are no errors during training, but this error occurs when training is almost at 100%. The last part of the result saved in the trainer_state.json file is as follows:

    {
      "epoch": 0.85,
      "eval_avg_sts": 0.7686240710551472,
      "eval_sickr_spearman": 0.7343745994285155,
      "eval_stsb_spearman": 0.8028735426817789,
      "step": 6625
    },
    {
      "epoch": 0.86,
      "eval_avg_sts": 0.7711475035762223,
      "eval_sickr_spearman": 0.7350268137549437,
      "eval_stsb_spearman": 0.8072681933975009,
      "step": 6750
    }
  ],
  "max_steps": 7813,
  "num_train_epochs": 1,
  "total_flos": 0,
  "trial_name": null,
  "trial_params": null
}

leedhn commented 2 years ago

I have the same error.

I just edited a few lines in trainers.py to naively work around it.

Starting around line 450:

            for step, inputs in enumerate(epoch_iterator):
                # Skip past any already trained steps if resuming training
                if steps_trained_in_current_epoch > 0:
                    steps_trained_in_current_epoch -= 1
                    continue

                if (step + 1) % self.args.gradient_accumulation_steps == 0:
                    self.control = self.callback_handler.on_step_begin(self.args, self.state, self.control)

                if ((step + 1) % self.args.gradient_accumulation_steps != 0) and self.args.local_rank != -1:
                    # Avoid unnecessary DDP synchronization since there will be no backward pass on this example.
                    with model.no_sync():
                        tr_loss += self.training_step(model, inputs)
                else:
                    # edited ---------------------------------
                    # The last batch can be smaller than the rest, so its size is not
                    # divisible by the number of GPUs and DataParallel's gather fails.
                    # Zero-pad the batch up to the next multiple (7 GPUs here).
                    x, y, z = inputs['attention_mask'].size()
                    if x % 7 != 0:
                        for key in inputs.keys():
                            inputs[key] = torch.cat([inputs[key], torch.zeros([7 - (x % 7), y, z], dtype=torch.long)], dim=0)
                    # ----------------------------------------
                    tr_loss += self.training_step(model, inputs)
                self._total_flos += self.floating_point_ops(inputs)

Perhaps 7 is the number of GPUs, but I'm not sure about this.
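
If you want to avoid hard-coding the 7, a slightly more general version of the same workaround might look like the sketch below. It is only a sketch, not tested against the repo: it assumes the divisor should be the visible GPU count from torch.cuda.device_count() and that every tensor in inputs shares attention_mask's 3-D shape (batch, num_sent, seq_len).

    import torch

    def pad_batch_for_data_parallel(inputs, num_gpus=None):
        # Zero-pad the batch along dim 0 so DataParallel can split it evenly across
        # GPUs; otherwise gathering the last, smaller batch fails with a shape error.
        if num_gpus is None:
            num_gpus = torch.cuda.device_count()
        if num_gpus <= 1:
            return inputs  # nothing to do on CPU or a single GPU
        x, y, z = inputs['attention_mask'].size()
        remainder = x % num_gpus
        if remainder != 0:
            pad = num_gpus - remainder
            for key in inputs.keys():
                padding = torch.zeros([pad, y, z], dtype=inputs[key].dtype, device=inputs[key].device)
                inputs[key] = torch.cat([inputs[key], padding], dim=0)
        return inputs

    # usage inside the training loop, before self.training_step(model, inputs):
    # inputs = pad_batch_for_data_parallel(inputs)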

gaotianyu1350 commented 2 years ago

Hi,

The unsupervised script is written for single-GPU use. According to the log, it seems that you used multiple GPUs. If you want to use distributed training, please refer to the supervised training code; otherwise, please make sure you only use one GPU.
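
For example, one simple way to make sure only one GPU is visible to PyTorch is to set CUDA_VISIBLE_DEVICES when launching the script:

    # expose only GPU 0 to the process, then run the unsupervised example as usual
    CUDA_VISIBLE_DEVICES=0 bash run_unsup_example.sh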

xing-ye commented 2 years ago

> I have the same error. I just edited a few lines in trainers.py to naively work around it. [...] Perhaps 7 is the number of GPUs, but I'm not sure about this.

Thank you very much for your answer. I think you are right: I changed the code following your method and the project now runs successfully, and 7 is indeed the number of GPUs. Sorry for the late reply.

xndong commented 2 years ago

What should I do in PyTorch or Transformers to make sure only one GPU is used? I mean, how do I set it up to use just one GPU?

zliguo commented 1 year ago

I also encountered this problem because of the GPU setup. You can set the number of GPUs on your machine in the script and use distributed training to solve it:

    #!/bin/bash

    # In this example, we show how to train SimCSE on unsupervised Wikipedia data.
    # If you want to train it with multiple GPU cards, see "run_sup_example.sh"
    # about how to use PyTorch's distributed data parallel.

    # new
    NUM_GPU=2
    PORT_ID=$(expr $RANDOM + 1000)
    export OMP_NUM_THREADS=8

    python -m torch.distributed.launch --nproc_per_node $NUM_GPU --master_port $PORT_ID train.py \
        --model_name_or_path bert-base-uncased \
        --train_file data/wiki1m_for_simcse.txt \
        --output_dir result/my-unsup-simcse-bert-base-uncased \
        --num_train_epochs 1 \
        --per_device_train_batch_size 64 \
        --learning_rate 3e-5 \
        --max_seq_length 32 \
        --evaluation_strategy steps \
        --metric_for_best_model stsb_spearman \
        --load_best_model_at_end \
        --eval_steps 125 \
        --pooler_type cls \
        --mlp_only_train \
        --overwrite_output_dir \
        --temp 0.05 \
        --do_train \
        --do_eval \
        --fp16 \
        "$@"