princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821

What does "$@" mean in run_sup_examples.sh? #147

Closed seona0111 closed 2 years ago

seona0111 commented 2 years ago

When I used PyCharm, I put the argv list in Script parameters. However, it raised errors because of the `"$@"` script parameter. That's why I removed `"$@"` from the script parameters in PyCharm, but then errors also occurred during training (when it calls `loss = self.compute_loss`). The error message is below:

```
lib/python3.6/site-packages/torch/nn/parallel/comm.py", line 230, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: Input tensor at index 3 has invalid shape [34, 68], but expected [34, 74]
 33%|█████████████▎ | 538/1617 [05:39<11:20, 1.59it/s]
```

The script I ran is below; it is the same as your .sh file, just without the `"$@"` parameter:

```bash
python3 -u /nlp2/SimCSE/train.py \
    --model_name_or_path bert-base-uncased \
    --train_file data/nli_for_simcse.csv \
    --output_dir result/my-sup-simcse-bert-base-uncased \
    --num_train_epochs 3 \
    --per_device_train_batch_size 128 \
    --learning_rate 5e-5 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --eval_steps 125 \
    --pooler_type cls \
    --overwrite_output_dir \
    --temp 0.05 \
    --do_train \
    --do_eval \
    --fp16
```

How can I run this in PyCharm?

gaotianyu1350 commented 2 years ago

Hi,

It seems to be a GPU communication-related error. Try limiting the number of GPUs to 1 and running again.
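For example (a hedged sketch; device index 0 is arbitrary and the remaining training flags are omitted for brevity), you can hide all but one GPU from PyTorch before launching:

```bash
# Expose only GPU 0 to this process; torch.nn.DataParallel then sees a
# single device, so no cross-GPU gather of mismatched batches can occur.
export CUDA_VISIBLE_DEVICES=0
python3 -u train.py --model_name_or_path bert-base-uncased --do_train  # plus your other flags
```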

`"$@"` expands to all the arguments passed to the bash script, so any extra flags you give the script get forwarded to `train.py`. You can safely delete it.
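For illustration, a minimal sketch of what `"$@"` does (the file name `demo.sh` is made up):

```bash
#!/bin/bash
# demo.sh: show what "$@" expands to.
# "$@" is every positional argument given to this script, each kept as
# its own word, so they can be forwarded verbatim to another command.
echo "received $# argument(s)"
python3 -c 'import sys; print(sys.argv[1:])' "$@"
```

Running `bash demo.sh --learning_rate 5e-5 --do_train` prints `received 3 argument(s)` and `['--learning_rate', '5e-5', '--do_train']`. When PyCharm runs `train.py` directly there is no outer script, so there is nothing for `"$@"` to expand to; that is why removing it and passing the flags directly is correct.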

IT-coach-666 commented 2 years ago

Hi, I also encountered a similar error, shown below:

```
{'loss': 0.0002, 'learning_rate': 1.1731719914410094e-07, 'epoch': 1.0}
{'eval_stsb_spearman': 0.7758524564425994, 'eval_sickr_spearman': 0.7358462723765793, 'eval_avg_sts': 0.7558493644095894, 'epoch': 1.0}
Traceback (most recent call last):
  File "train.py", line 585, in <module>
    main()
  File "train.py", line 549, in main
    train_result = trainer.train(model_path=model_path)
  File "/home/huangjiayue/04_SimCSE/SimCSE/simcse/trainers.py", line 464, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/huangjiayue/anaconda3/envs/jy-pharm_paper_search/lib/python3.6/site-packages/transformers/trainer.py", line 1248, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/huangjiayue/anaconda3/envs/jy-pharm_paper_search/lib/python3.6/site-packages/transformers/trainer.py", line 1277, in compute_loss
    outputs = model(**inputs)
  File "/home/huangjiayue/anaconda3/envs/jy-pharm_paper_search/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/huangjiayue/anaconda3/envs/jy-pharm_paper_search/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in forward
    return self.gather(outputs, self.output_device)
  File "/home/huangjiayue/anaconda3/envs/jy-pharm_paper_search/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 174, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/huangjiayue/anaconda3/envs/jy-pharm_paper_search/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/home/huangjiayue/anaconda3/envs/jy-pharm_paper_search/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    for k in out))
  File "<string>", line 6, in __init__
  File "/home/huangjiayue/anaconda3/envs/jy-pharm_paper_search/lib/python3.6/site-packages/transformers/file_utils.py", line 1397, in __post_init__
    for element in iterator:
  File "/home/huangjiayue/anaconda3/envs/jy-pharm_paper_search/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in <genexpr>
    for k in out))
  File "/home/huangjiayue/anaconda3/envs/jy-pharm_paper_search/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, outputs)
  File "/home/huangjiayue/anaconda3/envs/jy-pharm_paper_search/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 71, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/huangjiayue/anaconda3/envs/jy-pharm_paper_search/lib/python3.6/site-packages/torch/nn/parallel/comm.py", line 230, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: Input tensor at index 1 has invalid shape [62, 62], but expected [62, 63]
```

Is this also a GPU communication-related error?

gaotianyu1350 commented 2 years ago

Hi,

It seems that you are using the single-GPU script on multiple GPUs. When you use the single-GPU script, please make sure only one GPU is being used.

IT-coach-666 commented 2 years ago

> Hi,
>
> It seems that you are using the single-GPU script on multiple GPUs. When you use the single-GPU script, please make sure only one GPU is being used.

Hi, the script I used is shown below:

```bash
NUM_GPU=2
PORT_ID=$(expr $RANDOM + 1000)

export OMP_NUM_THREADS=8

python -m torch.distributed.launch --nproc_per_node $NUM_GPU --master_port $PORT_ID train.py \
    --model_name_or_path bert-base-uncased \
    --train_file data/100w-paper_info_en.txt \
    --output_dir result/unsup-simcse-bert-base-uncased_100w-en-tlt-abst-epoch1 \
    --num_train_epochs 10 \
    --per_device_train_batch_size 64 \
    --learning_rate 3e-5 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --eval_steps 125 \
    --pooler_type cls \
    --mlp_only_train \
    --overwrite_output_dir \
    --temp 0.05 \
    --do_train \
    --do_eval \
    --fp16 \
    "$@"
```

And how do I distinguish a single-GPU script from a multi-GPU script? Could you show me the multi-GPU script?

gaotianyu1350 commented 2 years ago

Hi,

This is a multi-GPU script. The unsupervised example we offer is single-GPU, and the supervised example is multi-GPU. A single-GPU script doesn't use `torch.distributed.launch`. Your script here specifies 2 GPUs, so make sure two GPUs are available.
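To make the distinction concrete, a hedged sketch (file paths and flags are illustrative, trimmed from the examples above):

```bash
# Single-GPU script: calls python directly, with no launcher.
# torch.nn.DataParallel would still grab every visible GPU, so pin one:
CUDA_VISIBLE_DEVICES=0 python train.py \
    --model_name_or_path bert-base-uncased \
    --train_file data/wiki1m_for_simcse.txt \
    --output_dir result/unsup-simcse-bert-base-uncased \
    --do_train

# Multi-GPU script: wraps train.py in torch.distributed.launch, which
# spawns --nproc_per_node processes, one per GPU.
python -m torch.distributed.launch --nproc_per_node 2 --master_port 29500 \
    train.py \
    --model_name_or_path bert-base-uncased \
    --train_file data/wiki1m_for_simcse.txt \
    --output_dir result/unsup-simcse-bert-base-uncased \
    --do_train
```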