Closed: yuanzhiyong1999 closed this issue 2 years ago.
Thanks for pointing that out! We just fixed the bug, and that should resolve the problem. We pad the data to a multiple of the number of GPUs for multi-GPU training, so len(self.perm) should be 9742.
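For illustration, here is a minimal sketch of the padding step described above; `padded_perm` is a hypothetical name, not the repo's function. The idea is to round the dataset size up to a multiple of world_size and pad the index permutation with repeated indices, so every rank receives the same number of samples:

```python
import math

def padded_perm(n: int, world_size: int) -> list:
    # Hypothetical sketch of a distributed-sampler-style padding step:
    # round n up to a multiple of world_size, then pad the permutation
    # with duplicate indices so len(perm) == total_size on every rank.
    total_size = math.ceil(n / world_size) * world_size
    perm = list(range(n))
    perm += perm[: total_size - n]  # pad with repeated leading indices
    return perm

print(len(padded_perm(10, 4)))    # 12 -- padded up from 10
print(len(padded_perm(9742, 2)))  # 9742 -- already a multiple of 2
```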
Thank you for your reply. I reran the latest code and a new problem appeared:
(yzy-KEAR) jizhi2@jizhi2-MS-7A78:/media/jizhi2/软件/yzy/KEAR$ bash/task_train.sh
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
start is 1660199195.9345322
[1877331] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '0', 'WORLD_SIZE': '2'}
start is 1660199196.0026214
[1877332] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '1', 'WORLD_SIZE': '2'}
[1877332]: world_size = 2, rank = 1, backend=nccl
[1877331]: world_size = 2, rank = 0, backend=nccl
batch size: 2, total_batch_size: 10
clearing output folder.
args.fp16 is 0
load_vocab google/electra-large-discriminator
load_data data/csqa_ret_3datasets/train_data.json
data: 9742, world_size: 2
load_data data/csqa_ret_3datasets/dev_data.json
data: 1222, world_size: 2
get dir test/
make dataloader ...
max len: 200
95 percent len: 98
train_data 9742
total length: 2436
max len: 168
95 percent len: 97
devlp_data 1222
init_model google/electra-large-discriminator
set config, model_type= electra
deepspeed: False
resume_training: False
model_type= electra
init model finished.
Some weights of the model checkpoint at google/electra-large-discriminator were not used when initializing Model: ['discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias']
- This IS expected if you are initializing Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Model were not initialized from the model checkpoint at google/electra-large-discriminator and are newly initialized: ['scorer.csqa_ret_3datasets.scorer.bias', 'scorer.csqa_ret_3datasets.scorer.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2022-08-11 14:27:05,046 - __main__ - INFO - initializing trainer.
Trainer: fp16 is 0
2022-08-11 14:27:05,730 - __main__ - INFO - initialize trainer finished.
2022-08-11 14:27:05,730 - __main__ - INFO - setting up optimizer
/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
2022-08-11 14:27:05,736 - __main__ - INFO - deepspeed wrap
finish deepspeed wrap
load successfully.
2022-08-11 14:27:05,739 - utils.trainer - INFO - total n_step = 2436, evaluate_step = 1218
---- Epoch: 01 ----
Traceback (most recent call last):
  File "task.py", line 410, in <module>
    srt.train(train_dataloader, devlp_dataloaders, save_last=False, save_every=args.save_every)
  File "task.py", line 93, in train
    self.trainer.train(
  File "/media/jizhi2/软件/yzy/KEAR/utils/trainer.py", line 83, in train
    loss = self._step(batch)
  File "/media/jizhi2/软件/yzy/KEAR/utils/trainer.py", line 110, in _step
    loss, _ = self._forward(batch, self.train_record, mode='train', dataset_name=self.config.data_version)
  File "/media/jizhi2/软件/yzy/KEAR/utils/trainer.py", line 166, in _forward
    batch = clip_batch(batch)
  File "/media/jizhi2/软件/yzy/KEAR/utils/tensor.py", line 25, in clip_batch
    this_loc = input_ids[i, :, -1].any() if num_dim == 3 else input_ids[i, -1]
RuntimeError: all only supports torch.uint8 and torch.bool dtypes
Traceback (most recent call last):
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/home/jizhi2/.conda/envs/yzy-KEAR/bin/python', '-u', 'task.py', '--local_rank=1', '--append_descr', '1', '--data_version', 'csqa_ret_3datasets', '--lr', '1e-5', '--append_answer_text', '1', '--weight_decay', '0.01', '--preset_model_type', 'electra', '--batch_size', '2', '--max_seq_length', '50', '--num_train_epochs', '10', '--save_interval_step', '2', '--continue_train', '--print_number_per_epoch', '2', '--vary_segment_id', '--seed', '42', '--warmup_proportion', '0.1', '--optimizer_type', 'adamw', '--ddp', '--print_loss_step', '10', '--clear_output_folder']' returned non-zero exit status 1.
We have not encountered this problem. It looks like it could be a problem with the PyTorch version. Which version of PyTorch are you using? We recommend using the attached Docker image (see the quickstart in the README) to reproduce the results.
I'm currently using PyTorch 1.7.1, and due to environment restrictions I can't use Docker at the moment.
Can you try PyTorch 1.8.0 and transformers 4.10.2? Without the Docker environment it would be hard to debug the problem. To work around it, you can also simply comment out the "clip_batch" references (it is only a speed optimization).
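If you would rather keep the clipping, a dtype-compatible variant is also possible. Below is a minimal sketch, not the repo's exact code (`clip_batch_compat` and the assumption that 0 is the padding id are hypothetical): casting to bool before calling .any() sidesteps the "all only supports torch.uint8 and torch.bool dtypes" error that older PyTorch (e.g. 1.7.1) raises for integer tensors.

```python
import torch

def clip_batch_compat(input_ids: torch.Tensor) -> torch.Tensor:
    # Hypothetical sketch of a clip_batch-style helper; assumes 0 is
    # the padding id. Drops trailing columns that are entirely padding.
    seq_len = input_ids.size(-1)
    while seq_len > 1:
        last_col = input_ids[..., seq_len - 1]
        # .bool() keeps .any() legal on integer tensors in PyTorch 1.7.x
        if last_col.bool().any():
            break
        seq_len -= 1
    return input_ids[..., :seq_len]

# Usage: a batch padded to length 6; the last two columns are all padding.
ids = torch.tensor([[5, 6, 7, 0, 0, 0],
                    [8, 9, 0, 0, 0, 0],
                    [1, 2, 3, 4, 0, 0]])
print(clip_batch_compat(ids).shape)  # torch.Size([3, 4])
```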
I commented out the "clip_batch" references and it works now. Thank you very much for your help!
Hello, when I run the following command from the "task_train.sh" file:
deepspeed task.py --append_descr 1 --append_triples --append_retrieval 1 --data_version csqa_ret_3datasets --lr 4e-6 --append_answer_text 1 --weight_decay 0.1 --preset_model_type debertav3 --batch_size 4 --max_seq_length 512 --num_train_epochs 15 --save_interval_step 4 --continue_train --print_number_per_epoch 1 --vary_segment_id --seed 42 --warmup_proportion 0.1 --optimizer_type adamw --ddp --deepspeed --freq_rel 1
the error message is as follows:
2022-08-13 10:13:29,347 - utils.trainer - INFO - total n_step = 1218, evaluate_step = 1218
---- Epoch: 01 ----
Traceback (most recent call last):
  File "task.py", line 410, in <module>
    srt.train(train_dataloader, devlp_dataloaders, save_last=False, save_every=args.save_every)
  File "task.py", line 93, in train
    self.trainer.train(
  File "/media/jizhi2/软件/yzy/KEAR/utils/trainer.py", line 81, in train
    for step, batch in enumerate(train_looper):
  File "/media/jizhi2/软件/yzy/KEAR/utils/dataloader_sampler.py", line 32, in __iter__
    batch = next(self.dataloader_iter)
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/media/jizhi2/软件/yzy/KEAR/specific/tensor.py", line 71, in __getitem__
    data = self.get_example(idx)
  File "/media/jizhi2/软件/yzy/KEAR/specific/tensor.py", line 89, in get_example
    features.append(Feature.make_single(example.idx, main_tokens, context_tokens, self.tokenizer,
UnboundLocalError: local variable 'context_tokens' referenced before assignment
[2022-08-13 10:13:33,043] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 2641145
[2022-08-13 10:13:33,044] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 2641146
[2022-08-13 10:13:33,070] [ERROR] [launch.py:292:sigkill_handler] ['/home/jizhi2/.conda/envs/yzy-KEAR/bin/python', '-u', 'task.py', '--local_rank=1', '--append_descr', '1', '--append_triples', '--append_retrieval', '1', '--data_version', 'csqa_ret_3datasets', '--lr', '4e-6', '--append_answer_text', '1', '--weight_decay', '0.1', '--preset_model_type', 'debertav3', '--batch_size', '4', '--max_seq_length', '512', '--num_train_epochs', '15', '--save_interval_step', '4', '--continue_train', '--print_number_per_epoch', '1', '--vary_segment_id', '--seed', '42', '--warmup_proportion', '0.1', '--optimizer_type', 'adamw', '--ddp', '--deepspeed', '--freq_rel', '1'] exits with return code = 1
What could be causing this?
Have you pulled the latest version? The latest version should not have this problem any more. If you check the implementation at https://github.com/microsoft/KEAR/blob/7376a3d190e5c04d5da9b99873abe621ae562edf/specific/tensor.py#L89, context_tokens is always defined. The old version might have had the problem you mentioned.
I'm currently using the latest version of the code, and I don't know why this problem still occurs. I also have another question: if I don't use DeepSpeed, what changes do I need to make, and how extensive are they?
Can you check the code here: https://github.com/microsoft/KEAR/blob/7376a3d190e5c04d5da9b99873abe621ae562edf/specific/tensor.py#L83 Is it "if" or "elif"? If it is "elif", as in the old version, the above error will appear. If your code is indeed up to date, you may want to step through those lines in tensor.py to see what happened; it is unlikely that the error will appear there. Our code supports using PyTorch DDP instead of DeepSpeed; you can check out the example here: https://github.com/microsoft/KEAR/blob/7376a3d190e5c04d5da9b99873abe621ae562edf/bash/task_train.sh#L13
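For illustration, here is a minimal sketch of the failure mode; the flag and function names are hypothetical, not the repo's exact code. With elif, context_tokens is only assigned when one of the branches runs, so the later read raises UnboundLocalError; with an initializer and independent if blocks it is always defined:

```python
def build_context_old(append_descr, append_retrieval):
    # Old pattern: elif means at most one branch runs, and if neither
    # condition holds, context_tokens is never assigned.
    if append_descr:
        context_tokens = ["descr"]
    elif append_retrieval:
        context_tokens = ["retrieval"]
    return context_tokens  # UnboundLocalError when both flags are False

def build_context_new(append_descr, append_retrieval):
    # New pattern: initialize first, then extend with independent ifs,
    # so context_tokens is always defined (and both parts can apply).
    context_tokens = []
    if append_descr:
        context_tokens += ["descr"]
    if append_retrieval:
        context_tokens += ["retrieval"]
    return context_tokens
```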
I have the following problem when I run it: after debugging, I found that len(self.perm) is 9741 while self.total_size is 9742. What is the reason for this?