microsoft / KEAR

Official code for achieving human parity on CommonsenseQA with External Attention

AssertionError #12

Closed yuanzhiyong1999 closed 2 years ago

yuanzhiyong1999 commented 2 years ago

I have the following problem when I run it:

(yzy-KEAR) jizhi2@jizhi2-MS-7A78:/media/jizhi2/软件/yzy/KEAR$ bash/task_train.sh
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
start is 1660190764.6195216
start is 1660190764.6195297
[1858634] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '0', 'WORLD_SIZE': '2'}
[1858635] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '1', 'WORLD_SIZE': '2'}
[1858635]: world_size = 2, rank = 1, backend=nccl
[1858634]: world_size = 2, rank = 0, backend=nccl
batch size: 2, total_batch_size: 10
batch size: 2, total_batch_size: 10
clearing output folder.
args.fp16 is 0
args.fp16 is 0
load_vocab google/electra-large-discriminator
load_vocab google/electra-large-discriminator
load_data data/csqa_ret_3datasets/train_data.json
load_data data/csqa_ret_3datasets/train_data.json
data: 9741, world_size: 2
load_data data/csqa_ret_3datasets/dev_data.json
data: 9741, world_size: 2
load_data data/csqa_ret_3datasets/dev_data.json
data: 1222, world_size: 2
get dir test/
make dataloader ...
data: 1222, world_size: 2
get dir test/
make dataloader ...
max len: 200
95 percent len: 98
train_data 9741
total length: 2436
max len: 200
95 percent len: 98
train_data 9741
total length: 2436
max len: 168
95 percent len: 97
devlp_data 1222
init_model google/electra-large-discriminator
set config, model_type= electra
deepspeed: False
resume_training: False
max len: 168
95 percent len: 97
devlp_data 1222
init_model google/electra-large-discriminator
set config, model_type= electra
deepspeed: False
resume_training: False
model_type= electra
model_type= electra
init model finished.
init model finished.
Some weights of the model checkpoint at google/electra-large-discriminator were not used when initializing Model: ['discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense.weight']
- This IS expected if you are initializing Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Model were not initialized from the model checkpoint at google/electra-large-discriminator and are newly initialized: ['scorer.csqa_ret_3datasets.scorer.weight', 'scorer.csqa_ret_3datasets.scorer.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of the model checkpoint at google/electra-large-discriminator were not used when initializing Model: ['discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.bias']
- This IS expected if you are initializing Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Model were not initialized from the model checkpoint at google/electra-large-discriminator and are newly initialized: ['scorer.csqa_ret_3datasets.scorer.weight', 'scorer.csqa_ret_3datasets.scorer.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2022-08-11 12:06:35,144 - __main__ - INFO - initializing trainer.
2022-08-11 12:06:35,144 - __main__ - INFO - initializing trainer.
Trainer: fp16 is 0
2022-08-11 12:06:35,906 - __main__ - INFO - initialize trainer finished.
Trainer: fp16 is 0
2022-08-11 12:06:35,906 - __main__ - INFO - setting up optimizer
2022-08-11 12:06:35,906 - __main__ - INFO - initialize trainer finished.
2022-08-11 12:06:35,906 - __main__ - INFO - setting up optimizer
/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
2022-08-11 12:06:35,912 - __main__ - INFO - deepspeed wrap
finish deepspeed wrap
2022-08-11 12:06:35,912 - __main__ - INFO - deepspeed wrap
finish deepspeed wrap
load successfully.
load successfully.
2022-08-11 12:06:35,915 - utils.trainer - INFO - total n_step = 2436, evaluate_step = 1218
---- Epoch: 01 ----
2022-08-11 12:06:35,915 - utils.trainer - INFO - total n_step = 2436, evaluate_step = 1218
---- Epoch: 01 ----
Traceback (most recent call last):
  File "task.py", line 410, in <module>
    srt.train(train_dataloader, devlp_dataloaders, save_last=False, save_every=args.save_every)
  File "task.py", line 93, in train
    self.trainer.train(
  File "/media/jizhi2/软件/yzy/KEAR/utils/trainer.py", line 81, in train
    for step, batch in enumerate(train_looper):
  File "/media/jizhi2/软件/yzy/KEAR/utils/dataloader_sampler.py", line 32, in __iter__
    batch = next(self.dataloader_iter)
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 474, in _next_data
    index = self._next_index()  # may raise StopIteration
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 427, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 227, in __iter__
    for idx in self.sampler:
  File "/media/jizhi2/软件/yzy/KEAR/utils/resumable_sampler.py", line 31, in __iter__
    assert len(self.perm) == self.total_size
AssertionError
Traceback (most recent call last):
  File "task.py", line 410, in <module>
    srt.train(train_dataloader, devlp_dataloaders, save_last=False, save_every=args.save_every)
  File "task.py", line 93, in train
    self.trainer.train(
  File "/media/jizhi2/软件/yzy/KEAR/utils/trainer.py", line 81, in train
    for step, batch in enumerate(train_looper):
  File "/media/jizhi2/软件/yzy/KEAR/utils/dataloader_sampler.py", line 32, in __iter__
    batch = next(self.dataloader_iter)
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 474, in _next_data
    index = self._next_index()  # may raise StopIteration
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 427, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 227, in __iter__
    for idx in self.sampler:
  File "/media/jizhi2/软件/yzy/KEAR/utils/resumable_sampler.py", line 31, in __iter__
    assert len(self.perm) == self.total_size
AssertionError
Traceback (most recent call last):
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/home/jizhi2/.conda/envs/yzy-KEAR/bin/python', '-u', 'task.py', '--local_rank=1', '--append_descr', '1', '--data_version', 'csqa_ret_3datasets', '--lr', '1e-5', '--append_answer_text', '1', '--weight_decay', '0.01', '--preset_model_type', 'electra', '--batch_size', '2', '--max_seq_length', '50', '--num_train_epochs', '10', '--save_interval_step', '2', '--continue_train', '--print_number_per_epoch', '2', '--vary_segment_id', '--seed', '42', '--warmup_proportion', '0.1', '--optimizer_type', 'adamw', '--ddp', '--print_loss_step', '10', '--clear_output_folder']' returned non-zero exit status 1.

I debugged and found that the value of len(self.perm) is 9741 and the value of self.total_size is 9742. What is the reason for this?

xycforgithub commented 2 years ago

Thanks for pointing that out! We just fixed the bug, which should resolve the problem. For multi-GPU training we pad the data so that its size is a multiple of the number of GPUs, so len(self.perm) should be 9742.
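
For readers following along, the padding described above can be sketched roughly as follows. This is a minimal illustration, not KEAR's actual sampler code: the helper names `pad_to_multiple` and `shard` are hypothetical, and padding by repeating the first indices is an assumption about how the extra sample is chosen.

```python
def pad_to_multiple(indices, world_size):
    """Repeat leading indices until len(indices) is divisible by world_size.

    Hypothetical helper; assumes len(indices) >= world_size.
    """
    remainder = len(indices) % world_size
    if remainder == 0:
        return list(indices)
    pad = world_size - remainder
    return list(indices) + list(indices[:pad])


def shard(indices, rank, world_size):
    """Give each rank every world_size-th index, starting at its own rank."""
    return indices[rank::world_size]


# 9741 training examples on 2 GPUs, as in the log above.
perm = list(range(9741))
padded = pad_to_multiple(perm, world_size=2)
print(len(perm), len(padded))
```

With 9741 examples and 2 GPUs, one index is repeated so that both ranks receive the same number of samples, which would explain why the sampler expects a total size of 9742.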

yuanzhiyong1999 commented 2 years ago

Thank you for your reply, I reran the latest code and a new problem appeared:

(yzy-KEAR) jizhi2@jizhi2-MS-7A78:/media/jizhi2/软件/yzy/KEAR$ bash/task_train.sh
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
start is 1660199195.9345322
[1877331] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '0', 'WORLD_SIZE': '2'}
start is 1660199196.0026214
[1877332] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '1', 'WORLD_SIZE': '2'}
[1877332]: world_size = 2, rank = 1, backend=nccl
[1877331]: world_size = 2, rank = 0, backend=nccl
batch size: 2, total_batch_size: 10
batch size: 2, total_batch_size: 10
clearing output folder.
args.fp16 is 0
args.fp16 is 0
load_vocab google/electra-large-discriminator
load_vocab google/electra-large-discriminator
load_data data/csqa_ret_3datasets/train_data.json
load_data data/csqa_ret_3datasets/train_data.json
data: 9742, world_size: 2
load_data data/csqa_ret_3datasets/dev_data.json
data: 9742, world_size: 2
load_data data/csqa_ret_3datasets/dev_data.json
data: 1222, world_size: 2
get dir test/
make dataloader ...
data: 1222, world_size: 2
get dir test/
make dataloader ...
max len: 200
95 percent len: 98
train_data 9742
total length: 2436
max len: 200
95 percent len: 98
train_data 9742
total length: 2436
max len: 168
95 percent len: 97
devlp_data 1222
init_model google/electra-large-discriminator
set config, model_type= electra
deepspeed: False
resume_training: False
max len: 168
95 percent len: 97
devlp_data 1222
init_model google/electra-large-discriminator
set config, model_type= electra
deepspeed: False
resume_training: False
model_type= electra
model_type= electra
init model finished.
init model finished.
Some weights of the model checkpoint at google/electra-large-discriminator were not used when initializing Model: ['discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias']
- This IS expected if you are initializing Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Model were not initialized from the model checkpoint at google/electra-large-discriminator and are newly initialized: ['scorer.csqa_ret_3datasets.scorer.bias', 'scorer.csqa_ret_3datasets.scorer.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of the model checkpoint at google/electra-large-discriminator were not used when initializing Model: ['discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.weight', 'discriminator_predictions.dense.bias']
- This IS expected if you are initializing Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Model were not initialized from the model checkpoint at google/electra-large-discriminator and are newly initialized: ['scorer.csqa_ret_3datasets.scorer.bias', 'scorer.csqa_ret_3datasets.scorer.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2022-08-11 14:27:05,046 - __main__ - INFO - initializing trainer.
2022-08-11 14:27:05,046 - __main__ - INFO - initializing trainer.
Trainer: fp16 is 0
2022-08-11 14:27:05,730 - __main__ - INFO - initialize trainer finished.
2022-08-11 14:27:05,730 - __main__ - INFO - setting up optimizer
Trainer: fp16 is 0
2022-08-11 14:27:05,730 - __main__ - INFO - initialize trainer finished.
2022-08-11 14:27:05,730 - __main__ - INFO - setting up optimizer
/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
2022-08-11 14:27:05,736 - __main__ - INFO - deepspeed wrap
finish deepspeed wrap
2022-08-11 14:27:05,737 - __main__ - INFO - deepspeed wrap
finish deepspeed wrap
load successfully.
load successfully.
2022-08-11 14:27:05,739 - utils.trainer - INFO - total n_step = 2436, evaluate_step = 1218
---- Epoch: 01 ----
2022-08-11 14:27:05,739 - utils.trainer - INFO - total n_step = 2436, evaluate_step = 1218
---- Epoch: 01 ----
Traceback (most recent call last):
  File "task.py", line 410, in <module>
    srt.train(train_dataloader, devlp_dataloaders, save_last=False, save_every=args.save_every)
  File "task.py", line 93, in train
    self.trainer.train(
  File "/media/jizhi2/软件/yzy/KEAR/utils/trainer.py", line 83, in train
    loss = self._step(batch)
  File "/media/jizhi2/软件/yzy/KEAR/utils/trainer.py", line 110, in _step
    loss, _ = self._forward(batch, self.train_record, mode='train', dataset_name=self.config.data_version)
  File "/media/jizhi2/软件/yzy/KEAR/utils/trainer.py", line 166, in _forward
    batch = clip_batch(batch)
  File "/media/jizhi2/软件/yzy/KEAR/utils/tensor.py", line 25, in clip_batch
    this_loc = input_ids[i, :, -1].any() if num_dim == 3 else input_ids[i, -1]
RuntimeError: all only supports torch.uint8 and torch.bool dtypes
Traceback (most recent call last):
  File "task.py", line 410, in <module>
    srt.train(train_dataloader, devlp_dataloaders, save_last=False, save_every=args.save_every)
  File "task.py", line 93, in train
    self.trainer.train(
  File "/media/jizhi2/软件/yzy/KEAR/utils/trainer.py", line 83, in train
    loss = self._step(batch)
  File "/media/jizhi2/软件/yzy/KEAR/utils/trainer.py", line 110, in _step
    loss, _ = self._forward(batch, self.train_record, mode='train', dataset_name=self.config.data_version)
  File "/media/jizhi2/软件/yzy/KEAR/utils/trainer.py", line 166, in _forward
    batch = clip_batch(batch)
  File "/media/jizhi2/软件/yzy/KEAR/utils/tensor.py", line 25, in clip_batch
    this_loc = input_ids[i, :, -1].any() if num_dim == 3 else input_ids[i, -1]
RuntimeError: all only supports torch.uint8 and torch.bool dtypes
Traceback (most recent call last):
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/home/jizhi2/.conda/envs/yzy-KEAR/bin/python', '-u', 'task.py', '--local_rank=1', '--append_descr', '1', '--data_version', 'csqa_ret_3datasets', '--lr', '1e-5', '--append_answer_text', '1', '--weight_decay', '0.01', '--preset_model_type', 'electra', '--batch_size', '2', '--max_seq_length', '50', '--num_train_epochs', '10', '--save_interval_step', '2', '--continue_train', '--print_number_per_epoch', '2', '--vary_segment_id', '--seed', '42', '--warmup_proportion', '0.1', '--optimizer_type', 'adamw', '--ddp', '--print_loss_step', '10', '--clear_output_folder']' returned non-zero exit status 1.

xycforgithub commented 2 years ago

We have not encountered this problem. It looks like it could be a problem with the PyTorch version. Which version of PyTorch are you using? We recommend using the attached Docker image (see the quickstart in the README) to reproduce the results.

yuanzhiyong1999 commented 2 years ago

I'm currently using PyTorch 1.7.1, and I can't use Docker at the moment due to some limitations.

xycforgithub commented 2 years ago

Can you try PyTorch 1.8.0 and transformers 4.10.2? Without the Docker environment it is hard to debug the problem. To avoid the error, you can also simply comment out the "clip_batch" references (it is only a speed optimization).
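
As a rough illustration of why skipping "clip_batch" does not change results: judging only from the traceback, the function appears to check whether the last column of input_ids holds any non-padding token, presumably so that trailing all-padding columns can be trimmed. The sketch below shows that idea on plain Python lists instead of tensors. The function name `clip_batch_rows` and the pad id of 0 are assumptions, not KEAR's implementation.

```python
def clip_batch_rows(input_ids, pad_id=0):
    """Drop trailing columns that contain only pad_id in every row.

    Hypothetical stand-in for a tensor-based clip step; pad_id=0 is assumed.
    """
    if not input_ids:
        return input_ids
    width = len(input_ids[0])
    # Shrink the width while the last remaining column is all padding.
    while width > 0 and all(row[width - 1] == pad_id for row in input_ids):
        width -= 1
    return [row[:width] for row in input_ids]


batch = [
    [101, 7, 8, 0, 0],
    [101, 9, 0, 0, 0],
]
clipped = clip_batch_rows(batch)
print(clipped)
```

Only padding is removed, so the model sees the same tokens either way. On tensors, an explicit comparison such as `(x != 0).any()` produces a bool tensor first, which is the dtype the error message asks for, so it would likely sidestep the dtype restriction on older PyTorch versions. That variant is untested against this repository.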

yuanzhiyong1999 commented 2 years ago

I commented out the "clip_batch" and it works, thank you very much for your help!

yuanzhiyong1999 commented 2 years ago

Hello, when I run the following command from the "task_train.sh" file:

deepspeed task.py --append_descr 1 --append_triples --append_retrieval 1 --data_version csqa_ret_3datasets --lr 4e-6 --append_answer_text 1 --weight_decay 0.1 --preset_model_type debertav3 --batch_size 4 --max_seq_length 512 --num_train_epochs 15 --save_interval_step 4 --continue_train --print_number_per_epoch 1 --vary_segment_id --seed 42 --warmup_proportion 0.1 --optimizer_type adamw --ddp --deepspeed --freq_rel 1

I get the following error:

2022-08-13 10:13:29,347 - utils.trainer - INFO - total n_step = 1218, evaluate_step = 1218
---- Epoch: 01 ----
2022-08-13 10:13:29,348 - utils.trainer - INFO - total n_step = 1218, evaluate_step = 1218
---- Epoch: 01 ----
Traceback (most recent call last):
  File "task.py", line 410, in <module>
    srt.train(train_dataloader, devlp_dataloaders, save_last=False, save_every=args.save_every)
  File "task.py", line 93, in train
    self.trainer.train(
  File "/media/jizhi2/软件/yzy/KEAR/utils/trainer.py", line 81, in train
    for step, batch in enumerate(train_looper):
  File "/media/jizhi2/软件/yzy/KEAR/utils/dataloader_sampler.py", line 32, in __iter__
    batch = next(self.dataloader_iter)
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/media/jizhi2/软件/yzy/KEAR/specific/tensor.py", line 71, in __getitem__
    data = self.get_example(idx)
  File "/media/jizhi2/软件/yzy/KEAR/specific/tensor.py", line 89, in get_example
    features.append(Feature.make_single(example.idx, main_tokens, context_tokens, self.tokenizer,
UnboundLocalError: local variable 'context_tokens' referenced before assignment
Traceback (most recent call last):
  File "task.py", line 410, in <module>
    srt.train(train_dataloader, devlp_dataloaders, save_last=False, save_every=args.save_every)
  File "task.py", line 93, in train
    self.trainer.train(
  File "/media/jizhi2/软件/yzy/KEAR/utils/trainer.py", line 81, in train
    for step, batch in enumerate(train_looper):
  File "/media/jizhi2/软件/yzy/KEAR/utils/dataloader_sampler.py", line 32, in __iter__
    batch = next(self.dataloader_iter)
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/media/jizhi2/软件/yzy/KEAR/specific/tensor.py", line 71, in __getitem__
    data = self.get_example(idx)
  File "/media/jizhi2/软件/yzy/KEAR/specific/tensor.py", line 89, in get_example
    features.append(Feature.make_single(example.idx, main_tokens, context_tokens, self.tokenizer,
UnboundLocalError: local variable 'context_tokens' referenced before assignment
[2022-08-13 10:13:33,043] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 2641145
[2022-08-13 10:13:33,044] [INFO] [launch.py:286:sigkill_handler] Killing subprocess 2641146
[2022-08-13 10:13:33,070] [ERROR] [launch.py:292:sigkill_handler] ['/home/jizhi2/.conda/envs/yzy-KEAR/bin/python', '-u', 'task.py', '--local_rank=1', '--append_descr', '1', '--append_triples', '--append_retrieval', '1', '--data_version', 'csqa_ret_3datasets', '--lr', '4e-6', '--append_answer_text', '1', '--weight_decay', '0.1', '--preset_model_type', 'debertav3', '--batch_size', '4', '--max_seq_length', '512', '--num_train_epochs', '15', '--save_interval_step', '4', '--continue_train', '--print_number_per_epoch', '1', '--vary_segment_id', '--seed', '42', '--warmup_proportion', '0.1', '--optimizer_type', 'adamw', '--ddp', '--deepspeed', '--freq_rel', '1'] exits with return code = 1

What is the reason for this, please?

xycforgithub commented 2 years ago

Have you pulled the latest version? Our latest version should not have this problem any more. If you check the implementation at https://github.com/microsoft/KEAR/blob/7376a3d190e5c04d5da9b99873abe621ae562edf/specific/tensor.py#L89 you will see that context_tokens is always defined. The old version might have the problem you mentioned.

yuanzhiyong1999 commented 2 years ago

I'm currently using the latest version of the code and I don't know why this problem still occurs. I also have another question: if I don't use deepspeed, which changes do I need to make, and how extensive are they?

xycforgithub commented 2 years ago

Can you check the code here: https://github.com/microsoft/KEAR/blob/7376a3d190e5c04d5da9b99873abe621ae562edf/specific/tensor.py#L83 Is it "if" or "elif"? If it is "elif", as in the old version, the above error will appear. If your code is indeed up to date, you may want to step through those lines in tensor.py to see what happens, since the error is unlikely to occur there. Our code supports PyTorch DDP as an alternative to deepspeed. You can check out the example here: https://github.com/microsoft/KEAR/blob/7376a3d190e5c04d5da9b99873abe621ae562edf/bash/task_train.sh#L13
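
To illustrate the "if" versus "elif" point, here is a minimal sketch of how an "elif" chain can leave a variable unassigned when several options are enabled at once. The function names, flags, and token values are hypothetical and only mimic the shape of the problem. They are not taken from specific/tensor.py.

```python
def get_tokens_elif(append_descr, append_retrieval):
    """Hypothetical old-style chain: only one branch can ever run."""
    main_tokens = ["question"]
    if append_descr:
        main_tokens.append("description")
    elif append_retrieval:
        context_tokens = ["retrieved"]
    # If append_descr is true, the elif branch is skipped and
    # context_tokens was never assigned, so this line raises.
    return main_tokens, context_tokens


def get_tokens_if(append_descr, append_retrieval):
    """Hypothetical fixed version: default value plus independent checks."""
    main_tokens = ["question"]
    context_tokens = []
    if append_descr:
        main_tokens.append("description")
    if append_retrieval:
        context_tokens = ["retrieved"]
    return main_tokens, context_tokens


try:
    get_tokens_elif(append_descr=True, append_retrieval=True)
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__

print(error_name)
print(get_tokens_if(append_descr=True, append_retrieval=True))
```

With both flags set, as in the failing command, the "elif" version fails with UnboundLocalError while the "if" version returns both token lists.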