timoschick / pet

This repository contains the code for "Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference"
https://arxiv.org/abs/2001.07676
Apache License 2.0

RuntimeError: copy_if failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure when training #5

Closed · theashworld closed this issue 3 years ago

theashworld commented 3 years ago

Fresh directory, synced the repo, and did the pip install from requirements.txt.

Command line:

python3 cli.py --method pet --pattern_ids 0 1 2 3 --data_dir MNLI/ --model_type roberta --model_name_or_path roberta-large --task_name mnli --output_dir out2 --do_train --do_eval

Error:

Evaluating:  20%|█████████████▍                                                     | 9820/49088 [24:35<1:38:18,  6.66it/s]
Traceback (most recent call last):
  File "cli.py", line 282, in <module>
    main()
  File "cli.py", line 263, in main
    no_distillation=args.no_distillation, seed=args.seed)
  File "/home/qblocks/shan/pet/pet/modeling.py", line 249, in train_pet
    save_unlabeled_logits=not no_distillation, seed=seed)
  File "/home/qblocks/shan/pet/pet/modeling.py", line 355, in train_pet_ensemble
    unlabeled_data=unlabeled_data))
  File "/home/qblocks/shan/pet/pet/modeling.py", line 434, in train_single_model
    results_dict['train_set_before_training'] = evaluate(model, train_data, eval_config)['scores']['acc']
  File "/home/qblocks/shan/pet/pet/modeling.py", line 490, in evaluate
    n_gpu=config.n_gpu, decoding_strategy=config.decoding_strategy, priming=config.priming)
  File "/home/qblocks/shan/pet/pet/wrapper.py", line 376, in eval
    logits = EVALUATION_STEP_FUNCTIONS[self.config.wrapper_type](self)(batch)
  File "/home/qblocks/shan/pet/pet/wrapper.py", line 525, in mlm_eval_step
    return self.preprocessor.pvp.convert_mlm_logits_to_cls_logits(batch['mlm_labels'], outputs[0])
  File "/home/qblocks/shan/pet/pet/pvp.py", line 207, in convert_mlm_logits_to_cls_logits
    masked_logits = logits[mlm_labels >= 0]
RuntimeError: copy_if failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure
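
A note on the failing line: cudaErrorLaunchFailure is reported asynchronously, so the Python frame shown above (the boolean-mask indexing in pvp.py) is where the error surfaced, not necessarily where the kernel actually failed. Rerunning with CUDA_LAUNCH_BLOCKING=1 python3 cli.py ... makes kernel launches synchronous and usually points at the real culprit. The indexing itself is easy to sanity-check on CPU; a minimal sketch with made-up shapes (illustrative only, not code from this repo):

import torch

# Same operation as pvp.py line 207, run on CPU with illustrative shapes.
batch_size, seq_len, vocab_size = 4, 16, 50265  # roberta-large vocabulary
logits = torch.randn(batch_size, seq_len, vocab_size)
mlm_labels = torch.full((batch_size, seq_len), -1, dtype=torch.long)
mlm_labels[:, 3] = 1  # pretend each sequence has one mask position

masked_logits = logits[mlm_labels >= 0]  # select logits at mask positions
print(masked_logits.shape)  # torch.Size([4, 50265])

If this runs cleanly on CPU, the indexing logic is fine and the failure lies somewhere in the CUDA stack (driver, memory, or hardware).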
theashworld commented 3 years ago

I also tried with roberta-base; same issue.

Evaluating:  53%|███████████████████████████████████▊                                | 25819/49088 [25:25<22:55, 16.92it/s]
Traceback (most recent call last):
  File "cli.py", line 282, in <module>
    main()
  File "cli.py", line 263, in main
    no_distillation=args.no_distillation, seed=args.seed)
  File "/home/qblocks/shan/pet/pet/modeling.py", line 249, in train_pet
    save_unlabeled_logits=not no_distillation, seed=seed)
  File "/home/qblocks/shan/pet/pet/modeling.py", line 355, in train_pet_ensemble
    unlabeled_data=unlabeled_data))
  File "/home/qblocks/shan/pet/pet/modeling.py", line 434, in train_single_model
    results_dict['train_set_before_training'] = evaluate(model, train_data, eval_config)['scores']['acc']
  File "/home/qblocks/shan/pet/pet/modeling.py", line 490, in evaluate
    n_gpu=config.n_gpu, decoding_strategy=config.decoding_strategy, priming=config.priming)
  File "/home/qblocks/shan/pet/pet/wrapper.py", line 376, in eval
    logits = EVALUATION_STEP_FUNCTIONS[self.config.wrapper_type](self)(batch)
  File "/home/qblocks/shan/pet/pet/wrapper.py", line 525, in mlm_eval_step
    return self.preprocessor.pvp.convert_mlm_logits_to_cls_logits(batch['mlm_labels'], outputs[0])
  File "/home/qblocks/shan/pet/pet/pvp.py", line 207, in convert_mlm_logits_to_cls_logits
    masked_logits = logits[mlm_labels >= 0]
RuntimeError: copy_if failed to synchronize: cudaErrorLaunchFailure: unspecified launch failure
Segmentation fault (core dumped)
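
Since the same crash occurs with both roberta-large and roberta-base, at different points in evaluation, and here ends in a segmentation fault, a flaky GPU or driver is a plausible cause. A quick sanity check (not part of pet) that stresses the card with a few large matrix multiplications:

import torch

# Hedged GPU health check: an unhealthy card or driver often fails here too.
assert torch.cuda.is_available(), "no CUDA device visible"
device = torch.device("cuda")
for _ in range(20):
    a = torch.randn(4096, 4096, device=device)
    b = torch.randn(4096, 4096, device=device)
    c = a @ b
    torch.cuda.synchronize()  # force asynchronous launch errors to surface here
print("GPU sanity check passed:", c.sum().item())

If this script also dies with an unspecified launch failure, the problem is the machine rather than the training code.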
timoschick commented 3 years ago

Hi @theashworld, I'm on vacation this week but I'll take a look at this issue early next week.

timoschick commented 3 years ago

Hi @theashworld, I was unable to reproduce this issue so far. Could you check whether it works with a smaller training set, e.g. as follows:

python3 cli.py --method pet --pattern_ids 0 1 2 3 --data_dir MNLI/ --model_type roberta --model_name_or_path roberta-large --task_name mnli --output_dir out2 --do_train --do_eval --train_examples 100 --unlabeled_examples 30000 --split_examples_evenly
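
If I read the flags right, --train_examples 100 and --unlabeled_examples 30000 cap the sizes of the training and unlabeled sets, and --split_examples_evenly balances the subsample across labels so that a tiny training set still covers all three MNLI classes. A rough sketch of the even-split idea (illustrative only, not pet's implementation; the helper name and example dicts are made up):

import random
from collections import defaultdict

def split_evenly(examples, num_examples, labels):
    """Keep at most num_examples items, balanced across the given labels."""
    per_label = num_examples // len(labels)
    kept, counts = [], defaultdict(int)
    for ex in examples:
        if counts[ex["label"]] < per_label:
            kept.append(ex)
            counts[ex["label"]] += 1
    return kept

# MNLI has three labels, so 100 examples -> at most 33 per class.
labels = ["entailment", "neutral", "contradiction"]
data = [{"text": f"ex{i}", "label": random.choice(labels)} for i in range(1000)]
subsample = split_evenly(data, 100, labels)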
theashworld commented 3 years ago

Welcome back from vacation! It looks like the machine had issues; I tried on another machine and it is progressing fine. Closing for now, will reopen if needed. Thanks!