Closed aldopareja closed 1 year ago
Training with `codet5_base` works fine, but the same command with `codet5-large` fails with the error below. How can I fix this?
Command:

```shell
CUDA_VISIBLE_DEVICES=0 python /home/aldo/CodeT5/run_gen.py --do_train --do_eval --do_eval_bleu --do_test \
  --task concode --sub_task none --model_type codet5 --data_num 100 --num_train_epochs 1 \
  --warmup_steps 10 --learning_rate 10e-5 --patience 3 \
  --tokenizer_name=Salesforce/codet5-large --model_name_or_path=Salesforce/codet5-large \
  --data_dir /home/aldo/CodeT5/data \
  --cache_path saved_large/concode/codet5_large_100_lr10_bs8_src320_trg150_pat3_e1/cache_data \
  --output_dir saved_large/concode/codet5_large_100_lr10_bs8_src320_trg150_pat3_e1 \
  --summary_dir tensorboard --save_last_checkpoints --always_save_model \
  --res_dir saved_large/concode/codet5_large_100_lr10_bs8_src320_trg150_pat3_e1/prediction \
  --res_fn results/concode_codet5_base.txt \
  --train_batch_size 8 --eval_batch_size 8 --max_source_length 320 --max_target_length 150 \
  2>&1 | tee saved_large/concode/codet5_large_100_lr10_bs8_src320_trg150_pat3_e1/train.log
```

Log:

```
02/16/2023 21:59:27 - INFO - __main__ - Namespace(adam_epsilon=1e-08, add_lang_ids=False, add_task_prefix=False, always_save_model=True, beam_size=10, cache_path='saved_large/concode/codet5_large_100_lr10_bs8_src320_trg150_pat3_e1/cache_data', config_name='', data_dir='/home/aldo/CodeT5/data', data_num=100, dev_filename=None, do_eval=True, do_eval_bleu=True, do_lower_case=False, do_test=True, do_train=True, eval_batch_size=8, eval_steps=-1, eval_task='', gradient_accumulation_steps=1, lang='java', learning_rate=0.0001, load_model_path=None, local_rank=-1, log_steps=-1, max_grad_norm=1.0, max_source_length=320, max_steps=-1, max_target_length=150, model_name_or_path='Salesforce/codet5-large', model_type='codet5', no_cuda=False, num_train_epochs=1, output_dir='saved_large/concode/codet5_large_100_lr10_bs8_src320_trg150_pat3_e1', patience=3, res_dir='saved_large/concode/codet5_large_100_lr10_bs8_src320_trg150_pat3_e1/prediction', res_fn='results/concode_codet5_base.txt', save_last_checkpoints=True, save_steps=-1, seed=1234, start_epoch=0, sub_task='none', summary_dir='tensorboard', task='concode', test_filename=None, tokenizer_name='Salesforce/codet5-large', train_batch_size=8, train_filename=None, train_steps=-1, warmup_steps=10, weight_decay=0.0)
02/16/2023 21:59:27 - WARNING - configs - Process rank: -1, device: cuda, n_gpu: 1, distributed training: False, cpu count: 96
02/16/2023 21:59:40 - INFO - models - Finish loading model [738M] from Salesforce/codet5-large
02/16/2023 21:59:45 - INFO - utils - Read 100 examples, avg src len: 68, avg trg len: 26, max src len: 249, max trg len: 82
02/16/2023 21:59:45 - INFO - utils - [TOKENIZE] avg src len: 202, avg trg len: 33, max src len: 741, max trg len: 104
02/16/2023 21:59:45 - INFO - utils - Create cache data into saved_large/concode/codet5_large_100_lr10_bs8_src320_trg150_pat3_e1/cache_data/train_100.pt
100%|##########| 100/100 [00:02<00:00, 47.08it/s]
02/16/2023 21:59:47 - INFO - __main__ - ***** Running training *****
02/16/2023 21:59:47 - INFO - __main__ -   Num examples = 100
02/16/2023 21:59:47 - INFO - __main__ -   Batch size = 8
02/16/2023 21:59:47 - INFO - __main__ -   Batch num = 13
[0] Train loss 11.713: 100%|##########| 13/13 [00:12<00:00,  1.05it/s]
02/16/2023 21:59:59 - INFO - utils - Read 100 examples, avg src len: 69, avg trg len: 31, max src len: 193, max trg len: 100
02/16/2023 21:59:59 - INFO - utils - Create cache data into saved_large/concode/codet5_large_100_lr10_bs8_src320_trg150_pat3_e1/cache_data/dev_100.pt
100%|##########| 100/100 [00:02<00:00, 48.75it/s]
02/16/2023 22:00:01 - INFO - __main__ - ***** Running ppl evaluation *****
02/16/2023 22:00:01 - INFO - __main__ -   Num examples = 100
02/16/2023 22:00:01 - INFO - __main__ -   Batch size = 8
Eval ppl: 100%|##########| 13/13 [00:03<00:00,  3.57it/s]
02/16/2023 22:00:05 - INFO - __main__ -   epoch = 0
02/16/2023 22:00:05 - INFO - __main__ -   eval_ppl = 26.63935
02/16/2023 22:00:05 - INFO - __main__ -   global_step = 13
02/16/2023 22:00:05 - INFO - __main__ -   ********************
02/16/2023 22:00:13 - INFO - __main__ - Save the last model into saved_large/concode/codet5_large_100_lr10_bs8_src320_trg150_pat3_e1/checkpoint-last/pytorch_model.bin
02/16/2023 22:00:13 - INFO - __main__ -   Best ppl:26.63935
02/16/2023 22:00:13 - INFO - __main__ -   ********************
02/16/2023 22:00:20 - INFO - __main__ - Save the best ppl model into saved_large/concode/codet5_large_100_lr10_bs8_src320_trg150_pat3_e1/checkpoint-best-ppl/pytorch_model.bin
02/16/2023 22:00:20 - INFO - __main__ - ***** CUDA.empty_cache() *****
02/16/2023 22:00:20 - INFO - utils - Read 100 examples, avg src len: 69, avg trg len: 31, max src len: 193, max trg len: 100
02/16/2023 22:00:20 - INFO - utils - Sample 5k data for computing bleu from /home/aldo/CodeT5/data/concode/dev.json
100%|##########| 100/100 [00:02<00:00, 48.45it/s]
02/16/2023 22:00:22 - INFO - __main__ - ***** Running bleu evaluation on dev data*****
02/16/2023 22:00:22 - INFO - __main__ -   Num examples = 100
02/16/2023 22:00:22 - INFO - __main__ -   Batch size = 8
Eval bleu for dev set: 100%|##########| 13/13 [02:01<00:00,  9.36s/it]
```

Traceback:

```
Traceback (most recent call last):
  File "/home/aldo/CodeT5/run_gen.py", line 388, in <module>
    main()
  File "/home/aldo/CodeT5/run_gen.py", line 315, in main
    result = eval_bleu_epoch(args, eval_data, eval_examples, model, tokenizer, 'dev', 'e%d' % cur_epoch)
  File "/home/aldo/CodeT5/run_gen.py", line 153, in eval_bleu_epoch
    codebleu = calc_code_bleu.get_codebleu(gold_fn, output_fn, args.lang)
  File "/home/aldo/CodeT5/evaluator/CodeBLEU/calc_code_bleu.py", line 21, in get_codebleu
    assert len(hypothesis) == len(pre_references[i])
AssertionError
```
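For context on where this fails: the assertion in `calc_code_bleu.get_codebleu` compares the number of lines in the hypothesis file against the gold file, so it fires whenever the two files end up with different line counts. One plausible cause (an assumption, not confirmed from this log) is a generated prediction that contains an embedded newline, which adds an extra line when the predictions are written out. The sketch below is hypothetical (`write_predictions` and `hyp.txt` are illustrative names, not part of CodeT5) and shows how flattening newlines before writing keeps the line counts aligned:

```python
# Hypothetical sketch: keep one prediction per line so a line-count
# assertion like `len(hypothesis) == len(pre_references[i])` cannot fail
# due to embedded newlines. Not the actual CodeT5 code.

def write_predictions(preds, path):
    """Write each prediction on exactly one line, flattening any newlines."""
    with open(path, "w") as f:
        for p in preds:
            # An embedded "\n" would otherwise split one prediction into
            # two lines and desynchronize it from the gold file.
            f.write(p.replace("\n", " ").strip() + "\n")

# The first prediction contains a newline, mimicking a multi-line generation.
preds = ["public void foo() {\n}", "return x ;"]
write_predictions(preds, "hyp.txt")

with open("hyp.txt") as f:
    lines = f.read().splitlines()

# Line count now matches the number of predictions.
assert len(lines) == len(preds)
```

If the counts still differ after sanitizing, the mismatch is more likely between the sampled gold file and the reference data (note the `Sample 5k data for computing bleu` log line against only 100 dev examples), so comparing `wc -l` on the generated `.output` and `.gold` files in the prediction directory would be a reasonable first diagnostic.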