microsoft / CodeXGLUE

AssertionError: 1382 sample gt_str != true_gt #98

Closed · changranelk closed this issue 2 years ago

changranelk commented 2 years ago

Hi there! Happy new year!

When running token-level code completion (Java), after fine-tuning finishes and the model checkpoint is saved, evaluation and inference fail with the following error message.

01/05/2022 10:45:21 - INFO - __main__ -   3855034, 0.7707994274499265
01/05/2022 10:48:48 - INFO - __main__ -   400 are done!
01/05/2022 10:48:48 - INFO - __main__ -   5134001, 0.7672330020971948
Traceback (most recent call last):
  File "run_lm.py", line 715, in <module>
    main()
  File "run_lm.py", line 710, in main
    test_total, test_cr = eval_acc(args, model, tokenizer, 'test')
  File "run_lm.py", line 459, in eval_acc
    total_samples = post_process(args, total_pred, total_gt, open(os.path.join(args.data_dir, f"{file_type}.txt")).readlines(), saved_file)
  File "run_lm.py", line 478, in post_process
    assert gt_str == true_gts[cnt].strip(), f"{cnt} sample gt_str != true_gt"
AssertionError: 1382 sample gt_str != true_gt
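
For context, the cnt in the assertion message appears to be a 0-based index into test.txt, so a quick way to look at the offending sample is something like this (path taken from the fine-tuning command below; a minimal inspection sketch, not code from the repo):

with open("../dataset/javaCorpus/token_completion/test.txt") as f:
    lines = f.readlines()

# Sample 1382 in the assert message should be the 1383rd line of the file.
print(repr(lines[1382].strip()))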

The fine-tuning command I used was

LANG=java                       # set python for py150
DATADIR=../dataset/javaCorpus/token_completion
LITFILE=../dataset/javaCorpus/literals.json
OUTPUTDIR=../save/javaCorpus
PRETRAINDIR=microsoft/CodeGPT-small-java        # microsoft/CodeGPT-small-py for py150
LOGFILE=completion_javaCorpus.log
PER_NODE_GPU=1       # modify YOUR_GPU_NUM

CUDA_VISIBLE_DEVICES=2 python -m torch.distributed.launch --nproc_per_node=$PER_NODE_GPU run_lm.py \
        --data_dir=$DATADIR \
        --lit_file=$LITFILE \
        --langs=$LANG \
        --output_dir=$OUTPUTDIR \
        --pretrain_dir=$PRETRAINDIR \
        --log_file=$LOGFILE \
        --model_type=gpt2 \
        --block_size=1024 \
        --do_train \
        --gpu_per_node $PER_NODE_GPU \
        --learning_rate=8e-5 \
        --weight_decay=0.01 \
        --evaluate_during_training \
        --per_gpu_train_batch_size=1 \
        --per_gpu_eval_batch_size=2 \
        --gradient_accumulation_steps=8 \
        --num_train_epochs=5 \
        --logging_steps=100 \
        --save_steps=1000 \
        --seed=42 \
        --overwrite_output_dir \
        --not_pretrain

The evaluation and inference command I used was

export CUDA_VISIBLE_DEVICES=2
LANG=java                       # set python for py150
DATADIR=../dataset/javaCorpus/token_completion
LITFILE=../dataset/javaCorpus/literals.json
OUTPUTDIR=../save/javaCorpus
PRETRAINDIR=../save/javaCorpus/checkpoint-3000-3.3398       # directory of your saved model
LOGFILE=completion_javaCorpus_eval.log

python -u run_lm.py \
        --data_dir=$DATADIR \
        --lit_file=$LITFILE \
        --langs=$LANG \
        --output_dir=$OUTPUTDIR \
        --pretrain_dir=$PRETRAINDIR \
        --log_file=$LOGFILE \
        --model_type=gpt2 \
        --block_size=1024 \
        --do_eval \
        --per_gpu_eval_batch_size=16 \
        --logging_steps=100 \
        --seed=42 

Any comments or suggestions will be appreciated, thanks in advance!

celbree commented 2 years ago

Hi, I checked the sample and found that the ground truth is "<s>  </s>" (two spaces), while after post_process it becomes "<s> </s>" (one space), which raises the assertion error. I have pushed a commit to fix this bug in data processing. You can re-download the dataset and re-run the preprocessing script, or just delete the 1383rd line in test.txt, as this sample is meaningless and is omitted by the new script.
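
To make the failure mode concrete, here is a minimal sketch of why a double space can never survive a split-and-rejoin round trip (illustrative names, not the repo's actual post_process code):

true_gt = "<s>  </s>"               # raw line from test.txt (two spaces)
gt_str = " ".join(true_gt.split())  # tokens rejoined with a single space
print(gt_str == true_gt.strip())    # False -> this is what trips the assert

If you go the quick-fix route instead of re-downloading, something like sed -i '1383d' test.txt (GNU sed; sed line numbers are 1-based) drops that sample; back the file up first.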

changranelk commented 2 years ago

awesome man, thanks for the patient explanation!

wannita901 commented 2 years ago

Hi, could you re-open this? I downloaded the repo last week and found a similar issue, but with the py150 dataset.

02/23/2022 15:15:50 - INFO - __main__ -   3200 are done!
02/23/2022 15:15:50 - INFO - __main__ -   33962002, 0.7616674364485344
02/23/2022 15:16:39 - INFO - __main__ -   3300 are done!
02/23/2022 15:16:39 - INFO - __main__ -   35024275, 0.7621001719521675
02/23/2022 15:17:28 - INFO - __main__ -   3400 are done!
02/23/2022 15:17:28 - INFO - __main__ -   36095119, 0.7619678162025175
02/23/2022 15:18:18 - INFO - __main__ -   3500 are done!
02/23/2022 15:18:18 - INFO - __main__ -   37150743, 0.7621929122655771
Traceback (most recent call last):
  File "run_lm.py", line 715, in <module>
    main()
  File "run_lm.py", line 710, in main
    test_total, test_cr = eval_acc(args, model, tokenizer, 'test')
  File "run_lm.py", line 459, in eval_acc
    total_samples = post_process(args, total_pred, total_gt, open(os.path.join(args.data_dir, f"{file_type}.txt")).readlines(), saved_file)
  File "run_lm.py", line 478, in post_process
    assert gt_str == true_gts[cnt].strip(), f"{cnt} sample gt_str != true_gt"
AssertionError: 28531 sample gt_str != true_gt

My evaluation command was

export CUDA_VISIBLE_DEVICES=0
LANG=python                       # set python for py150
DATADIR=../dataset/py150/token_completion
LITFILE=../dataset/py150/literals.json
OUTPUTDIR=../save/py150
PRETRAINDIR=../save/py150/checkpoint-last       # directory of your saved model
LOGFILE=completion_py150_eval.log

python -u run_lm.py \
        --data_dir=$DATADIR \
        --lit_file=$LITFILE \
        --langs=$LANG \
        --output_dir=$OUTPUTDIR \
        --pretrain_dir=$PRETRAINDIR \
        --log_file=$LOGFILE \
        --model_type=gpt2 \
        --block_size=1024 \
        --do_eval \
        --per_gpu_eval_batch_size=16 \
        --logging_steps=100 \
        --seed=42 
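
If this has the same cause as the javaCorpus case above, the offending sample should be the 28532nd line of py150's test.txt, since the cnt in the assert looks 0-based. A sketch of the same line-deletion workaround, assuming that indexing holds (back up the file first):

path = "../dataset/py150/token_completion/test.txt"
with open(path) as f:
    lines = f.readlines()

del lines[28531]  # 0-based index reported by the AssertionError

with open(path, "w") as f:
    f.writelines(lines)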

Any suggestions will be appreciated. Thanks!