microsoft / CodeXGLUE

MIT License

'AssertionError: 28531 sample gt_str != true_gt' on Py150 #103

Closed wannita901 closed 2 years ago

wannita901 commented 2 years ago

Hi! I downloaded the repo two weeks ago and ran the token-level code completion task. The Java one is fully runnable. For Python, however, fine-tuning completed but evaluation failed with an error similar to #98.

02/23/2022 15:15:50 - INFO - main - 3200 are done!
02/23/2022 15:15:50 - INFO - main - 33962002, 0.7616674364485344
02/23/2022 15:16:39 - INFO - main - 3300 are done!
02/23/2022 15:16:39 - INFO - main - 35024275, 0.7621001719521675
02/23/2022 15:17:28 - INFO - main - 3400 are done!
02/23/2022 15:17:28 - INFO - main - 36095119, 0.7619678162025175
02/23/2022 15:18:18 - INFO - main - 3500 are done!
02/23/2022 15:18:18 - INFO - main - 37150743, 0.7621929122655771
Traceback (most recent call last):
  File "run_lm.py", line 715, in <module>
    main()
  File "run_lm.py", line 710, in main
    test_total, test_cr = eval_acc(args, model, tokenizer, 'test')
  File "run_lm.py", line 459, in eval_acc
    total_samples = post_process(args, total_pred, total_gt, open(os.path.join(args.data_dir, f"{file_type}.txt")).readlines(), saved_file)
  File "run_lm.py", line 478, in post_process
    assert gt_str == true_gts[cnt].strip(), f"{cnt} sample gt_str != true_gt"
AssertionError: 28531 sample gt_str != true_gt
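For context, the assertion that fires here is a per-sample alignment check: each ground-truth string reconstructed during evaluation must match the corresponding raw line of test.txt exactly. A minimal sketch of that check (the function name and the example data below are hypothetical; the real post_process in run_lm.py also rebuilds literal placeholder tokens before comparing):

```python
def post_process_check(reconstructed_gts, true_gts):
    """Assert each reconstructed ground-truth line matches test.txt exactly.

    Any mismatch means predictions and references have desynchronized,
    so evaluation aborts rather than silently scoring misaligned pairs.
    """
    for cnt, gt_str in enumerate(reconstructed_gts):
        assert gt_str == true_gts[cnt].strip(), f"{cnt} sample gt_str != true_gt"

# Hypothetical aligned data: the check passes silently.
sample_gts = ["x = <NUM_LIT>", "print ( <STR_LIT> )"]
raw_lines = ["x = <NUM_LIT>\n", "print ( <STR_LIT> )\n"]
post_process_check(sample_gts, raw_lines)
```

This is why deleting individual lines from test.txt cannot fix the error: removing a line shifts every subsequent index, so the assertion simply fails again at the next misaligned sample.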

I tried deleting the 28530th line in test.txt, and so on, but the error persists. My evaluation command was:

export CUDA_VISIBLE_DEVICES=0
LANG=python                       # set python for py150
DATADIR=../dataset/py150/token_completion
LITFILE=../dataset/py150/literals.json
OUTPUTDIR=../save/py150
PRETRAINDIR=../save/py150/checkpoint-last       # directory of your saved model
LOGFILE=completion_py150_eval.log

python -u run_lm.py \
        --data_dir=$DATADIR \
        --lit_file=$LITFILE \
        --langs=$LANG \
        --output_dir=$OUTPUTDIR \
        --pretrain_dir=$PRETRAINDIR \
        --log_file=$LOGFILE \
        --model_type=gpt2 \
        --block_size=1024 \
        --do_eval \
        --per_gpu_eval_batch_size=16 \
        --logging_steps=100 \
        --seed=42 

Any suggestion will be very appreciated. Thanks!

celbree commented 2 years ago

Hi, I reproduced this error. It happens on the 28532th line. It turns out to be a corner case in the preprocessing of <STR_LIT>. I have pushed a commit to fix it. You can download the newest preprocessing code, re-generate the dataset, and try again. Thank you for pointing this out.

wannita901 commented 2 years ago

Thank you for your response! One quick question: do I need to fine-tune the model again? After I downloaded the new preprocessing code, re-generated the dataset, and reran only the evaluation part, it seems I still get the same error message?

wannita901 commented 2 years ago

All good, I fine-tuned the model again and everything ran smoothly. Thank you so much!