tech-srl / code2vec

TensorFlow code for the neural network presented in the paper: "code2vec: Learning Distributed Representations of Code"
https://code2vec.org
MIT License
1.11k stars, 286 forks

PRECISION, RECALL, and F1 are always 0 when training #151

Open WQR53 opened 2 years ago

WQR53 commented 2 years ago

I use astminer to generate C data to feed code2vec. The dataset is from https://github.com/intel/neuro-vectorizer. However, precision, recall, and F1 are always zero during training. I ran train.sh with the command source train.sh and got the following output (partial):

2022-05-03 01:26:07,479 INFO     After 12 epochs -- top10_acc: [0.32272727 0.51818182 0.61363636 0.66818182 0.69090909 0.73636364
 0.75454545 0.76363636 0.76818182 0.79545455], precision: 0.0, recall: 0.0, F1: 0
2022-05-03 01:26:16,707 INFO     Average loss at batch 100: 0.008121,   throughput: 399 samples/sec
2022-05-03 01:26:26,561 INFO     Saved after 13 epochs in: models/try_c_large/saved_model_iter13
2022-05-03 01:26:26,628 INFO     Starting evaluation
2022-05-03 01:26:26,943 INFO     Done evaluating, epoch reached
2022-05-03 01:26:26,944 INFO     Evaluation time: 0H:0M:0S
2022-05-03 01:26:26,944 INFO     After 13 epochs -- top10_acc: [0.39545455 0.46363636 0.53636364 0.61363636 0.67272727 0.71818182
 0.74090909 0.75454545 0.77272727 0.79090909], precision: 0.0, recall: 0.0, F1: 0
2022-05-03 01:26:45,938 INFO     Saved after 14 epochs in: models/try_c_large/saved_model_iter14
2022-05-03 01:26:46,025 INFO     Starting evaluation
2022-05-03 01:26:46,338 INFO     Done evaluating, epoch reached
2022-05-03 01:26:46,338 INFO     Evaluation time: 0H:0M:0S
2022-05-03 01:26:46,339 INFO     After 14 epochs -- top10_acc: [0.5        0.60454545 0.62272727 0.67272727 0.71363636 0.75454545
 0.76818182 0.78636364 0.79090909 0.80909091], precision: 0.0, recall: 0.0, F1: 0
2022-05-03 01:27:04,274 INFO     Saved after 15 epochs in: models/try_c_large/saved_model_iter15
2022-05-03 01:27:04,381 INFO     Starting evaluation
2022-05-03 01:27:04,731 INFO     Done evaluating, epoch reached
2022-05-03 01:27:04,733 INFO     Evaluation time: 0H:0M:0S
2022-05-03 01:27:04,734 INFO     After 15 epochs -- top10_acc: [0.45       0.58181818 0.63636364 0.7        0.72272727 0.75454545
 0.76363636 0.78181818 0.81363636 0.83181818], precision: 0.0, recall: 0.0, F1: 0
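
The log above shows nonzero top-k accuracy next to zero precision, recall, and F1, so it may be worth looking at what the target labels in the data actually look like. The sketch below is a diagnostic only, not a confirmed cause and not code from the code2vec repository. It assumes the usual code2vec row layout (the label is the first space-separated field of each row) and it assumes, as a hypothesis to rule out, that the subtoken metrics only make sense for labels built from letters and `|`, as in method names like `get|name`. The function name `label_report` and the sample rows are hypothetical.

```python
import re
from collections import Counter

# Assumption (hypothesis to check, not confirmed in this thread):
# "name-like" labels consist only of ASCII letters and '|' separators.
NAME_LIKE = re.compile(r"^[a-zA-Z|]+$")


def label_report(lines):
    """Count labels (first whitespace-separated field of each row)
    and split them into name-like labels and everything else."""
    labels = Counter(line.split()[0] for line in lines if line.strip())
    name_like = {lab: n for lab, n in labels.items() if NAME_LIKE.match(lab)}
    other = {lab: n for lab, n in labels.items() if not NAME_LIKE.match(lab)}
    return name_like, other


if __name__ == "__main__":
    # Hypothetical rows for illustration; not taken from the real dataset.
    sample = [
        "get|name a,p1,b c,p2,d",
        "4 a,p1,b",
        "loop_1 x,p3,y",
    ]
    name_like, other = label_report(sample)
    print("name-like labels:", name_like)
    print("other labels:", other)
```

To try it on real data, pass the lines of one of the split files (for example train.c2s) to `label_report`. If most labels land in the second group, that would be a reason to look more closely at how the labels were generated.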

I used astminer to produce the path_contexts.c2s file and split it into three files: train.c2s, test.c2s, and val.c2s. Next, I modified preprocess.sh and got seven .c2v files: xxxx.dict.c2v, xxxx.histo.ori.c2v, xxxx.histo.path.c2v, xxxx.histo.tgt.c2v, xxxx.test.c2v, xxxx.train.c2v, and xxxx.val.c2v. I then ran train.sh with the command source train.sh, but precision, recall, and F1 were all 0.
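
For reference, the split step described above could look roughly like the following. This is a minimal sketch, not the script actually used in this report: the 80/10/10 ratio, the fixed seed, and the function names `split_lines` and `write_split` are assumptions, and it assumes one example per line in path_contexts.c2s.

```python
import random


def split_lines(lines, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle non-empty rows and split them into train/val/test lists.
    The fractions and the seed are illustrative assumptions."""
    rows = [line for line in lines if line.strip()]
    random.Random(seed).shuffle(rows)  # seeded, so the split is reproducible
    n_val = int(len(rows) * val_frac)
    n_test = int(len(rows) * test_frac)
    val = rows[:n_val]
    test = rows[n_val:n_val + n_test]
    train = rows[n_val + n_test:]
    return train, val, test


def write_split(src="path_contexts.c2s"):
    """Read the source file and write train.c2s, val.c2s, and test.c2s
    into the current directory."""
    with open(src, encoding="utf-8") as f:
        train, val, test = split_lines(f.read().splitlines())
    for name, rows in (("train.c2s", train), ("val.c2s", val), ("test.c2s", test)):
        with open(name, "w", encoding="utf-8") as out:
            out.write("\n".join(rows) + "\n")
```

Shuffling before splitting keeps the three files drawn from the same distribution; with a small dataset, it is also worth confirming that the validation file contains enough rows for the reported metrics to be meaningful.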

urialon commented 2 years ago

Hi @WQR53, thank you for your interest in our work!

I don't know the reason, since astminer and neuro-vectorizer are not mine.

However, please check out this PolyCoder paper: https://arxiv.org/pdf/2202.13169.pdf and code: https://github.com/VHellendoorn/Code-LMs where we release a larger model that works for many languages. Specifically, for C, PolyCoder achieves better results than OpenAI's Codex.

Best, Uri