jinfenglin / TraceBERT


The evaluation result on the eTOUR dataset from CoEST seems to be poor #4

Closed zhu762 closed 2 years ago

zhu762 commented 2 years ago

I formatted the eTOUR dataset like the csv example in the second step, and then used the siamese model you provided for the second-step training, but the evaluation showed an F1 score of only 0.11, which is not even as good as VSM. I thought this might be caused by the small amount of eTOUR data, so I took 100 examples from the keras-team/keras dataset you provided for training and evaluation, and the F1 score reached 1.0, so I ruled out that possibility. The figure below shows the parameters I used during training. Could you tell me what might be the reason for the poor evaluation results?

[screenshot: training parameters]
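For completeness, this is roughly how I took the 100-example subsample; the file name and column layout here are only placeholders for whatever the csv example in the second step actually specifies:

```python
# Hypothetical sketch: subsample 100 NL-PL pairs and write them to a csv in
# the same layout as the step-two example. "keras_pairs.csv" and its columns
# are placeholders, not the repo's real schema.
import pandas as pd

pairs = pd.read_csv("keras_pairs.csv")
sample = pairs.sample(n=100, random_state=42)
sample.to_csv("keras_100.csv", index=False)
```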

zhu762 commented 2 years ago

In addition, I added the following two lines of code to the __index_exmaple method of your Example class. I think that without these two lines, reverse_NL_index and reverse_PL_index would always stay empty and would not work in the subsequent checks. Will my modification affect the evaluation result? Finally, thank you for taking the time to look at my problem!

[screenshot: the two added lines]
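The two lines I added are of the following shape (reconstructed by hand here, so the exact variable names may differ from my actual patch):

```python
# Sketch of the two added lines inside __index_exmaple: whenever an artifact
# gets a numeric id, also record the inverse mapping so the numeric id can be
# translated back to the original artifact id later. Names are illustrative.
self.reverse_NL_index[nl_numeric_id] = nl_real_id
self.reverse_PL_index[pl_numeric_id] = pl_real_id
```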

jinfenglin commented 2 years ago

I think there are multiple causes:

  1. The underlying language model (and also the intermediate training) in TraceBERT is trained on Python, while eTour is a Java project.
  2. keras is a project included in CodeSearchNet, which means its function-document pairs participate in the intermediate training. Even though the final fine-tuning set is small, the model has already gained extra knowledge about keras from that intermediate training.

jinfenglin commented 2 years ago

These lines should be added but should have no impact on the evaluation. The reverse_index is used to recover the real-id for the prediction instances when their numeric ids are provided. The model uses numeric ids internally, which are included in the NL_index and PL_index.
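As a simplified sketch of how the lookup is meant to work during evaluation (the names here are illustrative, not the exact ones in the repo):

```python
# Sketch: NL_index / PL_index map real artifact ids to internal numeric ids;
# the reverse indexes undo that mapping so predictions can be reported with
# the real ids again. The example ids below are made up.
NL_index = {"UC12.txt": 0, "UC13.txt": 1}
reverse_NL_index = {num_id: real_id for real_id, num_id in NL_index.items()}

predicted_numeric_id = 1
print(reverse_NL_index[predicted_numeric_id])  # -> "UC13.txt"
```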

zhu762 commented 2 years ago

Thank you for answering my question. I will look for a dataset containing Python code and try again.

I have one more question. I ran the code you provided with the siamese model you supplied, but the results are quite different from those reported in your paper. The F1 score on the flask dataset is 0.76 and on the pgcli dataset 0.851, which seems to be the opposite of the numbers you reported. The F1 score on the keras dataset, however, is 0.964, which is close to your reported value. Is there a problem with this result?

Attached are the evaluation results of the three datasets and the parameter settings of my run. train eval results.zip

jinfenglin commented 2 years ago

I checked the original output file to make sure I did not fill in the wrong columns :) I am not sure what the exact cause is, but the model I uploaded is different from the one I used in the paper. I guess the randomness and the model selection (e.g. which checkpoint to use) have an impact on the downstream task. It is an interesting observation though.
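If you want to reduce the run-to-run variance, pinning the random seeds before training is a reasonable first step. This is only a generic PyTorch/NumPy sketch, not something already wired into the repo:

```python
# Minimal sketch: fix the common sources of randomness so repeated runs are
# more comparable. This does not remove the effect of checkpoint selection,
# only run-to-run noise.
import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # safe no-op on CPU-only machines

set_seed(42)
```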

zhu762 commented 2 years ago

Thank you for your patience, I wish you a happy life.