giganticode / small-datasets-ml-resources

Resources for paper "Making the most of small Software Engineering datasets with modern machine learning"

about CodeBERT/CodeBERTa metric #1

Open little-pikachu opened 2 years ago

little-pikachu commented 2 years ago

I tried to run the code smell detection experiment and found that both the macro F1 score and the micro F1 score are above 82%. But the paper reports 71.2% for CodeBERT. Can the authors help explain this?
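For reference, this is a minimal sketch of how the two metrics in question can be computed with scikit-learn; the labels and predictions here are purely hypothetical placeholders, not data from the experiment:

```python
from sklearn.metrics import f1_score

# Hypothetical gold labels and predictions for a multi-class
# code-smell detection task (placeholder values only).
y_true = [0, 1, 2, 1, 0, 2]
y_pred = [0, 1, 1, 1, 0, 2]

# Macro F1: unweighted mean of the per-class F1 scores,
# so rare classes count as much as frequent ones.
print("macro F1:", f1_score(y_true, y_pred, average="macro"))

# Micro F1: F1 computed over the pooled true/false
# positives and negatives across all classes.
print("micro F1:", f1_score(y_true, y_pred, average="micro"))
```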

furunkel commented 2 years ago

Hi, that's quite a difference. I suppose the other way round would be more problematic. We usually report the mean over several seeds, so with a lucky seed your result might be higher. Different hyper-parameters might also cause this. Did you use the same CodeBERT version, hyper-parameters, and pre-processing, and did you average over 5 seeds?
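A minimal sketch of the seed-averaging described above; `run_experiment` is a hypothetical stand-in for one full train/evaluate cycle (here it only simulates a noisy score around the paper's 71.2% figure for illustration):

```python
import random
import statistics

SEEDS = [100, 200, 300, 400, 500]

def run_experiment(seed: int) -> float:
    """Hypothetical stand-in for one full training/evaluation run
    with the given seed; simulates a noisy macro-F1 score."""
    rng = random.Random(seed)
    return 0.712 + rng.gauss(0.0, 0.03)

scores = [run_experiment(s) for s in SEEDS]

# The reported number is the mean over seeds; a single lucky seed
# can land well above (or below) that mean.
print("per-seed:", [round(s, 3) for s in scores])
print(f"mean: {statistics.mean(scores):.3f} "
      f"+/- {statistics.stdev(scores):.3f}")
```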

little-pikachu commented 2 years ago

Thank you very much for the reply. I used the default seeds in the code (100, 200, 300, 400, 500). Since the experiment records results every 20 steps, I selected the best result for each fold of each seed and averaged them at the end. For the model, I used microsoft/codebert-base rather than huggingface/CodeBERTa-small-v1, in order to evaluate the capabilities of CodeBERT models. I didn't change any other hyper-parameters. For data preprocessing, I followed the paper and did not do any preprocessing, not even for code comments. Could the code comments cause this? I don't know.
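For clarity, the two checkpoints being compared can be loaded like this with Hugging Face Transformers; both model identifiers are real Hub names from the discussion above, while `num_labels` is a hypothetical placeholder for the number of smell classes:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Checkpoint actually used in the run above:
MODEL_NAME = "microsoft/codebert-base"
# Checkpoint shipped with the paper's code:
# MODEL_NAME = "huggingface/CodeBERTa-small-v1"
# Note: the two models differ in size, tokenizer, and pre-training
# data, so their scores are not directly comparable.

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=4,  # hypothetical placeholder for the label count
)
```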