tech-srl / code2vec

TensorFlow code for the neural network presented in the paper: "code2vec: Learning Distributed Representations of Code"
https://code2vec.org
MIT License
1.09k stars 285 forks

Preprocessor step disposing numbers in (variable) names #164

Closed: daveymathijssen closed this issue 1 year ago

daveymathijssen commented 1 year ago

Hello,

Thank you for your awesome work!

I have trained some models from scratch using a C# dataset. However, I have a question about the code2vec C# preprocessor. Some variables within the dataset samples contain a number to make them unique, such as variable_1 and variable_2. However, the SplitToSubtokens method removes these variable numbers during preprocessing. I am interested in the purpose of removing these numbers, because when I alter the SplitToSubtokens method and do not remove the numbers, all metrics seem to improve.
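To make the question concrete, here is a minimal Python sketch of the behaviour being described. It is not the actual `SplitToSubtokens` code from the C# extractor; the function name `split_to_subtokens` and the `keep_digits` flag are hypothetical and exist only for illustration.

```python
import re
from typing import List


def split_to_subtokens(name: str, keep_digits: bool = False) -> List[str]:
    """Split an identifier into lowercase subtokens (hypothetical helper)."""
    # Break on underscores and any other non-alphanumeric characters first.
    parts = re.split(r"[^A-Za-z0-9]+", name)
    subtokens = []
    for part in parts:
        # Split camelCase / PascalCase boundaries and isolate digit runs.
        for piece in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part):
            if piece.isdigit() and not keep_digits:
                continue  # numbers are dropped, as in the behaviour questioned here
            subtokens.append(piece.lower())
    return subtokens


# With digits stripped, both names map to the same subtoken list.
print(split_to_subtokens("variable_1"))
print(split_to_subtokens("variable_2"))
# With digits kept, the two names stay distinguishable.
print(split_to_subtokens("variable_1", keep_digits=True))
print(split_to_subtokens("variable_2", keep_digits=True))
```

With the default setting, `variable_1` and `variable_2` become indistinguishable after preprocessing, which is the information loss the question is about.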

Best regards, Davey

urialon commented 1 year ago

Hi @daveymathijssen, thank you for your interest in our work!

The C# extractor was contributed by researchers from Microsoft, so I am not sure why they removed numbers. I might have removed numbers in the JavaExtractor as well, I'm not sure. But if it improves your metrics when you do not remove the numbers, that's great!

Best, Uri

daveymathijssen commented 1 year ago

Hi @urialon, thanks for your answer!

In the JavaExtractor, the same is happening in the normalizeName method. Do you have any clue why?

urialon commented 1 year ago

I'm guessing that at the time, we did not want to spend embeddings on numbers, as we already had an embedding vocabulary of ~1M entries anyway. This is solved in newer models by segmenting tokens into subwords.
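A toy Python sketch of that trade-off follows. It assumes each distinct name costs one embedding entry, and it uses a greedy longest-match segmenter as a stand-in for subword segmentation; real models typically learn their subword vocabulary (for example with BPE), so the helper names, the sample identifiers, and the `pieces` vocabulary here are all assumptions made for illustration.

```python
import re
from typing import Iterable, List, Set


def whole_token_vocab(names: Iterable[str], strip_digits: bool) -> Set[str]:
    """Vocabulary when every (optionally digit-stripped) name is one entry."""
    vocab = set()
    for name in names:
        token = re.sub(r"[0-9]", "", name) if strip_digits else name
        vocab.add(token.lower())
    return vocab


def segment(name: str, pieces: Set[str]) -> List[str]:
    """Greedy longest-match segmentation over a fixed toy subword vocabulary."""
    text = name.lower()
    out: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first.
        for j in range(len(text), i, -1):
            if text[i:j] in pieces:
                out.append(text[i:j])
                i = j
                break
        else:
            # No known piece starts here: fall back to a single character.
            out.append(text[i])
            i += 1
    return out


names = ["variable_1", "variable_2", "variable_3", "count1", "count2"]
print(len(whole_token_vocab(names, strip_digits=False)))  # 5
print(len(whole_token_vocab(names, strip_digits=True)))   # 2

# A small shared piece vocabulary covers every numbered variant.
pieces = {"variable", "count", "_"} | set("0123456789")
print(segment("variable_12", pieces))  # ['variable', '_', '1', '2']
```

Keeping digits as whole tokens grows the vocabulary with every numbered variant, while stripping them collapses the variants into one entry. With subword pieces, the digits are kept and the vocabulary stays small because the pieces are shared across names.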

You can also check out our newer models

Best, Uri

daveymathijssen commented 1 year ago

Thank you for your fast responses and insights!