tech-srl / code2seq

Code for the model presented in the paper: "code2seq: Generating Sequences from Structured Representations of Code"
http://code2seq.org
MIT License

Generating embeddings for Python and Java #104

Closed Avv22 closed 2 years ago

Avv22 commented 2 years ago

Hello,

Thanks again for your work.

Can you please explain how to use the model to generate embeddings for a source file in Python, and also in Java? Do we have to train your model on a Java dataset and a Python dataset in order to use it to generate embeddings of source code? Also, is it possible to produce fixed-size embeddings, say of dimension 100, represented as numerical data for each file?

urialon commented 2 years ago

Hi Avra, Thank you for your interest in this work! Sorry again for the delayed response.

Yes, in order to use code2vec for python, you will have to train the model on a python dataset. Have you seen this section in the README? https://github.com/tech-srl/code2vec#extending-to-other-languages

Uri

Avv22 commented 2 years ago

@urialon. The astminer team helped me produce training, testing, and validation python.c2v data. How should we proceed next to train the code2vec model? Once the model is trained, can we feed it the same data to produce embeddings? We have 20k Python files that we split into train, test, and validation sets before feeding them to the astminer tool, and once we train code2vec we would like to feed the same Python code back in to produce embeddings. What do you think?
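To sanity-check the .c2v data before training, it can help to inspect what a single example contains. The sketch below is a minimal parser, assuming the standard code2vec .c2v layout (a target label followed by space-separated path-contexts of the form `start_token,path,end_token`); verify the exact format against the files astminer produced for you.

```python
# Minimal sketch of parsing one .c2v example, assuming the standard
# code2vec layout: "label ctx1 ctx2 ..." where each context is
# "start_token,path,end_token".
def parse_c2v_line(line):
    """Split one .c2v example into its label and its path-contexts."""
    parts = line.strip().split(" ")
    label, raw_contexts = parts[0], parts[1:]
    contexts = [tuple(c.split(",")) for c in raw_contexts if c]
    return label, contexts

# Hypothetical example line, for illustration only:
label, contexts = parse_c2v_line("get|name x,123,y a,456,b")
print(label)     # the method-name label
print(contexts)  # [('x', '123', 'y'), ('a', '456', 'b')]
```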

urialon commented 2 years ago

I am not sure what your python.c2v files look like, but try to continue running the preprocess.sh script starting from this line: https://github.com/tech-srl/code2seq/blob/master/preprocess.sh#L54 (and adapt all the paths to your files).
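Conceptually, one thing that preprocessing stage does is cap each example at a fixed number of path-contexts. The sketch below illustrates that idea only, assuming the sampling-and-padding behavior described in the README; the names `MAX_CONTEXTS` and `limit_contexts` are hypothetical and not part of the repo's scripts.

```python
import random

# Illustrative sketch: limit each example to a fixed number of
# path-contexts, sampling when there are too many and padding with
# empty contexts when there are too few. MAX_CONTEXTS is small here
# for readability; the real scripts use a much larger value.
MAX_CONTEXTS = 4

def limit_contexts(contexts, max_contexts=MAX_CONTEXTS):
    if len(contexts) > max_contexts:
        return random.sample(contexts, max_contexts)
    return contexts + [""] * (max_contexts - len(contexts))

print(limit_contexts(["a,1,b", "c,2,d"]))  # padded to length 4
```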

Avv22 commented 2 years ago

@urialon.

Thank you. So I will train your model on the 150k Python dataset. How can I save the model so that I can use it later on another Python dataset to generate embeddings? Does preprocess.sh do that automatically and save the model?

Also, once we train the model on the 150k Python dataset you mentioned, we would like to use it to generate one embedding vector for each Python file in our own dataset. Can we do that? We don't want to generate method names, just one embedding that is representative of each file. We would like to do the same for our 20k Python files.
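One common way to get a single file-level vector from a model that produces per-method vectors is simply to average them. The sketch below assumes you have already obtained one vector per method (for example via code2vec's code-vector export, described in its README) and is not part of either repo's API:

```python
# Illustrative sketch: collapse per-method vectors into one
# file-level embedding by taking their element-wise mean.
def file_embedding(method_vectors):
    """Element-wise mean of per-method vectors -> one vector per file."""
    dim = len(method_vectors[0])
    n = len(method_vectors)
    return [sum(v[i] for v in method_vectors) / n for i in range(dim)]

# Two hypothetical 3-dimensional method vectors from one file:
vectors = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
print(file_embedding(vectors))  # [2.0, 3.0, 4.0]
```

Averaging is only one pooling choice; max-pooling or weighting methods by size are alternatives, depending on what the file embedding is used for.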

urialon commented 2 years ago

Hi @Avra2,

preprocess.sh just preprocesses the data, it does not even train the model. However, train.sh trains and saves the checkpoints. See: https://github.com/tech-srl/code2vec#training-a-model-from-scratch