tech-srl / code2vec

TensorFlow code for the neural network presented in the paper: "code2vec: Learning Distributed Representations of Code"
https://code2vec.org
MIT License

Integrating astminer with code2vec for C source codes #141

Closed RichardZapanta closed 2 years ago

RichardZapanta commented 2 years ago

Hello!

I was able to extract a path_contexts.c2s file using astminer. However, my goal is to extract code vectors from the given source code. I don't know how to use path_contexts.c2s with code2vec. What are the next steps, and which files do I need to modify?

Thank you!

urialon commented 2 years ago

Hi @RichardZapanta , Thank you for your interest in our work.

Did you see these sections of the README? https://github.com/tech-srl/code2vec#extending-to-other-languages and https://github.com/tech-srl/code2vec#exporting-the-code-vectors-for-the-given-code-examples

Did you also see this: https://github.com/tech-srl/code2vec/issues/60 ?

Best, Uri
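
For readers following the README section on extending to other languages: the expected input is a text file with one example per line, where the first field is the target label (subtokens joined by `|`) and each remaining space-separated field is a context of the form `left_token,path,right_token`. Below is a minimal sketch of a parser for one such line, which can help sanity-check extractor output before preprocessing. The function name `parse_example` and the sample line are hypothetical and are not part of code2vec or astminer.

```python
from typing import List, Tuple

Context = Tuple[str, str, str]


def parse_example(line: str) -> Tuple[List[str], List[Context]]:
    """Split one example line into label subtokens and (left, path, right) contexts."""
    fields = line.strip().split(" ")
    label_subtokens = fields[0].split("|")
    contexts: List[Context] = []
    for field in fields[1:]:
        if not field:
            continue  # consecutive spaces (padding) produce empty fields
        parts = field.split(",")
        if len(parts) != 3:
            raise ValueError(f"malformed context: {field!r}")
        contexts.append((parts[0], parts[1], parts[2]))
    return label_subtokens, contexts


# Hypothetical sample line: label "get|max" followed by two contexts.
sample = "get|max a,101,b b,202,max"
label, contexts = parse_example(sample)
print(label, contexts)
```

A line that raises `ValueError` here would probably also be misread by later stages, so a check like this is cheap to run over a few lines of the extractor output first.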

RichardZapanta commented 2 years ago

Hi @urialon!

I went over these sections of the README and also checked some of the previously reported issues. Based on what I understand, I need to train the model on C source code first in order to export its code vectors. Is this correct?

If so, is there a way to export the code vectors of C source code without training the model on our own dataset?

Thank you!

urialon commented 2 years ago

Hi @RichardZapanta , Yes, you will need a model that was trained on C data first.

The code vectors are meaningless without training. Best, Uri

RichardZapanta commented 2 years ago

Hi @urialon !

I successfully preprocessed my data using astminer and a modified preprocess.sh. This is what it printed in the terminal:

Screen Shot 2022-01-27 at 1 38 09 PM

The data folder now contains these files:

Screen Shot 2022-01-27 at 1 38 24 PM

Then I proceeded to train the model from scratch, after changing some values in train.sh. However, I encountered this error:

IndexError: list index out of range

This is what my current train.sh file looks like:

Screen Shot 2022-01-27 at 1 49 40 PM

Below is the screenshot from the terminal:

Screen Shot 2022-01-27 at 1 39 28 PM

Two more files also appeared in the data folder:

Screen Shot 2022-01-27 at 1 39 36 PM

I only edited preprocess.sh and train.sh and left the rest untouched. What are the possible workarounds for this issue?

Thank you!

urialon commented 2 years ago

Hi, let's try skipping the "filter impossible names" step by replacing this line: https://github.com/tech-srl/code2vec/blob/master/tensorflow_model.py#L460

with:

prediction = top_words[0]

Let me know how it goes. Best, Uri
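
The IndexError reported above is consistent with a filtering step that leaves no candidates and then indexes the empty result. The sketch below illustrates that failure mode and a fallback to the raw top prediction. It is not the repository's code: `filter_candidates`, `pick_prediction`, and the filtering rule itself are hypothetical and are used only to show the pattern.

```python
from typing import List


def filter_candidates(top_words: List[str]) -> List[str]:
    """Hypothetical filter: keep names made only of letters and '|' separators."""
    return [w for w in top_words if w.replace("|", "").isalpha()]


def pick_prediction(top_words: List[str]) -> str:
    """Prefer the first candidate that passes the filter; otherwise use the top word."""
    legal = filter_candidates(top_words)
    if legal:
        return legal[0]
    # Indexing `legal[0]` here would raise IndexError, so fall back instead.
    return top_words[0]


print(pick_prediction(["<PAD>", "123"]))      # no candidate passes the filter
print(pick_prediction(["<PAD>", "get|max"]))  # one candidate passes the filter
```

With a small or unusual vocabulary, such as one built from a small C dataset, it is plausible that every top candidate is rejected by a filter of this kind, which is when the fallback matters.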

RichardZapanta commented 2 years ago

Hi @urialon,

I encountered a new error after following that instruction:

ZeroDivisionError: float division by zero

It occurs on line 493 of tensorflow_model.py. Below is the screenshot.

Screen Shot 2022-01-31 at 6 09 54 AM

Thank you!

urialon commented 2 years ago

Hi @RichardZapanta , I just fixed that, can you please pull again? Thank you for reporting this!
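
A "float division by zero" in an evaluation routine typically arises when a precision, recall, or F1 denominator is zero, for example when nothing is predicted correctly. Below is a minimal sketch of a guarded computation, assuming counts of true positives, false positives, and false negatives are available. The function name and signature are hypothetical, and this is not necessarily how the fix in the repository was made.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1, returning 0.0 where a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator > 0 else 0.0
    return precision, recall, f1


print(precision_recall_f1(0, 0, 0))    # all denominators are zero
print(precision_recall_f1(1, 10, 10))  # a low but non-zero score
```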

RichardZapanta commented 2 years ago

Hi @urialon,

Thank you very much for this. I was able to train the model; however, I have some concerns.

  1. Will we be able to export code vectors correctly if we get very low validation results (precision, F1 and recall)?
  2. Is there a way to change the path of the input code so that it is not Input.java?
  3. Is there a way to export code vectors into a text file (.txt)?
  4. Which file will be our final model after training? (See the available files below.)
Screen Shot 2022-02-03 at 4 23 55 PM

Thank you very much once again!

urialon commented 2 years ago

Hi @RichardZapanta , Here are some answers to your questions:

Will we be able to export code vectors correctly if we get very low validation results (precision, F1 and recall)?

Technically yes, but these vectors might not be that "good" for downstream tasks.

Is there a way to change the path of the input code so that it is not Input.java?

Yes, here: https://github.com/tech-srl/code2vec/blob/master/interactive_predict.py#L29

Is there a way to export code vectors into a text file (.txt)?

Yes, see https://github.com/tech-srl/code2vec#exporting-the-code-vectors-for-the-given-code-examples

Which file will be our final model after training? (See the available files below.)

It is your choice, but the common practice is to take the one that has the best validation accuracy according to the training logs.

Best, Uri
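
Picking the checkpoint with the best validation score can be automated by scanning the training log. The sketch below assumes the log contains evaluation lines in the form quoted later in this thread ("After N epochs -- top10_acc: ..., precision: ..., recall: ..., F1: ...") and uses F1 as the selection criterion, which is a choice made here for illustration. The helper name `best_epoch` and the sample log values are hypothetical.

```python
import re
from typing import Iterable, Optional, Tuple

# Matches evaluation lines such as:
# "After 2 epochs -- top10_acc: [...], precision: 0.09, recall: 0.09, F1: 0.09"
LOG_PATTERN = re.compile(r"After (\d+) epochs -- .*?F1: ([0-9.]+(?:[eE][+-]?\d+)?)")


def best_epoch(lines: Iterable[str]) -> Optional[Tuple[int, float]]:
    """Return (epoch, F1) for the evaluation line with the highest F1, or None."""
    best: Optional[Tuple[int, float]] = None
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # not an evaluation line
        epoch, f1 = int(match.group(1)), float(match.group(2))
        if best is None or f1 > best[1]:
            best = (epoch, f1)
    return best


# Made-up log lines for illustration only.
sample_log = [
    "After 1 epochs -- top10_acc: [0. 0.], precision: 0.05, recall: 0.04, F1: 0.044",
    "After 2 epochs -- top10_acc: [0. 0.], precision: 0.09, recall: 0.09, F1: 0.09",
    "some unrelated log line",
]
print(best_epoch(sample_log))
```

The returned epoch number can then be matched against the saved checkpoint files by hand.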

RichardZapanta commented 2 years ago

Hi @urialon ,

Thank you for the quick response. We encountered some problems when trying to extract code vectors from C source code. After training the model from scratch using source train.sh, we got very low (almost 0) validation results and also got zero when evaluating the trained model. This is because the data that we used to train the model consists of multiple C programs that solve the same problem, since our study focuses on identifying source code similarity.

We did the following steps:

  1. We preprocessed our data using source preprocess.sh with the help of astminer.
  2. We trained the model using source train.sh and obtained these files:
Screen Shot 2022-02-07 at 11 35 17 PM
  3. We released the model from the 2nd iteration, which had this result:

After 2 epochs -- top10_acc: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], precision: 0.09090909090909091, recall: 0.09090909090909091, F1: 0.09090909090909091

and obtained these three files:

Screen Shot 2022-02-07 at 11 37 40 PM
  4. We changed the input file to contain these lines of code:
Screen Shot 2022-02-07 at 11 39 30 PM
  5. We ran python3 code2vec.py --load models/AcerTrial/saved_model_iter2.release --predict --export_code_vectors and got this error:
Screen Shot 2022-02-07 at 11 40 50 PM

With that being said, are there any steps that we did wrong or skipped, and what can we do to fix this issue? Are there any data files that we need so that the model can interpret the input file as C source code and not as Java?

Thank you!

P.S. accidentally closed the issue. Apologies for this.

urialon commented 2 years ago

Hi @RichardZapanta ,

It seems that the model did not learn anything useful. You can either train longer (why only 2 epochs?), or try code2seq.

Additionally, using the --predict option will not work on C code, because it expects to parse Java.

Best, Uri
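
Since the stated goal is source-code similarity, one way to use exported vectors is to compare them pairwise. The sketch below assumes the vectors have already been exported to a text file with one space-separated vector per line, as described in the README section on exporting code vectors. The sample values are made up, and the helper names `parse_vectors` and `cosine_similarity` are hypothetical.

```python
import math
from typing import List, Sequence


def parse_vectors(text: str) -> List[List[float]]:
    """Parse one space-separated vector per non-empty line."""
    return [[float(x) for x in line.split()] for line in text.splitlines() if line.strip()]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up stand-in for the contents of an exported vectors file.
sample = "1.0 0.0 2.0\n2.0 0.0 4.0\n0.0 3.0 0.0\n"
vectors = parse_vectors(sample)
print(cosine_similarity(vectors[0], vectors[1]))  # parallel vectors
print(cosine_similarity(vectors[0], vectors[2]))  # orthogonal vectors
```

As noted earlier in the thread, vectors from a model that has not learned anything useful will not give meaningful similarities, however they are compared.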

urialon commented 2 years ago

Hi @RichardZapanta , We just released a model that works better than OpenAI's Codex on C.

https://arxiv.org/pdf/2202.13169.pdf https://github.com/VHellendoorn/Code-LMs

Best, Uri

RichardZapanta commented 2 years ago

Hi @urialon ,

Can Code-LMs extract code vectors for C source code?

Thank you.