Integrating astminer with code2vec for C source codes

RichardZapanta commented 2 years ago

Hello!

I was able to extract a path_contexts.c2s file using astminer. However, my goal is to extract code vectors from the given source code, and I don't know how to use path_contexts.c2s with code2vec. What are the next steps, and which files do I need to modify?

Thank you!

urialon commented 2 years ago

Hi @RichardZapanta , Thank you for your interest in our work.

Did you see these sections of the README? https://github.com/tech-srl/code2vec#extending-to-other-languages and https://github.com/tech-srl/code2vec#exporting-the-code-vectors-for-the-given-code-examples

Did you also see this: https://github.com/tech-srl/code2vec/issues/60 ?

Best, Uri
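For readers following the "extending to other languages" pointer, here is a minimal sketch of reading one example in the text format that section describes, assuming each row is a space-delimited list whose first field is the target label and whose remaining fields are contexts made of three comma-separated components. The names `Example` and `parse_example` and the sample line are hypothetical and only for illustration; whether a given astminer output file matches this layout exactly should be checked against the file itself.

```python
from typing import List, NamedTuple, Tuple


class Example(NamedTuple):
    label: str
    contexts: List[Tuple[str, str, str]]


def parse_example(line: str) -> Example:
    """Parse one row: a label followed by space-separated contexts."""
    fields = line.strip().split(" ")
    label, raw_contexts = fields[0], fields[1:]
    contexts = []
    for raw in raw_contexts:
        if not raw:
            continue  # skip empty fields produced by repeated spaces
        parts = raw.split(",")
        if len(parts) != 3:
            raise ValueError(f"expected 3 comma-separated parts, got {raw!r}")
        contexts.append((parts[0], parts[1], parts[2]))
    return Example(label, contexts)


# Hypothetical row, not taken from a real dataset.
sample = "get|max a,101,b b,202,max"
example = parse_example(sample)
print(example.label, len(example.contexts))
```

A quick check like this on a few rows can catch malformed contexts before preprocessing and training.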

RichardZapanta commented 2 years ago

Hi @urialon!

I went over these sections of the README and also checked some previously reported issues. Based on what I understand, I need to train the model on C source code first in order to export code vectors. Is this correct?

If so, is there a way to export the code vectors of C source code without training the model on our dataset?

Thank you!

urialon commented 2 years ago

Hi @RichardZapanta , Yes, you will need a model that was trained on C data first.

The code vectors are meaningless without training. Best, Uri

RichardZapanta commented 2 years ago

Hi @urialon !

I successfully preprocessed my data using astminer and a modified preprocess.sh. It printed this in the terminal.

The data folder now contains these files.

Then I proceeded to train the model from scratch, after changing some values in train.sh. However, I encountered this error:

IndexError: list index out of range

This is what my current train.sh file looks like.

Below is the screenshot from the terminal.

The data folder also contains two more files.

I only edited preprocess.sh and train.sh and left the rest untouched. What are the possible workarounds for this issue?

Thank you!

urialon commented 2 years ago

Hi, let's try skipping the "filter impossible names" step by replacing this line: https://github.com/tech-srl/code2vec/blob/master/tensorflow_model.py#L460

with:

prediction = top_words[0]

Let me know how it goes. Best, Uri
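To illustrate why an `IndexError: list index out of range` can come from this kind of line, here is a minimal sketch, assuming the original code takes the first element of a filtered candidate list. If the filter rejects every candidate, the list is empty and indexing `[0]` fails. `filter_names` and `first_or_none` are hypothetical stand-ins, not the repository's functions, and the real filtering rule may differ.

```python
from typing import List, Optional


def filter_names(candidates: List[str]) -> List[str]:
    """Hypothetical filter: keep names made only of letters and '|' separators."""
    return [name for name in candidates if name.replace("|", "").isalpha()]


def first_or_none(candidates: List[str]) -> Optional[str]:
    """Return the first surviving candidate, or None when all were filtered out."""
    filtered = filter_names(candidates)
    return filtered[0] if filtered else None


# Every candidate is rejected, so the filtered list is empty.
top_words = ["<PAD>", "<OOV>"]
try:
    prediction = filter_names(top_words)[0]
except IndexError as error:
    print("IndexError:", error)

# A guarded lookup avoids the crash.
print(first_or_none(top_words))
```

Taking `top_words[0]` directly, as suggested above, sidesteps the empty-list case by skipping the filter altogether.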

RichardZapanta commented 2 years ago

Hi @urialon,

After making that change, I encountered a new error:

ZeroDivisionError: float division by zero

on line 493 of tensorflow_model.py. See the screenshot below.

Thank you!

urialon commented 2 years ago

Hi @RichardZapanta , I just fixed that, can you please pull again? Thank you for reporting this!
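A `ZeroDivisionError` of this kind typically appears when precision, recall, or F1 is computed from counts that are all zero. Below is a minimal sketch of a guarded computation. It is an illustration under that assumption, not the change that was pushed to the repository, and `safe_div` and `precision_recall_f1` are hypothetical names.

```python
from typing import Tuple


def safe_div(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def precision_recall_f1(true_positive: int, false_positive: int,
                        false_negative: int) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 from raw counts without dividing by zero."""
    precision = safe_div(true_positive, true_positive + false_positive)
    recall = safe_div(true_positive, true_positive + false_negative)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return precision, recall, f1


# With no correct predictions at all, every metric falls back to 0.0.
print(precision_recall_f1(0, 0, 0))
print(precision_recall_f1(1, 1, 1))
```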

RichardZapanta commented 2 years ago

Hi @urialon,

Thank you very much for this. I was able to train the model; however, I have some questions:

Will we be able to export code vectors correctly if we get very low validation results (precision, recall, and F1)?
Is there a way to use a different input file instead of Input.java?
Also, is there a way to export code vectors into a text file (.txt)?
Which file will be our final model after training? (See the available files below.)

Thank you very much once again!

urialon commented 2 years ago

Hi @RichardZapanta , Here are some answers to your questions:

Will we be able to export code vectors correctly if we get very low validation results (precision, recall, and F1)?

Technically yes, but these vectors might not be that "good" for downstream tasks.

Is there a way to use a different input file instead of Input.java?

Yes, here: https://github.com/tech-srl/code2vec/blob/master/interactive_predict.py#L29

Also, is there a way to export code vectors into a text file (.txt)?

Yes, see https://github.com/tech-srl/code2vec#exporting-the-code-vectors-for-the-given-code-examples

Which file will be our final model after training? (See the available files below.)

It is your choice, but the common practice is to take the one with the best validation accuracy according to the training logs.

Best, Uri
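Picking the checkpoint with the best validation score can be done by scanning the training log. The sketch below assumes evaluation lines of the form `After N epochs -- ..., F1: x` and selects the epoch with the highest F1. `best_epoch` is a hypothetical helper, and all but the first sample line are made up for illustration.

```python
import re
from typing import Iterable, Optional, Tuple

# Matches the epoch number and the F1 value on an evaluation line.
LINE_RE = re.compile(
    r"After (\d+) epochs? -- .*F1: (\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def best_epoch(lines: Iterable[str]) -> Optional[Tuple[int, float]]:
    """Return (epoch, F1) for the line with the highest F1, or None if no line matches."""
    best = None
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        epoch, f1 = int(match.group(1)), float(match.group(2))
        if best is None or f1 > best[1]:
            best = (epoch, f1)
    return best


sample_log = [
    "After 2 epochs -- top10_acc: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], "
    "precision: 0.09090909090909091, recall: 0.09090909090909091, "
    "F1: 0.09090909090909091",
    # Hypothetical lines below, not real results.
    "After 3 epochs -- top10_acc: [0.1 0.2], precision: 0.3, recall: 0.2, F1: 0.24",
    "After 4 epochs -- top10_acc: [0.1 0.2], precision: 0.2, recall: 0.2, F1: 0.2",
]
print(best_epoch(sample_log))
```

The matching saved model would then be the one whose iteration number corresponds to the selected epoch.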

RichardZapanta commented 2 years ago

Hi @urialon ,

Thank you for the quick response. We encountered some problems when trying to extract code vectors from C source code. After training the model from scratch using source train.sh, we got very low (almost 0) validation results, and also got zero when evaluating the trained model. This is because the training data consists of multiple C programs that solve the same problem, since our study focuses on identifying source code similarity.

We did the following steps:

Preprocessed our data using source preprocess.sh with the help of astminer.
Trained the model using source train.sh, and obtained these files.

We released the model of the 2nd iteration, which had this result:

After 2 epochs -- top10_acc: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.], precision: 0.09090909090909091, recall: 0.09090909090909091, F1: 0.09090909090909091

and obtained these three files.

We changed the input file to contain these lines of code.

We ran python3 code2vec.py --load models/AcerTrial/saved_model_iter2.release --predict --export_code_vectors and obtained this error.

With that being said, are there any steps that we did wrong or skipped, and what can we do to fix this issue? Are there any data files that we need so that the model interprets the input file as C source code rather than Java?

Thank you!

P.S. accidentally closed the issue. Apologies for this.

urialon commented 2 years ago

Hi @RichardZapanta ,

It seems that the model did not learn anything useful. You can either train longer (why only 2 epochs?), or try code2seq.

Additionally, using the --predict option will not work on C code, because it expects to parse Java.

Best, Uri
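Since the stated goal is source code similarity, exported vectors can be compared with cosine similarity once they are available. The sketch below assumes the exported file holds one vector per line as space-separated numbers, which should be verified against the README section linked earlier. `load_vectors` and `cosine_similarity` are hypothetical helpers, and the sample rows are made up.

```python
import math
from typing import Iterable, List


def load_vectors(lines: Iterable[str]) -> List[List[float]]:
    """Parse one vector per non-empty line of space-separated numbers."""
    return [[float(value) for value in line.split()]
            for line in lines if line.strip()]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up rows standing in for the contents of an exported vectors file.
rows = ["1.0 0.0", "2.0 0.0", "", "0.0 3.0"]
vectors = load_vectors(rows)
print(cosine_similarity(vectors[0], vectors[1]))
print(cosine_similarity(vectors[0], vectors[2]))
```

In practice the rows would come from iterating over the opened file, for example `load_vectors(open(path))` with `path` pointing at the exported file.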

urialon commented 2 years ago

Hi @RichardZapanta , We just released a model that works better on C than OpenAI's Codex.

https://arxiv.org/pdf/2202.13169.pdf https://github.com/VHellendoorn/Code-LMs

Best, Uri

RichardZapanta commented 2 years ago

Hi @urialon ,

Can Code-LMs extract code vectors for C source code?

Thank you.

Integrating astminer with code2vec for C source codes #141