Dataset Format and C2v file

ShaliniR11 commented 2 years ago

Hi Dr. Alon, I am doing research at my university, and we are trying to use the code2vec model. Could you please answer the following questions for me:

1. Can I use snippets of code in .java files inside the train, test and val directories as my own dataset for the model?
2. Will the model generate the .c2v files, or do I need to feed them to the model?
3. You mentioned that the .c2v files contain a hashed version of the AST path. Is it possible to get a non-hashed version from this file in order to see the original path?

Thanks in advance.

urialon commented 2 years ago

Hi @ShaliniR11, thank you for your interest in our work!

1. Yes, of course! But you will need to preprocess them before feeding them to the model. See here: https://github.com/tech-srl/code2vec#creating-and-preprocessing-a-new-java-dataset
2. Our preprocessing script converts raw .java files into .c2v files that the model can load.
3. Yes, you can add a --no_hash flag by changing the following lines https://github.com/tech-srl/code2vec/blob/master/JavaExtractor/extract.py#L28-L30 into:
```
command = ['java', '-cp', args.jar, 'JavaExtractor.App',
           '--max_path_length', str(args.max_path_length), '--max_path_width', str(args.max_path_width),
           '--dir', dir, '--num_threads', str(args.num_threads), '--no_hash']
```
Note, however, that the models we provide in this repository were trained with hashed paths. So, to use non-hashed paths, you will need to re-run the preprocessing step and re-train a model without hashing.
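To inspect the paths after preprocessing, a small parser along the following lines may help. It is only a sketch, assuming the layout described in the README: each line holds a method-name label followed by space-separated contexts, and each context is a `left_token,path,right_token` triple. The helper name `parse_c2v_line` and both sample lines (including the path string) are made up for illustration and are not taken from a real dataset.

```python
from typing import List, NamedTuple, Tuple


class Context(NamedTuple):
    left_token: str
    path: str          # an integer-like string when hashed, the readable path with --no_hash
    right_token: str


def parse_c2v_line(line: str) -> Tuple[str, List[Context]]:
    """Split one line into (label, contexts). Hypothetical helper, not part of code2vec."""
    fields = line.split()  # split() without arguments also drops padding whitespace
    if not fields:
        raise ValueError("empty line")
    label, raw_contexts = fields[0], fields[1:]
    contexts = []
    for raw in raw_contexts:
        # Peel off the first and the last comma, so the path is whatever lies between them.
        left, rest = raw.split(",", 1)
        path, right = rest.rsplit(",", 1)
        contexts.append(Context(left, path, right))
    return label, contexts


# Illustrative lines only: the hash value and the path string are invented.
hashed_line = "get|name this,-1234567,name   "
readable_line = "get|name this,(ThisExpr0)^(FieldAccessExpr)_(NameExpr1),name"

for sample in (hashed_line, readable_line):
    label, contexts = parse_c2v_line(sample)
    print(label, [c.path for c in contexts])
```

With hashed data the `path` field is just a number, so the readable form only shows up in files produced with `--no_hash`.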

Best, Uri

ShaliniR11 commented 2 years ago

Thank you for the quick response! I also wanted to ask: the code2vec model generates all of these AST paths for a given piece of code, and it also selects the AST path with the highest attention, right? Is there a way I could extract this AST path (the one with the highest attention) separately for a given snippet of code?

urialon commented 2 years ago

Not exactly: it does not "select" any AST paths, it uses all of them. The "selection" is implicit: the model scores every path internally and weights each one according to its score.

Regarding extracting the top-attended AST paths: did you try this part of the README? https://github.com/tech-srl/code2vec#step-4-manual-examination-of-a-trained-model I think it implements what you are looking for.
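The implicit weighting described above can be illustrated with a toy sketch. This is not the repository's TensorFlow code. It assumes the usual formulation: each context vector is scored by a dot product with an attention vector, the scores go through a softmax, and the code vector is the weighted sum. All names and numbers below are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(context_vectors: Sequence[Sequence[float]],
           attention_vector: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return (attention weights, weighted-sum code vector)."""
    # Score every context; none is discarded.
    scores = [sum(c * a for c, a in zip(vec, attention_vector))
              for vec in context_vectors]
    weights = softmax(scores)
    dim = len(attention_vector)
    code_vector = [sum(w * vec[d] for w, vec in zip(weights, context_vectors))
                   for d in range(dim)]
    return weights, code_vector


def top_attended(paths: Sequence[str], weights: Sequence[float],
                 k: int = 1) -> List[Tuple[str, float]]:
    """Rank paths by attention weight, highest first, and keep the top k."""
    ranked = sorted(zip(paths, weights), key=lambda pw: pw[1], reverse=True)
    return ranked[:k]


# Toy inputs, invented for illustration.
paths = ["path_a", "path_b", "path_c"]
context_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
attention_vector = [2.0, 0.0]

weights, code_vector = attend(context_vectors, attention_vector)
print(top_attended(paths, weights, k=1))
```

Every path contributes to `code_vector`, and "the path with the highest attention" is simply the entry with the largest weight, which can be read off after the fact.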

ShaliniR11 commented 2 years ago

Thank you!!

Dataset Format and C2v file #157