tech-srl / code2vec

TensorFlow code for the neural network presented in the paper: "code2vec: Learning Distributed Representations of Code"
https://code2vec.org
MIT License

Requesting help in Dataset format to get vectors for the AST tree #144

Closed Veeraraghavans closed 2 years ago

Veeraraghavans commented 2 years ago

Hello everyone,

I really appreciate the code2vec work. I am trying to build an AI model that understands code functionality so that we can map it to our issue platform, and I came across your solution, which is quite useful for us. Since most of our code is in Python and Go, I cannot use your code directly. So I used ASTMiner to convert the code into ASTs, and I want to pass its output into your model to get vectors.

ASTminer path_context output is shown below.

image

Here are my questions:

  1. Do I need to replace the path_context information with node_type information?
  2. Can you share a template of the data format you used?
  3. The command I used was python3 code2vec.py --data <Output_Pathcontext_ASTMiner>, but I got an error asking for a dictionary file. Please correct me if this is wrong.

Thanks in advance to your team. Sorry if my question is naive or incomplete. I am happy to clarify anything that isn't clear.

urialon commented 2 years ago

Hi @Veeraraghavans , Thank you for your interest in our work and for your kind words!

I am guessing that you need to run the preprocess.sh script on the outputs of ASTMiner, starting from this line: https://github.com/tech-srl/code2vec/blob/master/preprocess.sh#L50

But I am not able to further support ASTMiner; It was not created by me and I cannot control its outputs. You can run preprocess.sh and set {TRAIN,TEST_VAL}_DIR to JavaExtractor/ (the code directory of our Java extractor) and compare the results, including the intermediate files.

Let me know how it went! Best, Uri
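Before comparing intermediate files, it can help to check that each raw line has the shape the preprocessing step appears to expect. The sketch below assumes the raw format is one example per line: a target label, followed by space-separated contexts, each a comma-separated triplet of left token, path, and right token. The function name check_context_line is hypothetical and is not part of code2vec or ASTMiner.

```python
def check_context_line(line):
    """Return a list of problems found in one raw example line.

    Assumed format (not taken from the code2vec scripts themselves):
        <label> <left,path,right> <left,path,right> ...
    """
    fields = line.rstrip("\n").split(" ")
    if not fields[0]:
        return ["missing target label"]

    problems = []
    if len(fields) < 2:
        problems.append("no path-contexts after the label")

    for i, ctx in enumerate(fields[1:], start=1):
        if ctx == "":
            continue  # tolerate extra spaces used as padding
        parts = ctx.split(",")
        if len(parts) != 3:
            problems.append(
                f"context {i} has {len(parts)} comma-separated parts, expected 3"
            )
    return problems


if __name__ == "__main__":
    # A well-formed line produces no problems.
    print(check_context_line("get|name this,P1,name name,P2,return"))
    # A line whose contexts are not triplets is reported.
    print(check_context_line("get|name 17 42"))
```

Running this over the first few lines of the extractor output shows quickly whether the two extractors write the same shape of data.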

Veeraraghavans commented 2 years ago

Thanks @urialon for your reply. I will try testing out and update the details here

Veeraraghavans commented 2 years ago

I executed preprocess.sh after placing the data files in the structure the script requests (train, val, test). No histogram files with values are created (the .c2v file is empty), and because of that the total value is zero.

Data directory with data file: image

Preview of Data: image

Output of execution: image

ma0889 commented 2 years ago

Hello Veeraraghavans, thank you for opening this issue. I have a related question about ASTMiner and code2vec. I have an OpenMP dataset written in C, and I need to convert the code into ASTs and decompose them to get the vectors. I used Clang to get the AST; however, I have no idea how to convert the AST to tokens.

Sorry for bringing this up!

Veeraraghavans commented 2 years ago

@ma0889 It's fine, as we have some code in C as well. It will be helpful for us too.

urialon commented 2 years ago

Hi @Veeraraghavan ,

I cannot really tell the problem, but the histograms are created in the following lines:

https://github.com/tech-srl/code2vec/blob/master/preprocess.sh#L55-L58

Can you run the "preprocess.sh" script line-by-line and see why these histograms are not created?

Best, Uri
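To see whether the raw extractor output can produce any histogram entries at all, the counting can be reproduced outside the shell script. This is a minimal sketch, not the script's own code: it assumes each raw line is a target label followed by space-separated left,path,right triplets, and the function name build_histograms is hypothetical.

```python
from collections import Counter


def build_histograms(lines):
    """Count target labels, tokens, and paths in raw example lines.

    Assumed line format: <label> <left,path,right> <left,path,right> ...
    Contexts that are not comma-separated triplets are skipped.
    """
    targets, tokens, paths = Counter(), Counter(), Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        targets[fields[0]] += 1
        for ctx in fields[1:]:
            parts = ctx.split(",")
            if len(parts) != 3:
                continue
            left, path, right = parts
            tokens[left] += 1
            tokens[right] += 1
            paths[path] += 1
    return targets, tokens, paths


if __name__ == "__main__":
    sample = ["get|name a,P1,b b,P2,c", "set|name a,P1,c"]
    targets, tokens, paths = build_histograms(sample)
    print(len(targets), len(tokens), len(paths))
```

If all three counters come back empty for the real file, the raw file is either empty or in a different format, which would also leave the histogram files without values.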


Veeraraghavans commented 2 years ago

@urialon Sure, I will try executing it line by line and update.

@ma0889 I am working mainly with Python. I will try with C files if possible and update here.

Sorry for the late reply.

Veeraraghavans commented 2 years ago

@urialon Can you share the directory structure that preprocess.sh is looking for? I am unclear on it, and I think that is why the histogram file is not created.

image

urialon commented 2 years ago

Hi @Veeraraghavans , The snippet that you pasted exactly describes the expected structure. I'm not sure how to explain it better than that snippet.

You can download another dataset such as Java-small here: https://github.com/tech-srl/code2vec#java-small-compressed-366mb-extracted-19gb and examine the structure.

Let me know if anything is still unclear. Best, Uri

Veeraraghavans commented 2 years ago

@urialon Thanks, I now understand the directory structure the code uses. The histogram is still not created; I am looking into it and will update you soon.

Also, how is code2seq different from code2vec? code2seq has a Python extractor; would that help me replace ASTMiner?

urialon commented 2 years ago

Code2seq is much stronger, and yes, you can check its python extractor. I didn't use it myself, it was contributed, but I checked it and it looked good.

Veeraraghavans commented 2 years ago

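I tried running the Python extractor (https://github.com/tech-srl/code2seq/tree/master/Python150kExtractor), but on my local machine it gets killed after a few minutes, similar to this issue: https://github.com/tech-srl/code2seq/issues/106. On my VM it ran for a few hours and then got killed; I couldn't find any logs written.

Output when I run the extraction steps: image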

urialon commented 2 years ago

Can you try running it on a smaller sample? E.g., one file? It might be loading all data to memory.
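One way to build such a smaller sample without loading the whole dataset is to stream the first few records into a new file. This sketch assumes the extractor input stores one record per line (for example, JSON lines), which may not match your data; the function name write_sample and the file names are hypothetical.

```python
from itertools import islice


def write_sample(src_path, dst_path, n_lines=100):
    """Copy the first n_lines lines of src_path to dst_path.

    Lines are streamed one at a time, so the full file is never held in memory.
    Returns the number of lines actually written.
    """
    count = 0
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in islice(src, n_lines):
            dst.write(line)
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical file names; replace them with your own paths.
    written = write_sample("full_dataset.json", "sample_dataset.json", n_lines=1)
    print(f"wrote {written} line(s)")
```

If the extractor finishes on the one-record sample but is killed on the full file, memory use growing with input size is the likely cause.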


urialon commented 2 years ago

Hi @Veeraraghavans , We just released a model that performs better than OpenAI's Codex for C.

https://arxiv.org/pdf/2202.13169.pdf https://github.com/VHellendoorn/Code-LMs

Best, Uri

Veeraraghavans commented 2 years ago

Hey @urialon,

Sorry for the late reply; I was held up with another task. Yes, I could run code2seq training with your complete dataset. It has just finished epoch 1. I will update once it is completed.

Thanks for sharing the model for C code.

It would also help if you could confirm my understanding: do the "Python150kExtractor" results give a model that can be used on Python code?

Regards, Veeraraghavan

urialon commented 2 years ago

Hi @Veeraraghavans , Yes! The "Python150kExtractor" can be used to process Python data, which can then be used to train a model that can perform predictions over Python code.

Best, Uri

Veeraraghavans commented 2 years ago

Thank you @urialon for your quick update. I will reach out to you if I run into any further issues.

Veeraraghavans commented 2 years ago

Hey @urialon, how many epochs does the default config run? It has been 50 epochs and the program is still running. It also saves a model for each epoch; do we get one single model at the end?

Regards,

Veeraraghavan

urialon commented 2 years ago

See here: https://github.com/tech-srl/code2vec#notes Let me know if you find anything to be incorrect.

Best, Uri