Closed Veeraraghavans closed 2 years ago
Hi @Veeraraghavans , Thank you for your interest in our work and for your kind words!
I am guessing that you need to run the preprocess.sh script on the outputs of ASTMiner, starting from this line: https://github.com/tech-srl/code2vec/blob/master/preprocess.sh#L50
However, I am not able to support ASTMiner any further; it was not created by me and I cannot control its outputs.
You can run preprocess.sh and set {TRAIN,TEST_VAL}_DIR to JavaExtractor/ (the code directory of our Java extractor) and compare the results, including the intermediate files.
Let me know how it went! Best, Uri
Thanks @urialon for your reply. I will try testing out and update the details here
I have executed the preprocess.sh file after placing the data files in the structure the script expects (train, val, test). I don't see any histogram files created with values (the .c2v file is empty), and because of that the total is zero.
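Data directory with data file: [image] https://user-images.githubusercontent.com/5748948/150507808-571dc2e8-2ced-48fc-b751-ca66c96e365d.png
Preview of Data: [image] https://user-images.githubusercontent.com/5748948/150507762-50c710ba-a82f-4405-9bad-6dc2235bc4b1.png
Output of execution: [image] https://user-images.githubusercontent.com/5748948/150506608-36356d55-9032-4df0-89e0-452cce8362aa.png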
Hello Veeraraghavans, Thank you for opening this issue. I have a question about ASTMiner and Code2vec. I have an OpenMP dataset written in C, and I need to convert the code into ASTs and decompose them to get the vectors. I used the Clang tool to get the AST; however, I have no idea how to convert the AST to tokens.
Sorry for bringing this up!
@ma0889 It's fine, as we have some code in C as well. It will be helpful for us too.
Hi @Veeraraghavan ,
I cannot really tell the problem, but the histograms are created in the following lines:
https://github.com/tech-srl/code2vec/blob/master/preprocess.sh#L55-L58
Can you run the "preprocess.sh" script line-by-line and see why these histograms are not created?
Best, Uri
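For anyone debugging the same symptom: a rough way to see what the histogram step should produce is to count the entries yourself. The sketch below is not the code from preprocess.sh; it assumes the raw extractor output has one example per line in the form `<target> <token>,<path>,<token> ...` (space-separated contexts, comma-separated fields), and the function names are made up for illustration. If the raw file is missing or empty, all three histograms come out empty, which would match the "total is zero" symptom.

```python
from collections import Counter
from pathlib import Path


def build_histograms(raw_file):
    """Count targets, tokens, and paths in a raw path-context file.

    Assumed line format (an assumption, not taken from preprocess.sh):
    "<target> <tok1>,<path>,<tok2> <tok1>,<path>,<tok2> ..."
    """
    targets, tokens, paths = Counter(), Counter(), Counter()
    with open(raw_file, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            targets[parts[0]] += 1
            for context in parts[1:]:
                fields = context.split(",")
                if len(fields) != 3:
                    continue  # malformed context; ignored in this sketch
                left, path, right = fields
                tokens[left] += 1
                tokens[right] += 1
                paths[path] += 1
    return targets, tokens, paths


def report(raw_file):
    """Summarize a raw file, flagging the missing-or-empty case explicitly."""
    path = Path(raw_file)
    if not path.exists() or path.stat().st_size == 0:
        return f"{raw_file}: missing or empty, so every histogram would be empty"
    targets, tokens, paths = build_histograms(raw_file)
    return f"{raw_file}: {len(targets)} targets, {len(tokens)} tokens, {len(paths)} paths"
```

Calling `report()` on each intermediate file while stepping through the script line by line should show at which stage the data disappears.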
@urialon Sure, I will try executing it line by line and update.
@ma0889 I am working predominantly with Python. I will try with C files if possible and update back here.
Sorry for the late reply.
@urialon Can you share the directory structure that preprocess.sh is looking for? I am unclear on it, and I think that may be why the histogram file is not created.
Hi @Veeraraghavans , The snippet that you pasted exactly describes the expected structure. I'm not sure how to explain it better than that snippet.
You can download another dataset such as Java-small here: https://github.com/tech-srl/code2vec#java-small-compressed-366mb-extracted-19gb and examine the structure.
Let me know if anything is still unclear. Best, Uri
@urialon Thanks, I now understand the directory structure used by the code. The histogram is still not created. I am checking on it and will update you soon.
Also, how is Code2seq different from Code2Vec? Code2seq has a Python extractor; would that help me replace ASTMiner?
Code2seq is much stronger, and yes, you can check its python extractor. I didn't use it myself, it was contributed, but I checked it and it looked good.
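I was trying to run the Python extractor (https://github.com/tech-srl/code2seq/tree/master/Python150kExtractor). On my local machine it gets killed after a few minutes, similar to this issue: https://github.com/tech-srl/code2seq/issues/106. On my VM it ran for a few hours and then got killed; I couldn't find any logs written.
Output when I run the extract steps: [image] https://user-images.githubusercontent.com/5748948/152518683-4bfe2d73-902a-4794-8109-34f45a35b967.png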
Can you try running it on a smaller sample? E.g., one file? It might be loading all data to memory.
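One way to build such a smaller sample, as a minimal sketch: copy only the first few lines of the input into a new file and point the extractor at that. This assumes the input is line-oriented, with one example per line; `write_sample` and its parameters are hypothetical names, not part of the extractor.

```python
from itertools import islice


def write_sample(src_path, dst_path, max_lines=1):
    """Copy the first `max_lines` lines of src_path to dst_path; return the count written."""
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # islice reads lazily, so the full source file is never loaded into memory.
        for line in islice(src, max_lines):
            dst.write(line)
            written += 1
    return written
```

If the run succeeds on one line and then on, say, a thousand, but is killed on the full file, that points toward memory exhaustion rather than a problem with the data format.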
Hi @Veeraraghavans , We just released a model that performs better than OpenAI's Codex for C.
https://arxiv.org/pdf/2202.13169.pdf https://github.com/VHellendoorn/Code-LMs
Best, Uri
Hey @urialon,
Sorry for the late reply; I was held up with another task. Yes, I was able to run code2seq training on your complete dataset. It has just finished epoch 1. I will update once it is completed.
Thanks for sharing the model for C codes.
It would also be helpful if you could confirm my understanding: does "Python150kExtractor" result in a model that can be used on Python code?
Regards, Veeraraghavan
Hi @Veeraraghavans , Yes! The "Python150kExtractor" can be used to process Python data, which can then be used to train a model that can perform predictions over Python code.
Best, Uri
Thank you @urialon for your quick update. I will reach out to you if I run into any further issues.
Hey @urialon, how many epochs does the default config run? It has been 50 epochs and the program is still running. It also creates a model for each epoch; do we get one single model at the end?
Regards,
Veeraraghavan
See here: https://github.com/tech-srl/code2vec#notes Let me know if you find anything to be incorrect.
Best, Uri
Hello everyone,
I really appreciate the Code2vec work. I am trying to create an AI model that understands code functionality so that we can map it to our issue platform. I came across your solution, which is quite useful for us. Since most of our code is in Python and Go, I am unable to use your code directly. So I used ASTMiner to convert the code into AST graphs, to pass into your model and get vectors.
The ASTMiner path_context output is shown below.
Here are my questions:
python3 code2vec.py --data <Output_Pathcontext_ASTMiner>
but got an error asking for a dictionary file. Please correct me if this is wrong. Thanks in advance to your team. Sorry if my question is naive or incomplete; I am available to clarify anything that is unclear.
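A quick check that may explain this kind of error, sketched under assumptions: as far as I can tell from the code2vec README, `--data` takes a path prefix rather than a raw file, and the preprocessing step is what writes the files sharing that prefix. The suffixes below (`.dict.c2v`, `.train.c2v`) are my reading of that convention and should be verified against the repository; `missing_dataset_files` is a hypothetical helper.

```python
from pathlib import Path


def missing_dataset_files(data_prefix, suffixes=(".dict.c2v", ".train.c2v")):
    """Return the expected files that are absent or empty for a given --data prefix.

    The default suffixes are an assumption about code2vec's naming convention.
    """
    missing = []
    for suffix in suffixes:
        candidate = Path(f"{data_prefix}{suffix}")
        if not candidate.is_file() or candidate.stat().st_size == 0:
            missing.append(str(candidate))
    return missing
```

If this returns a non-empty list for the prefix passed to `--data`, the raw ASTMiner output most likely still needs to go through the preprocessing step first.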