Application to real case study

tech-srl / code2vec

TensorFlow code for the neural network presented in the paper: "code2vec: Learning Distributed Representations of Code"

https://code2vec.org

MIT License

Application to real case study #176

Open Avv22 opened 1 year ago

Avv22 commented 1 year ago

Hello Code2Vec team,

Could you please give some hints on how to apply your tool to a whole software system whose code is written in several programming languages?

urialon commented 1 year ago

Hey @Avv22, thank you for your interest in our work!

Our repository supports only Java and C#. We have a newer model, PolyCoder, that supports all languages. Loading it takes only a few lines of code using the Hugging Face Transformers library. See:

https://arxiv.org/pdf/2202.13169.pdf
https://github.com/VHellendoorn/Code-LMs#october-2022---polycoder-is-available-on-huggingface

Best, Uri

Avv22 commented 1 year ago

Thank you. What I mean is: if we have a large software system written in Java, what strategy would you use to decompose it and feed it to your tool?

urialon commented 1 year ago

Sorry, I don't understand your question. What is your goal? What are you trying to do?

Avv22 commented 1 year ago

Thank you for your quick reply. I just meant: if we have a complete system, how can we decompose it and pass it to your model so that it predicts the names of the blocks inside the system? Do you suggest decomposing the system method-wise and then trying to predict a name for each method?

I was trying to use your tool to generate a script (a name that tells what the software does).

urialon commented 1 year ago

Do you suggest decomposing the system method-wise and then try to predict a name for each method?

Yes, this is basically what our preprocessing pipeline does automatically.
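
To make "method-wise decomposition with a name per method" concrete, here is a minimal sketch in Python. It is not the code2vec preprocessing pipeline: it uses Python's standard `ast` module on Python source, and the helper names `split_subtokens` and `method_examples`, as well as the `|`-joined label format, are assumptions chosen for illustration.

```python
import ast
import re


def split_subtokens(name):
    """Split a snake_case or camelCase identifier into lowercase subtokens."""
    parts = []
    for chunk in name.split("_"):
        # Acronym runs (HTTP), capitalized or lowercase words, and digit runs.
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [p.lower() for p in parts]


def method_examples(source):
    """Return one (label, method_source) pair per function or method in `source`.

    The label is the method name split into subtokens (hypothetical format).
    A real pipeline would also hide the name from the model's input.
    """
    tree = ast.parse(source)
    examples = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            label = "|".join(split_subtokens(node.name))
            examples.append((label, ast.get_source_segment(source, node)))
    return examples
```

Each pair is one training or prediction example: the label is what the model should predict, and the method source is what it predicts from.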

Avv22 commented 1 year ago

Thank you very much. You split the code by method, but could you please point to where you do that in your code? Did you implement it yourself, or did you use a tool for it?

urialon commented 1 year ago

First, our code goes through all files in the directory: https://github.com/tech-srl/code2vec/blob/master/JavaExtractor/JPredict/src/main/java/JavaExtractor/App.java#L43-L47

Then, I used JavaParser to parse each file in the project, traverse the resulting AST, and extract the "method nodes": https://github.com/tech-srl/code2vec/blob/master/JavaExtractor/JPredict/src/main/java/JavaExtractor/FeatureExtractor.java#L39-L49

But that's a very Java-specific pipeline; I wouldn't use the same code for JavaScript.

Avv22 commented 1 year ago

@urialon, thank you, I appreciate it. Could you recommend a similar parser for Python that can extract the method nodes, if possible?

urialon commented 1 year ago

Yes, our newer project, code2seq, has a Python extractor, and the model itself is also much better.

Best, Uri

Avv22 commented 1 year ago

Thanks. Does the Python extractor you developed work similarly to JavaParser, by extracting method nodes from the AST of the Python source code?

urialon commented 1 year ago

It was contributed by the community, so it might be a little different. I think that by default it was designed to process a specific dataset. However, its logic is the same, and its output fits the code2seq model.

Best, Uri
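
For readers who want to see the shared logic in Python (go through all files in a directory, parse each one, collect the method nodes), here is a minimal sketch. It is not the code2seq Python extractor, which may work differently; it relies only on the standard `ast` and `pathlib` modules, and the function name `iter_function_nodes` is hypothetical.

```python
import ast
from pathlib import Path


def iter_function_nodes(root):
    """Yield (path, node) for every function or method under `root`.

    Files that cannot be decoded or parsed are skipped rather than aborting
    the whole run, since real projects often contain such files.
    """
    for path in sorted(Path(root).rglob("*.py")):
        try:
            source = path.read_text(encoding="utf-8")
            tree = ast.parse(source, filename=str(path))
        except (SyntaxError, ValueError):
            # ValueError also covers UnicodeDecodeError.
            continue
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                yield path, node


if __name__ == "__main__":
    # List every function found under the current directory.
    for path, node in iter_function_nodes("."):
        print(f"{path}:{node.lineno} {node.name}")
```

The output of a step like this would still need to be converted into whatever input format the chosen model expects.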