tech-srl / code2seq

Code for the model presented in the paper: "code2seq: Generating Sequences from Structured Representations of Code"
http://code2seq.org
MIT License

Exporting code vectors #126

Closed colebuckleyy closed 2 years ago

colebuckleyy commented 2 years ago

Hello, I would like to use this repository in a similar manner to code2vec, exporting code vectors that represent source code files. I'm doing this for C source code and am running into some issues with the parser I'm using for code2vec, but the parser linked to code2seq works flawlessly. Is there any way I can export code vectors the way code2vec does?

urialon commented 2 years ago

Hi @colebuckleyy, thank you for your interest in our work!

Yes, it's definitely possible, but let me verify that I understand:

  1. What do you mean by "the parser linked to code2seq"? We have no parser for C in this repository.
  2. Did you train a model for C? The models that we released here are for Java.
  3. In another project, we recently released a multi-lingual model called PolyCoder (paper: https://arxiv.org/pdf/2202.13169.pdf and code: https://github.com/VHellendoorn/Code-LMs ). In C, we even managed to get better perplexity than OpenAI's Codex, so you might want to check it out as well.

Best, Uri

colebuckleyy commented 2 years ago

I apologize, I meant that the extractor for C++ by kolkir seems to work with C source code and works exactly how I need it to. How could I go about training a model for C? Could I use PolyCoder to do so?

urialon commented 2 years ago

OK great, so now you can run the preprocessing step, maybe starting only from this line: https://github.com/tech-srl/code2seq/blob/master/preprocess.sh#L54 (because the previous lines use the Java extractor). Then train a model using these instructions: https://github.com/Kolkir/code2seq#training-a-model-from-scratch and point the training scripts to your preprocessed data.

Regardless, PolyCoder is already trained, so you can use it out of the box. The main difference, though, is that PolyCoder is a left-to-right code generation model, so it can provide a vector for every token, rather than a single vector for the entire snippet.
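To illustrate that difference: if a single vector per snippet is still needed from a per-token model, one common option is to mean-pool the token vectors. This pooling step is not something prescribed in this thread; what follows is a minimal sketch, assuming the per-token vectors have already been obtained as plain lists of floats. The function name `mean_pool` and the numbers are hypothetical.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token vectors into one fixed-size snippet vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(v) != dim for v in token_vectors):
        raise ValueError("all token vectors must have the same dimension")
    n = len(token_vectors)
    # Average each dimension independently across all tokens.
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]


# Hypothetical per-token vectors for a three-token snippet (made-up numbers).
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
snippet_vector = mean_pool(token_vectors)
print(snippet_vector)  # [3.0, 4.0]
```

Mean pooling discards token order, so whether the result is a useful snippet representation depends on the downstream task.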

Let me know if you have any further questions. Best, Uri

colebuckleyy commented 2 years ago

OK, so after I run train.sh and finish training, how can I get the code vectors? Sorry for the late response.

urialon commented 2 years ago

It is currently not implemented here, only in code2vec (see https://github.com/tech-srl/code2vec#exporting-the-trained-token-vectors-and-target-vectors ). It can be implemented here as well by following the same code2vec pipeline.

This contexts_average tensor: https://github.com/tech-srl/code2seq/blob/master/model.py#L414 contains a vector for every example. It needs to be returned from the function that builds it and from the functions that call it, and added to the arguments of this sess.run(...) call: https://github.com/tech-srl/code2seq/blob/master/model.py#L614

You can see an example by following the code_vectors object in the code2vec pipeline: https://github.com/tech-srl/code2vec/blob/master/tensorflow_model.py#L292
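The plumbing described above can be sketched in plain Python, without TensorFlow. This is only an illustration of the pattern, not the repository's code. The function names `decode_outputs` and `evaluate`, the placeholder predictions, and the one-line-per-example, space-separated output format are all assumptions. It also assumes, as the tensor's name suggests, that the per-example vector is an average over that example's context vectors, and it ignores the padding that a real batch would contain.

```python
import io
from typing import List, TextIO, Tuple

Vector = List[float]


def decode_outputs(batch: List[List[Vector]]) -> Tuple[List[str], List[Vector]]:
    """Hypothetical stand-in for the graph-building function.

    Besides the predictions, it now also returns one averaged context
    vector per example (the role played by contexts_average).
    """
    predictions = ["<prediction %d>" % i for i in range(len(batch))]
    contexts_average = [
        # zip(*contexts) iterates over dimensions; average each one.
        [sum(column) / len(contexts) for column in zip(*contexts)]
        for contexts in batch
    ]
    return predictions, contexts_average


def evaluate(batches: List[List[List[Vector]]], vectors_out: TextIO) -> List[str]:
    """Hypothetical stand-in for the evaluation loop.

    The vectors are fetched together with the predictions (the role of the
    extra sess.run argument) and written out, one line per example.
    """
    all_predictions: List[str] = []
    for batch in batches:
        predictions, code_vectors = decode_outputs(batch)
        all_predictions.extend(predictions)
        for vector in code_vectors:
            vectors_out.write(" ".join(str(x) for x in vector) + "\n")
    return all_predictions


# One batch with two examples: two contexts and one context (made-up numbers).
batches = [[[[1.0, 3.0], [3.0, 5.0]], [[0.5, 0.5]]]]
out = io.StringIO()
evaluate(batches, out)
print(out.getvalue(), end="")
```

In the actual model, the equivalent change would mean returning the tensor up through each calling function and adding it to the list of values fetched during evaluation, as the comment above describes.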

Best, Uri

colebuckleyy commented 2 years ago

Thank you, I really appreciate it.