tech-srl / code2seq

Code for the model presented in the paper: "code2seq: Generating Sequences from Structured Representations of Code"
http://code2seq.org
MIT License

Code Search using Code2Seq on the CodeSearchNet Python dataset #84

Closed aishwariyarao217 closed 3 years ago

aishwariyarao217 commented 3 years ago

Hi, I'm going to be using the CodeSearchNet Python dataset for the task of code search. The input data is in JSON Lines format, and I was wondering whether you could provide some guidelines on converting the CodeSearchNet dataset into the correct format for training code2seq. I came across https://github.com/tech-srl/code2seq/issues/41 and was wondering whether I can modify those scripts to use Python code instead.

urialon commented 3 years ago

Hi @aishwariyarao217 , This will probably require a few steps.

First, you will need a Python extractor that converts Python code into our format. See possible options here: https://github.com/tech-srl/code2seq#extending-to-other-languages
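To make the expected format concrete: the real extractors linked above (such as the JavaExtractor in this repo) also do subtoken splitting, path hashing, and context sampling, but the core idea is to pair leaf tokens of the AST and record the chain of node types between them. A deliberately simplified sketch over Python's standard `ast` module might look like this (function names and the exact output format here are my own simplification, not the official extractor's):

```python
import ast
from itertools import combinations

def _leaves(tree):
    """Collect (token, root-to-leaf node list) pairs for Name/Constant leaves."""
    found = []
    def walk(node, path):
        path = path + [node]
        if isinstance(node, ast.Name):
            found.append((node.id, path))
        elif isinstance(node, ast.Constant):
            found.append((str(node.value), path))
        for child in ast.iter_child_nodes(node):
            walk(child, path)
    walk(tree, [])
    return found

def _path_between(path_a, path_b):
    """Node-type path: up from leaf a to the lowest common ancestor, then down to leaf b."""
    i = 0
    while i < min(len(path_a), len(path_b)) and path_a[i] is path_b[i]:
        i += 1
    lca = i - 1
    up = [type(n).__name__ for n in reversed(path_a[lca:])]
    down = [type(n).__name__ for n in path_b[lca + 1:]]
    return "|".join(up + down)

def extract_contexts(code, max_contexts=200):
    """Return code2seq-style contexts: 'token_a,Node|Type|Path,token_b'."""
    leaves = _leaves(ast.parse(code))
    return [f"{a},{_path_between(pa, pb)},{b}"
            for (a, pa), (b, pb) in combinations(leaves, 2)][:max_contexts]
```

For example, `extract_contexts("def add(a, b):\n    return a + b")` yields the single context `a,Name|BinOp|Name,b`, pairing the two leaves through their common `BinOp` parent.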

Second, you will probably need to modify that extractor to read the format of the CodeSearchNet dataset.
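The reading side is the easier part: each line of a CodeSearchNet `.jsonl` file is one JSON record, so the extractor just needs a loop that parses a line at a time and pulls out the function source and its name. A minimal sketch, assuming the records carry `func_name` and `code` fields as in the released CodeSearchNet archives (check your copy of the dataset for the exact field names):

```python
import json

def iter_examples(path):
    """Yield (function_name, source_code) pairs from a CodeSearchNet .jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines defensively
            record = json.loads(line)
            # field names assumed from the CodeSearchNet jsonl schema
            yield record["func_name"], record["code"]
```

Each yielded pair can then be fed to the Python extractor to produce one training example per function.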

Best, Uri

aishwariyarao217 commented 3 years ago

Thank you! Is there any way code2seq can be modified to output the code vector and the code caption as embeddings instead of an actual sentence? For the code search task, the code vectors and the caption vectors need to lie in the same space. Do you have any suggestions on how to go about this?

urialon commented 3 years ago

Hi @aishwariyarao217 , It will require some coding, because currently the caption is "decoded", not "encoded".

To address code search, you might want to encode the code snippet as it is now, and take the context_average as the representation of the code. Simultaneously, you can train another LSTM encoder to encode the caption, and encourage the "code vector" to be close to the "caption vector" using an appropriate loss.
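One common choice of "appropriate loss" for this dual-encoder setup is an in-batch contrastive (InfoNCE-style) loss: each code vector should score highest against its own caption vector and lower against every other caption in the batch. As a framework-agnostic sketch of just the loss (a NumPy illustration of the idea, not code from this repo; in practice you would express it in your training framework so gradients flow to both encoders):

```python
import numpy as np

def contrastive_loss(code_vecs, caption_vecs, temperature=0.07):
    """In-batch InfoNCE: row i of code_vecs should match row i of caption_vecs."""
    c = code_vecs / np.linalg.norm(code_vecs, axis=1, keepdims=True)
    t = caption_vecs / np.linalg.norm(caption_vecs, axis=1, keepdims=True)
    logits = c @ t.T / temperature               # (B, B) scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # matching pairs lie on the diagonal
```

Matched code/caption pairs should give a lower loss than mismatched ones, which is exactly what drives the two embedding spaces together during training.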

Best, Uri

aishwariyarao217 commented 3 years ago

Great, I will try it. Thank you so much for your time @urialon! I really appreciate it.