Closed aishwariyarao217 closed 3 years ago
Hi @aishwariyarao217 , This will probably require a few steps.
First, you will need a Python extractor that converts Python code into our format. See possible options here: https://github.com/tech-srl/code2seq#extending-to-other-languages
Second, you will probably need to modify that extractor to read the format of the CodeSearchNet dataset.
Best, Uri
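For the second step, a minimal sketch of reading the dataset might look like the following. It assumes each line of a CodeSearchNet file is one JSON object with `code` and `docstring` fields and that files may be gzipped; the function name `iter_codesearchnet` is hypothetical and is not part of code2seq. The resulting pairs would still have to go through a Python path extractor before code2seq can consume them.

```python
import gzip
import json
from typing import Iterator, Tuple


def iter_codesearchnet(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (code, docstring) pairs from a JSON Lines file.

    Assumption: every record is a JSON object with "code" and
    "docstring" keys; records missing either one are skipped.
    """
    # Pick the opener based on the extension (.jsonl.gz vs .jsonl).
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            code = record.get("code", "")
            docstring = record.get("docstring", "")
            if code and docstring:
                yield code, docstring
```

Each yielded `code` string can then be written to its own `.py` file (or passed directly to the extractor), with the docstring kept alongside it as the caption.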
Thank you! Is there any way code2seq can be modified to output the code vectors and the code captions as embeddings instead of actual sentences? For the code search task, the code vectors and the caption vectors need to be in the same space. Do you have any suggestions for how to go about this?
Hi @aishwariyarao217 , It will require some coding, because currently the caption is "decoded", not "encoded".
To address code search, you might want to encode the code snippet as it is now, and take the context_average as the representation of the code.
Simultaneously, you can train another LSTM encoder to encode the caption, and encourage the "code vector" to be close to the "caption vector" using an appropriate loss.
Best, Uri
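The reply above leaves the "appropriate loss" open. One common choice, shown here purely as an assumption, is an in-batch softmax loss over cosine similarities: each caption vector should score highest against its own code vector and lower against the other code vectors in the batch. The sketch below uses plain Python lists so that it runs on its own; the function names and the `temperature` parameter are hypothetical, and in practice the same computation would be written in the framework used to train the encoders.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon avoids division by zero


def in_batch_softmax_loss(
    code_vecs: Sequence[Sequence[float]],
    caption_vecs: Sequence[Sequence[float]],
    temperature: float = 0.1,  # assumed value, not taken from code2seq
) -> float:
    """Mean cross-entropy where caption i should retrieve code i.

    All other code vectors in the batch act as negatives.
    """
    n = len(code_vecs)
    total = 0.0
    for i in range(n):
        logits = [
            cosine(caption_vecs[i], code_vecs[j]) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / n
```

Here `code_vecs` would hold the context_average vectors and `caption_vecs` the outputs of the separate caption encoder. Minimizing this loss pulls matching pairs together and pushes mismatched pairs apart, which is what places both kinds of vectors in the same space.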
Great, I will try it. Thank you so much for your time @urialon! I really appreciate it.
Hi, I'm going to be using the CodeSearchNet Python dataset for the task of code search. The input data is in JSON Lines format, and I was wondering if you could provide some guidelines on how to convert the CodeSearchNet dataset into the correct format for training code2seq. I came across https://github.com/tech-srl/code2seq/issues/41 and was wondering if I can modify the scripts there to handle Python code instead.