tech-srl / code2seq

Code for the model presented in the paper: "code2seq: Generating Sequences from Structured Representations of Code"
http://code2seq.org
MIT License

Generating code documentation with code2seq #120

Closed: balysMorkunas closed this issue 2 years ago

balysMorkunas commented 2 years ago

Hi,

I am currently writing a bachelor's project in which I aim to test how inline code comments in the training dataset affect the performance of the code2seq model when generating code documentation. Could you briefly tell me what this model can do for automatic code comment generation, and how one would set that up? I am very thankful for your time.

edit: I can see that issue #34 has related information, I will of course follow it for now, but if you have any additional tips it would be much appreciated!

urialon commented 2 years ago

Hi @balysMorkunas, thank you for your interest in code2seq!

I believe that the paper may give a better intuition than what I can describe in brief: https://openreview.net/pdf?id=H1gKYo09tX

Let me know if you have any specific questions!

Uri

balysMorkunas commented 2 years ago

Thank you, I'll contact again if any serious questions arise!

balysMorkunas commented 2 years ago

Hi again,

I started looking at how to retrain the model and preprocess the dataset for documentation generation. I followed your suggestion in issue #34, where you suggest changing JavaExtractor to output documentation instead of method names.

Could you please elaborate or give an example of how to do that? Do you by chance mean to use node.getJavaDoc() instead of node.getName()? What other changes should I be aware of?

Thank you very much for your time and effort, I really appreciate it, Balys.

bacevicius commented 2 years ago

Hi @urialon, I am in a very similar situation to @balysMorkunas and would also like to hear your input about this question. Thank you for your time!

urialon commented 2 years ago

Hi @balysMorkunas and @bacevicius, thank you for your interest in code2seq!

> Do you by chance mean to use node.getJavaDoc() instead of node.getName()

Basically yes!

Another option, if you wish to train on an existing dataset, is to have the extractor output a unique ID in place of the method name, and then replace each unique ID with the corresponding documentation later. See also https://github.com/tech-srl/code2seq/issues/45 for additional scripts and hyperparameters.

Best, Uri
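The unique-ID approach described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the repository: it assumes each preprocessed line has the form `target context context ...` with the target's subtokens joined by `|`, and the names `doc_to_target`, `replace_ids`, `id_to_doc`, the `m42`-style IDs, and the 30-token cap are all hypothetical.

```python
import re


def doc_to_target(doc, max_tokens=30):
    """Lowercase a documentation string and join its word tokens with '|'.

    The '|' separator and the token cap are assumptions about the target
    format; adjust them to whatever your preprocessing pipeline expects.
    """
    tokens = re.findall(r"[a-z0-9]+", doc.lower())
    return "|".join(tokens[:max_tokens])


def replace_ids(lines, id_to_doc):
    """Yield lines whose first field (a unique ID) is swapped for the doc target.

    Lines whose ID has no documentation, or whose documentation contains
    no usable tokens, are skipped.
    """
    for line in lines:
        unique_id, _, contexts = line.rstrip("\n").partition(" ")
        doc = id_to_doc.get(unique_id)
        if not doc:
            continue
        target = doc_to_target(doc)
        if target:
            yield f"{target} {contexts}"


# Hypothetical example data: two extracted methods, only one has a doc comment.
example_lines = ["m42 a,P1,b c,P2,d\n", "m99 x,P3,y\n"]
example_docs = {"m42": "Returns the user name."}
rewritten = list(replace_ids(example_lines, example_docs))
```

Here `rewritten` keeps only the method that has documentation, with its ID replaced by the tokenized doc string. Whether to drop or keep undocumented methods is a design choice that depends on the experiment.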

balysMorkunas commented 2 years ago

Thanks for your answer @urialon !

Do you think that the hyperparameters config.SUBTOKENS_VOCAB_MAX_SIZE = 190000 and config.TARGET_VOCAB_MAX_SIZE = 27000 are enough for documentation generation, or should they be increased? Is there anything else to watch out for regarding the hyperparameters, such as max_code_len and min_code_len in JavaExtractor?

Thank you very much for your time, Balys.

urialon commented 2 years ago

Hi @balysMorkunas, sorry for the delayed response.

These hyperparameters look OK to me, but the right values depend on the exact dataset and can never really be known in advance. max_code_len and min_code_len refer to the size of the functions that you consider, so they depend on the dataset you are working with.

Best, Uri
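Since the right vocabulary size depends on the dataset, one way to get a rough feel for a value such as TARGET_VOCAB_MAX_SIZE is to measure how many target token occurrences the most frequent tokens would cover. The sketch below is an assumption-laden illustration, not part of code2seq: it assumes targets are `|`-joined token strings, and `target_vocab_coverage` is a hypothetical helper name.

```python
from collections import Counter


def target_vocab_coverage(targets, max_size):
    """Return (distinct_tokens, coverage) for a list of '|'-joined targets.

    coverage is the fraction of all token occurrences that belong to the
    max_size most frequent tokens.
    """
    counts = Counter(tok for t in targets for tok in t.split("|") if tok)
    total = sum(counts.values())
    if total == 0:
        return 0, 0.0
    covered = sum(c for _, c in counts.most_common(max_size))
    return len(counts), covered / total


# Hypothetical documentation targets.
example_targets = ["returns|the|name", "sets|the|name", "returns|the|id"]
distinct, coverage = target_vocab_coverage(example_targets, max_size=2)
```

If coverage at the configured size is already close to 1.0 on your training targets, increasing the limit is unlikely to change much; if it is noticeably lower, a larger vocabulary may be worth trying.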

balysMorkunas commented 2 years ago

Thanks for your help!