Hello!
May I ask you about applying your approach to the code captioning task?
You wrote in the paper that you used CodeNN's dataset and achieved a ~23 BLEU score. Am I right that the training part of the dataset contains 52812 code snippets? How many epochs did it take to reach that accuracy?
Hi Nataly, Sure. Yes, the number of training examples is indeed around 53000, and it took us 31 epochs.
I must warn you that this dataset was very difficult to train the model on, and I recommend not using it. It is small and very noisy. It served us to compare our model to CodeNN, but even though our model performed better than the baselines, the trained model wasn't really as "good" and useful as the one trained on the Java method-names dataset (presented in the code2seq.org demo).
If you are looking for good code captioning datasets, I suggest checking the following:
- The CONCODE dataset
- This one from NAACL'19
- This MSR one from ICLR'19
Maybe @sriniiyer knows - what is currently the best code-to-NL dataset? Is it CONCODE?
In any case, here are the parameters that we used on the CodeNN dataset:

```python
config.WORDS_MIN_COUNT = 20                # min occurrences for a source subtoken
config.TARGET_WORDS_MIN_COUNT = 2          # min occurrences for a target word
config.EMBEDDINGS_SIZE = 128 * 4           # i.e., 512
config.RNN_SIZE = 128 * 4                  # i.e., 512
config.DECODER_SIZE = 512
config.NUM_EXAMPLES = 52299
config.MAX_TARGET_PARTS = 37               # max length of the target sequence
config.EMBEDDINGS_DROPOUT_KEEP_PROB = 0.3
config.RNN_DROPOUT_KEEP_PROB = 0.75
```
Umm, those are good pointers. I'm not aware of other datasets. You can do it on CONCODE, but there isn't a test set with multiple references. Although it would be interesting to see if external class information from CONCODE helps at all in captioning.
Thanks a lot, @urialon for your answer! And thank you for the really quick response!
I am trying to apply code2seq to the task of generating commit messages, so I collected my own dataset from Git. Thanks for the warning about the other datasets. I saw issue https://github.com/tech-srl/code2seq/issues/15, where you said that it is always good to increase the amount of training data. I've increased my dataset several times; training now takes about 4 hours per epoch, so I decided to ask about the amount of data you used.
Also, I could not find the following fields in the `Config` class: `WORDS_MIN_COUNT`, `TARGET_WORDS_MIN_COUNT`, `NUM_EXAMPLES`. Did you use those fields just to describe the dataset, or are they actually used in the code?
Ah, you're right. `WORDS_MIN_COUNT` and `TARGET_WORDS_MIN_COUNT` were replaced by `SUBTOKENS_VOCAB_MAX_SIZE` and `TARGET_VOCAB_MAX_SIZE`.
I.e., instead of specifying the minimal count for inserting a word into the vocabulary, we currently specify the overall vocabulary size and take only the top-occurring words.
I don't have concrete guidelines here, except: limit the vocabulary size such that only words appearing just 1-2 times in the training data are left out, as in the sketch below.
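For concreteness, here is a minimal sketch of turning that min-count rule into a vocabulary-size cap. This is just an illustration, not code2seq's actual preprocessing; the `vocab_cap` helper and the toy corpus are made up:

```python
from collections import Counter

def vocab_cap(token_counts, min_count=3):
    """Size of a vocabulary that keeps every token occurring >= min_count times."""
    return sum(1 for count in token_counts.values() if count >= min_count)

# Toy corpus: count subtokens over the training data, then derive the cap,
# e.g. config.SUBTOKENS_VOCAB_MAX_SIZE = vocab_cap(counts).
counts = Counter(["get", "name", "get", "value", "get", "name", "value", "foo"])
print(vocab_cap(counts, min_count=2))  # -> 3 ("foo" appears once and is dropped)
```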
Regarding `NUM_EXAMPLES` - you can ignore it; it is counted automatically now.
I'm sorry to bother you again, but could you please tell me how the BLEU score changed during training?
Thanks for the help with the fields and vocabulary!
Hmmm, I can't find the original log, but if I remember correctly (on the validation set): after the first epoch BLEU was about 13-14, and it increased to ~20 within about 10 epochs. From epoch 10 to 30 it mostly fluctuated, with minor improvements every few epochs. But that's very specific to the CodeNN dataset.
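If you want to track the same curve on your own dataset, one option is corpus-level BLEU via NLTK after each epoch. A hedged sketch follows; code2seq's own evaluation scripts may compute BLEU differently, and the tokenized examples here are invented:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of reference token lists per example, and one hypothesis each.
references = [[["adds", "two", "numbers"]], [["returns", "the", "file", "name"]]]
hypotheses = [["adds", "numbers"], ["returns", "file", "name"]]

# Smoothing avoids zero scores when short outputs miss higher-order n-grams.
smooth = SmoothingFunction().method1
print(corpus_bleu(references, hypotheses, smoothing_function=smooth))
```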
Hi, I just wondered if you have any BLEU scores for the code2seq model trained on the FUNCOM dataset (the NAACL'19 one above).
Thanks for your help!
No, sorry, code2seq was published earlier.