tech-srl / code2seq

Code for the model presented in the paper: "code2seq: Generating Sequences from Structured Representations of Code"
http://code2seq.org
MIT License

Some of the train, test and valid files are removed after preprocess.sh #76

Closed rishab-32 closed 3 years ago

rishab-32 commented 3 years ago

Hi @urialon. I am running code2seq on a new dataset by following the instruction provided in the previous issues. In my train, test, and valid folders, I have 152924, 10112, and 4868 .java files respectively. However, after running the preprocess.sh file, the generated c2s files seem to miss some of the files from the original train/test/valid split and have (144398, 9461, and 4614 ) files in each c2s file.

Is there any possible change I need to make in the original preprocess.sh file? I would really appreciate your help.

Regards, Rishab

urialon commented 3 years ago

Hi @rishab-32 , First, the JavaExtractor creates a single example per method. Do you expect the number of examples to be equal to the number of files because you have a single method in every file?

Second, there are some thresholds that make the JavaExtractor skip methods or files. Is it possible that there were some very short methods that were skipped because of the flag --min_code_len? The condition is checked here: https://github.com/tech-srl/code2seq/blob/master/JavaExtractor/JPredict/src/main/java/JavaExtractor/Visitors/FunctionVisitor.java#L46

It is also possible that a method did not contain any paths, in which case it is skipped. This is checked here: https://github.com/tech-srl/code2seq/blob/master/JavaExtractor/JPredict/src/main/java/JavaExtractor/FeatureExtractor.java#L105

The last option I can think of is that the file could not be parsed. In that case, this line: https://github.com/tech-srl/code2seq/blob/master/JavaExtractor/JPredict/src/main/java/JavaExtractor/FeatureExtractor.java#L77 will throw an exception that is silenced.

Do you have a way to find an example for a missing file?
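If it helps, here is a minimal sketch (not part of the repo) of one way to look for them. It assumes the raw extracted format, where the first space-separated field of each line is the method name, subtokenized and joined with '|' (e.g. `get|name`), and it uses a crude regex heuristic to list method names in the original .java files; the paths at the bottom are hypothetical placeholders. Names that appear in the sources but not in the extracted file are candidates for skipped methods.

```python
import re
from collections import Counter
from pathlib import Path

# Crude heuristic for Java method declarations; good enough for spotting candidates.
METHOD_RE = re.compile(r'\b(?:public|private|protected|static|final|synchronized|\s)*'
                       r'[\w<>\[\]]+\s+(\w+)\s*\([^;{]*\)\s*\{')

def subtokenize(name: str) -> str:
    # Split camelCase / snake_case roughly the way the extractor does.
    parts = re.findall(r'[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])', name)
    return '|'.join(p.lower() for p in parts if p)

def names_in_sources(java_dir: str) -> Counter:
    names = Counter()
    for path in Path(java_dir).rglob('*.java'):
        text = path.read_text(errors='ignore')
        for name in METHOD_RE.findall(text):
            names[subtokenize(name)] += 1
    return names

def names_in_extracted(c2s_file: str) -> Counter:
    names = Counter()
    with open(c2s_file, errors='ignore') as f:
        for line in f:
            if line.strip():
                names[line.split(' ', 1)[0]] += 1  # first field is the target name
    return names

if __name__ == '__main__':
    source_names = names_in_sources('my_dataset/train')                        # hypothetical path
    extracted_names = names_in_extracted('data/my_dataset/my_dataset.train.c2s')  # hypothetical path
    missing = source_names - extracted_names
    print(f'{sum(missing.values())} methods present in sources but not extracted')
    for name, count in missing.most_common(20):
        print(name, count)
```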

Best, Uri

rishab-32 commented 3 years ago

Hi Uri, Thanks a lot for your help. Yes, I can extract some of the examples that are skipped during preprocessing. However, if I just go ahead with the reduced data and train on it, training runs fine for 36 epochs, but after that I get a nan loss for all further epochs. Is there anything I need to be careful about? Checking previous issues, it was mentioned that targets should not contain ","; I verified that my targets have no ",", but they do contain "<", "=", and "-". What confuses me is why training ran fine for the first 36 epochs.

rishab-32 commented 3 years ago

Continuing on this: training does not halt or throw any error; it just reports nan for the loss.

I am using the following parameters for training. For my dataset, the maximum length is 256 tokens for code and 64 for summaries, and the minimum length is 20 and 3 respectively. Also, I have seen that for this particular dataset I get better results with MAX_CONTEXTS = 350. The batch size is reduced to 128 to fit into GPU memory.

config = Config(args)
config.WORDS_MIN_COUNT = 20
config.TARGET_WORDS_MIN_COUNT = 3
config.NUM_EPOCHS = 3000
config.SAVE_EVERY_EPOCHS = 1
config.PATIENCE = 10
config.BATCH_SIZE = 128
config.TEST_BATCH_SIZE = 64
config.READER_NUM_PARALLEL_BATCHES = 1
config.SHUFFLE_BUFFER_SIZE = 10000
config.CSV_BUFFER_SIZE = 100 * 1024 * 1024  # 100 MB
config.MAX_CONTEXTS = 350
config.SUBTOKENS_VOCAB_MAX_SIZE = 190000
config.TARGET_VOCAB_MAX_SIZE = 50000
config.EMBEDDINGS_SIZE = 512
config.RNN_SIZE = 128 * 4  # Two LSTMs to embed paths, each of size 128
config.DECODER_SIZE = 520
config.NUM_DECODER_LAYERS = 2
config.MAX_PATH_LENGTH = 8 + 1
config.MAX_NAME_PARTS = 5
config.MAX_TARGET_PARTS = 64
config.EMBEDDINGS_DROPOUT_KEEP_PROB = 0.75
config.RNN_DROPOUT_KEEP_PROB = 0.5
config.BIRNN = True
config.RANDOM_CONTEXTS = True
config.BEAM_WIDTH = 0
config.USE_MOMENTUM = True

However, the generated summaries are still empty or very short.

urialon commented 3 years ago

Hi @rishab-32 , The nan loss after 36 epochs sounds like an optimization issue. Try switching to USE_MOMENTUM = False, i.e., training with Adam instead.
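For reference, a minimal sketch of the change, together with a generic fail-fast check; the helper below is hypothetical and not part of code2seq's actual training loop:

```python
import math

# The suggested fix is a one-line config change:
# config.USE_MOMENTUM = False  # train with Adam instead of SGD with momentum

# Hypothetical helper (not from the repo): call it on each reported loss value
# to stop as soon as the loss becomes non-finite, so the offending epoch can be
# inspected instead of silently continuing to train with nan.
def check_finite(loss_value: float, epoch: int) -> None:
    if not math.isfinite(loss_value):
        raise RuntimeError(f'non-finite loss at epoch {epoch}: {loss_value}')
```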

rishab-32 commented 3 years ago

Hi @urialon, Thanks for all your help. I have got the results for code2seq; however, they seem to be significantly worse than those of a simple GRU-based seq2seq model (code2seq: 8 BLEU-1 vs. seq2seq: 24), most probably due to the parameters used. Also, the summaries generated by code2seq seem to be shorter than the seq2seq model's. The embedding size is set to the same value (512) for both models. Do you have any suggestions for improving the performance of code2seq? I am using the parameters below. The dataset has a minimum of 20 tokens per code snippet and 3 per summary, and a maximum of 256 tokens for code and 64 for summaries.

config.WORDS_MIN_COUNT = 20
config.TARGET_WORDS_MIN_COUNT = 3
config.NUM_EPOCHS = 3000
config.SAVE_EVERY_EPOCHS = 1
config.PATIENCE = 10
config.BATCH_SIZE = 128
config.TEST_BATCH_SIZE = 64
config.READER_NUM_PARALLEL_BATCHES = 1
config.SHUFFLE_BUFFER_SIZE = 10000
config.CSV_BUFFER_SIZE = 100 * 1024 * 1024  # 100 MB
config.MAX_CONTEXTS = 350
config.SUBTOKENS_VOCAB_MAX_SIZE = 190000
config.TARGET_VOCAB_MAX_SIZE = 50000
config.EMBEDDINGS_SIZE = 512
config.RNN_SIZE = 128 * 4  # Two LSTMs to embed paths, each of size 128
config.DECODER_SIZE = 520
config.NUM_DECODER_LAYERS = 2
config.MAX_PATH_LENGTH = 8 + 1
config.MAX_NAME_PARTS = 5
config.MAX_TARGET_PARTS = 64
config.EMBEDDINGS_DROPOUT_KEEP_PROB = 0.75
config.RNN_DROPOUT_KEEP_PROB = 0.5
config.BIRNN = True
config.RANDOM_CONTEXTS = True
config.BEAM_WIDTH = 0
config.USE_MOMENTUM = True
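As a side note on comparing the two models: differences in tokenization or smoothing between evaluation scripts can shift BLEU by several points, so it is worth scoring both models with the same script. A small sketch (not from the repo; file names are hypothetical) of computing BLEU-1 with NLTK:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def read_lines(path):
    # One summary per line, whitespace-tokenized and lowercased.
    with open(path) as f:
        return [line.strip().lower().split() for line in f]

references = [[ref] for ref in read_lines('test.target.txt')]       # one reference per example
hypotheses = read_lines('code2seq.predictions.txt')                  # model predictions, same order

bleu1 = corpus_bleu(references, hypotheses,
                    weights=(1.0, 0.0, 0.0, 0.0),                    # unigram precision only
                    smoothing_function=SmoothingFunction().method1)
print(f'BLEU-1: {100 * bleu1:.2f}')
```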

urialon commented 3 years ago

Hi @rishab-32 , I don't know. I don't see anything unusual in the hyperparameters. Maybe if you plot the training loss vs. the validation accuracy something will pop up.

Was there a difference after 36 epochs in validation accuracy when you used Adam?
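A minimal sketch of the kind of plot meant here; the per-epoch values are placeholders that would be copied from the training log:

```python
import matplotlib.pyplot as plt

train_loss = [3.1, 2.4, 2.0, 1.8, 1.7]          # placeholder values from the log
val_accuracy = [0.10, 0.18, 0.22, 0.24, 0.25]   # placeholder values from the log

fig, ax1 = plt.subplots()
ax1.plot(train_loss, color='tab:red')
ax1.set_xlabel('epoch')
ax1.set_ylabel('training loss', color='tab:red')

# Second y-axis so the two curves can be compared epoch by epoch.
ax2 = ax1.twinx()
ax2.plot(val_accuracy, color='tab:blue')
ax2.set_ylabel('validation accuracy', color='tab:blue')

fig.tight_layout()
plt.show()
```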

urialon commented 3 years ago

Closing due to inactivity, feel free to re-open if there are any more questions.