wasiahmad / NeuralCodeSum

Official implementation of our work, A Transformer-based Approach for Source Code Summarization [ACL 2020].
MIT License

Generating summary for a file. #3

Closed Guardian99 closed 4 years ago

Guardian99 commented 4 years ago

After training, I have the tmp dir, but how do I test the trained model on a single file?

wasiahmad commented 4 years ago

If you are using the scripts provided by us, then it should perform the testing right after training. Please check the bash scripts. For example, check the transformer.sh script for the Java dataset.

Guardian99 commented 4 years ago

But I want to see the output on a single file. How do I do that?

Zhangxq-1 commented 4 years ago

I have the same problem. I think we should rewrite the test file.

wasiahmad commented 4 years ago

> But I want to see the output on a single file. How do I do that?

What do you mean by that? The generated summaries are written in a JSON file. Did you read this section from README?

Guardian99 commented 4 years ago

Here is what I am looking for. After running `bash transformer.sh 0,1 code2jdoc`, I have the files generated in the tmp directory, with all the names mentioned in the README. Now, I have a single Java file for which I want a summary. Since I have already trained the model and have the files mentioned in the README, how do I use them to generate a summary for that one file?

wasiahmad commented 4 years ago

I pointed to the bash scripts earlier to show how we perform testing. For example, this is the function we use to run the test with the transformer model.

```bash
function test () {
    echo "============TESTING============"
    RGPU=$1
    MODEL_NAME=$2
    PYTHONPATH=$SRC_DIR CUDA_VISIBLE_DEVICES=$RGPU python -W ignore ${SRC_DIR}/main/train.py \
        --only_test True \
        --data_workers 5 \
        --dataset_name $DATASET \
        --data_dir ${DATA_DIR}/ \
        --model_dir $MODEL_DIR \
        --model_name $MODEL_NAME \
        --dev_src test/code.${CODE_EXTENSION} \
        --dev_src_tag test/code.${CODE_EXTENSION} \
        --dev_tgt test/javadoc.${JAVADOC_EXTENSION} \
        --code_tag_type $CODE_EXTENSION \
        --use_code_type False \
        --uncase True \
        --max_src_len 150 \
        --max_tgt_len 50 \
        --max_examples -1 \
        --test_batch_size 64
}
```

Here, note the flags `dev_src`, `dev_src_tag`, and `dev_tgt`. It is important to note that our codebase expects you to have the target summaries in order to compute the evaluation metrics.

If you only want to generate summaries without performing an evaluation, you need to modify the code, which would be very easy. I will try to add a provision so that summaries can be generated without performing an evaluation. Hopefully within a week!
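For anyone attempting that modification in the meantime, the general idea is to make the scoring step conditional on having reference summaries. A minimal sketch of the pattern, with hypothetical helper and file names (this is not the repo's actual code):

```python
import json


def write_predictions(predictions, out_path, references=None):
    """Write generated summaries to a JSON-lines file.

    If `references` is None we are in generation-only mode and skip
    evaluation; otherwise return a toy exact-match score (the real
    codebase computes BLEU/ROUGE/METEOR instead).
    """
    with open(out_path, "w") as f:
        for i, pred in enumerate(predictions):
            f.write(json.dumps({"id": i, "prediction": pred}) + "\n")
    if references is None:
        return None  # generation-only: no metrics computed
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(predictions)
```

The same guard, applied around the metric computation in the test loop, is essentially all the modification requires.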

anshul17024 commented 4 years ago

Hi, I am having the same problem. I also want to generate the summary instead of evaluating it. Do I need to generate the subtokens as well, or would directly pasting into a test directory give the summary? Or could you briefly tell me where I need to make the changes?

anshul17024 commented 4 years ago

And can you also please tell me how you generated the subtoken files?

wasiahmad commented 4 years ago

Finally, the wait is over. I have updated the codebase. Please read this section to generate summaries from a source code input file.

wasiahmad commented 4 years ago

> And can you also please tell me how you generated the subtoken files?

Please see the `code_tokenizer` module to see how we performed sub-tokenization.

anshul17024 commented 4 years ago

Thank you so much, sir. The generate.sh script is working smoothly. Can you please explain how to use the tokenizer files? There are three such files and no description of them. Also, the tokenizer files take no input, so can you please describe the steps to use them to generate the subtoken files? It would be really helpful.

wasiahmad commented 4 years ago

The tokenizers are pretty simple and straightforward. You can instantiate those tokenizer classes to tokenize source code with CamelCase- and snake_case-based sub-tokenization.

The preprocessing of the source code should be done as per prior works (as mentioned in the paper). We used preprocessed code in our experiments; we didn't preprocess it ourselves. Only the sub-tokenization was done by us, and we provide the tokenizer for that.
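As a rough illustration of what CamelCase and snake_case sub-tokenization does, here is a minimal sketch (this is not the repo's actual `code_tokenizer`, just the idea behind it):

```python
import re


def subtokenize(token):
    """Split a code token into lowercase sub-tokens.

    Handles snake_case by treating underscores as separators, and
    CamelCase by inserting a break before an uppercase letter that
    follows a lowercase letter or digit.
    """
    parts = token.replace("_", " ")
    parts = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", parts)
    return [p.lower() for p in parts.split()]
```

For example, `subtokenize("getMaxValue")` yields `["get", "max", "value"]`, and `subtokenize("my_var_name")` yields `["my", "var", "name"]`.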

Please put some effort into understanding the codebase. It will save time for both of us.

Guardian99 commented 4 years ago

The repository of TLCodeSum by Xing-Hu does not give any explanation of the preprocessing part @wasiahmad. There are open as well as closed issues asking the same question, without any answers.

wasiahmad commented 4 years ago

From section 4.1 of the paper:

> We use Eclipse’s JDT compiler to parse source code into AST trees. Then we extract the Java methods, the API sequences within these methods, and the corresponding Javadoc comments which are standard comments for Java methods. These comments that describe the functionalities of Java methods are taken as code summaries. The source code is tokenized into tokens before they are fed into the network. To decrease noise introduced to the learning process, we only take the first sentence of the comments since they typically describe the functionalities of Java methods according to Javadoc guidance. However, not every comment is useful, so some heuristic rules are required to filter the data. Methods with empty or just one-word descriptions are filtered out in this work. The setter, getter, constructor, test methods, and override methods, whose comments are easy to predict, are also excluded.

You can get the tokenized source code from the AST.

A simpler alternative is to consider white-space based tokenization for source code.

anshul17024 commented 4 years ago

Thank you sir. Your responses were very useful! :)

Guardian99 commented 4 years ago

Thank you for being so active in answering the queries.