Closed Guardian99 closed 4 years ago
If you are using the scripts provided by us, then it should perform the testing right after training. Please check the bash scripts. For example, check the transformer.sh script for the Java dataset.
But I want to see the output for a single file. How do I do that?
I have the same problem as you. I think we should rewrite the test file.
But I want to see the output for a single file. How do I do that?
What do you mean by that? The generated summaries are written to a JSON file. Did you read this section of the README?
Here is what I am looking for. After running bash transformer.sh 0,1 code2jdoc, I have the files generated in the tmp directory with all the names mentioned in the README file. Now, I have a single Java method for which I want a summary. Since I have already trained the model and have the files mentioned in the README, how do I use them to generate a summary for that single file?
I pointed to the bash scripts earlier to show how we perform the test. For example, this is the function we use to run the test with the transformer model.
function test () {
echo "============TESTING============"
RGPU=$1
MODEL_NAME=$2
PYTHONPATH=$SRC_DIR CUDA_VISIBLE_DEVICES=$RGPU python -W ignore ${SRC_DIR}/main/train.py \
--only_test True \
--data_workers 5 \
--dataset_name $DATASET \
--data_dir ${DATA_DIR}/ \
--model_dir $MODEL_DIR \
--model_name $MODEL_NAME \
--dev_src test/code.${CODE_EXTENSION} \
--dev_src_tag test/code.${CODE_EXTENSION} \
--dev_tgt test/javadoc.${JAVADOC_EXTENSION} \
--code_tag_type $CODE_EXTENSION \
--use_code_type False \
--uncase True \
--max_src_len 150 \
--max_tgt_len 50 \
--max_examples -1 \
--test_batch_size 64
}
Here, note the flags: dev_src, dev_src_tag, and dev_tgt. It is important to note that our codebase expects you to have the target summaries in order to compute the evaluation metric values.
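For instance, a quick workaround with the current code is to wrap your single input as a one-example test set and let the existing --only_test path run on it. This is just a sketch: the file names code.original_subtoken and javadoc.original are assumptions standing in for code.${CODE_EXTENSION} and javadoc.${JAVADOC_EXTENSION} from the script, and the placeholder target line exists only to satisfy the evaluator.

```python
from pathlib import Path

def make_single_example_testset(code_snippet, data_dir="data/single"):
    """Wrap one code snippet as a one-example test set so the
    existing --only_test pipeline can run on it. The javadoc file
    gets a placeholder target because the evaluator expects one;
    the metric values it reports will be meaningless."""
    test_dir = Path(data_dir) / "test"
    test_dir.mkdir(parents=True, exist_ok=True)
    # One example per line, tokens separated by whitespace.
    (test_dir / "code.original_subtoken").write_text(code_snippet.strip() + "\n")
    (test_dir / "javadoc.original").write_text("placeholder summary .\n")
    return test_dir

make_single_example_testset(
    "public int add ( int a , int b ) { return a + b ; }"
)
```

Then point --data_dir at the new directory and read the predicted summary from the generated JSON file, ignoring the reported scores.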
If you only want to generate summaries without performing an evaluation, you need to modify the code, which would be very easy. I will try to add a provision so that summaries can be generated without performing an evaluation. Hopefully, within a week!
Hi, I am having the same problem. I also want to generate the summary instead of evaluating it. Do I need to generate the subtoken files as well, or would directly placing the code in a test directory give the summary? Also, could you briefly tell me where I need to make the changes?
And can you also please tell me how you generated the subtoken files?
Finally, the wait is over. I have updated the codebase. Please read this section to generate summaries from a source code input file.
And can you also please tell me how you generated the subtoken files?
Please see the code_tokenizer to see how we performed sub-tokenization.
Thank you so much, sir. The generate.sh is working smoothly. Can you please explain how to use the tokenizer files? There are 3 such files and no description of them. Also, the tokenizers do not ask for any input, so can you please describe the steps to use them to generate the subtoken files? It would be really helpful.
The tokenizers are pretty simple and straightforward. You can instantiate those tokenizer classes to tokenize source code with CamelCase- and snake_case-based sub-tokenization.
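To illustrate the core idea (this is a sketch of CamelCase and snake_case splitting, not the repo's actual tokenizer classes):

```python
import re

def subtokenize(token):
    """Split a code token on snake_case and CamelCase boundaries,
    lowercasing the resulting sub-tokens. Runs of capitals such as
    'HTTP' are kept together as a single sub-token."""
    parts = []
    for piece in token.split("_"):
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+", piece))
    return [p.lower() for p in parts if p]

print(subtokenize("getFileName"))       # ['get', 'file', 'name']
print(subtokenize("max_src_len"))       # ['max', 'src', 'len']
print(subtokenize("getHTTPResponse"))   # ['get', 'http', 'response']
```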
The preprocessing of the source code should be done as in prior works (as mentioned in the paper). We used preprocessed code in our experiments; we did not preprocess it ourselves. Only the sub-tokenization is done by us, and we provide the tokenizer for that.
Please put some effort into understanding the codebase. It will save time for both of us, which is precious.
The repository of TLCodeSum by Xing-Hu does not give any explanation of the preprocessing part @wasiahmad . There are open as well as closed issues raising the same query, without any answers.
From section 4.1 of the paper:
We use Eclipse’s JDT compiler to parse source code into AST trees. Then we extract the Java methods, the API sequences within these methods, and the corresponding Javadoc comments which are standard comments for Java methods. These comments that describe the functionalities of Java methods are taken as code summaries. The source code is tokenized into tokens before they are fed into the network. To decrease noise introduced to the learning process, we only take the first sentence of the comments since they typically describe the functionalities of Java methods according to Javadoc guidance. However, not every comment is useful, so some heuristic rules are required to filter the data. Methods with empty or just one-word descriptions are filtered out in this work. The setter, getter, constructor, test methods, and override methods, whose comments are easy to predict, are also excluded.
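The filtering heuristics quoted above can be sketched as follows. The prefix matching on method names and the "<init>" constructor marker are my assumptions about one possible implementation of the rules, not the paper's actual code:

```python
def keep_example(method_name, comment, is_override=False):
    """Apply the heuristic filters described in the paper (a sketch):
    keep only examples whose first comment sentence has more than
    one word and whose summaries are not trivial to predict."""
    # Only the first sentence of the comment is used as the summary.
    first_sentence = comment.strip().split(".")[0].strip()
    if len(first_sentence.split()) <= 1:
        return False  # empty or one-word description
    name = method_name.lower()
    # Setters, getters, constructors, and test methods are excluded.
    # ("<init>" as a constructor marker is an assumption here.)
    if name.startswith(("set", "get", "test")) or name == "<init>":
        return False
    if is_override:
        return False  # override methods are also excluded
    return True
```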
You can get the tokenized source code from AST.
A simpler alternative is to consider white-space based tokenization for source code.
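A sketch of that simpler alternative; padding common punctuation before splitting is my own addition so that brackets and separators come out as their own tokens:

```python
import re

def whitespace_tokenize(code):
    """Pad common punctuation with spaces, then split on whitespace."""
    code = re.sub(r"([(){}\[\];,.])", r" \1 ", code)
    return code.split()

print(whitespace_tokenize("foo(a, b);"))  # ['foo', '(', 'a', ',', 'b', ')', ';']
```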
Thank you sir. Your responses were very useful! :)
Thank you for being so active in answering the queries.
After training, I have the tmp dir, but how do I test the trained model on a single file?