tech-srl / code2seq

Code for the model presented in the paper: "code2seq: Generating Sequences from Structured Representations of Code"
http://code2seq.org
MIT License

Issues with preprocessing the DeepCom dataset #88

Closed: aishwariyarao217 closed this issue 3 years ago

aishwariyarao217 commented 3 years ago

Hello, I am trying to use the DeepCom dataset for the task of code search. I have a few problems with the dataset:

I have removed the HTML tags and chosen examples that have less than 40 tokens in the comments for my training data. This results in around 436725 examples in the training set along with the corresponding comments in the align.txt file. I followed the instructions in (https://github.com/tech-srl/code2seq/issues/45) for this.
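Roughly, the filtering step looked like this (a minimal sketch; the file names and the parallel one-example-per-line layout are placeholders rather than DeepCom's exact format):

```python
import re

MAX_COMMENT_TOKENS = 40          # keep only comments with fewer than 40 tokens
TAG_RE = re.compile(r"<[^>]+>")  # naive HTML tag stripper

def clean_comment(raw: str) -> str:
    """Remove HTML tags and collapse whitespace."""
    return " ".join(TAG_RE.sub(" ", raw).split())

# Placeholder layout: parallel files with one code snippet / one comment per line.
with open("code.txt", encoding="utf-8") as code_f, \
     open("comments.txt", encoding="utf-8") as com_f, \
     open("code.filtered.txt", "w", encoding="utf-8") as code_out, \
     open("align.txt", "w", encoding="utf-8") as align_out:
    for code_line, raw_comment in zip(code_f, com_f):
        comment = clean_comment(raw_comment)
        if 0 < len(comment.split()) < MAX_COMMENT_TOKENS:
            code_out.write(code_line)
            align_out.write(comment + "\n")
```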

The steps I have followed are:
a) Generate .java files and write the corresponding comments to the align.txt file. (436k examples)
b) Generate the train.raw.txt file using the Java extractor. This produces a file with the target_method_name and contexts. (380k examples)
c) Replace the target method name with the comment sequence. (A sketch of this step is shown after this list.)
d) Run preprocess.py on the file generated in step (c) to get the train.c2s file. (285k examples)
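The script for step (c) is essentially this (a minimal sketch; file names are placeholders, and it assumes the i-th line of train.raw.txt still matches the i-th comment in align.txt, which is exactly the assumption that breaks once the extractor drops examples):

```python
# Sketch of step (c): replace the first field of each raw line (the target
# method name) with the comment, joined with '|' like method-name subtokens.
# Assumes line i of train.raw.txt corresponds to line i of align.txt.
with open("train.raw.txt", encoding="utf-8") as raw_f, \
     open("align.txt", encoding="utf-8") as align_f, \
     open("train.labeled.txt", "w", encoding="utf-8") as out_f:
    for raw_line, comment in zip(raw_f, align_f):
        _old_target, _, contexts = raw_line.partition(" ")
        target = "|".join(comment.split())
        out_f.write(f"{target} {contexts}")
```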

1) When I run the Java Extractor on this training set to generate the train.raw.txt file with the method name and contexts, I end up with 381331 contexts. Could you shed some light on why this might be happening? Also, there is no way of mapping which examples are being dropped, since the preprocess.sh script generates its own target method names, correct? How can I map the example_id to the contexts so that I only insert the comments for the examples whose contexts have been generated?

2) After running preprocess.py on the file generated in step (c), my train.c2s file contains around 285584 examples. I am not sure why so many examples are being dropped.

3) I also didn't quite understand how the preprocess.sh script generates its own target label. How does this work?

Thank you!

urialon commented 3 years ago

Hi @aishwariyarao217 , Thank you for your interest in code2seq!

  1. When I run the Java Extractor on this training set to generate the train.raw.txt file with the method name and contexts, I end up with 381331 contexts. Could you shed some light on why this might be happening?

I can't tell the exact reason why examples are dropped, but the JavaExtractor might drop them if they are too short (less than a single line) or if they do not parse.

  2. After running preprocess.py on the file generated in step (c), my train.c2s file contains around 285584 examples. I am not sure why so many examples are being dropped.

The preprocess.py file is not supposed to drop examples at all. Can you please provide its console output?

  3. I also didn't quite understand how the preprocess.sh script generates its own target label. How does this work?

If I understand your question correctly, the example_id is just the line number in the original DeepCom file.
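For illustration (a sketch; it assumes align.txt preserves the original DeepCom line order), the lookup from example_id to comment is just:

```python
# Sketch: map example_id (the 1-based line number in the original DeepCom
# file) to its comment, assuming align.txt preserves that order.
with open("align.txt", encoding="utf-8") as f:
    id_to_comment = {str(i): line.strip() for i, line in enumerate(f, start=1)}
```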

Best, Uri

aishwariyarao217 commented 3 years ago

Thank you for the quick reply!

For question 3, my question is actually about how the Java Extractor generates the target label that should be predicted for each code snippet in the train.raw.txt file (the label that is inserted as tokens at the beginning of each line, before the contexts).

I read in https://github.com/tech-srl/code2seq/issues/47 that preprocess.sh generates its own target labels.

urialon commented 3 years ago

Hi @aishwariyarao217 , In the standard pipeline (for method name prediction), the JavaExtractor just prints the method name as the target label.

In the modified pipeline for DeepCom, the JavaExtractor puts the filename as the target label. Later, the python script replaces that target label with the actual comment that should be generated.
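For illustration only (not the exact gist code; file names are placeholders, and it assumes each .java file is named after its example_id, i.e. its line number in the original DeepCom file), the replacement step can be sketched like this:

```python
# Sketch of the DeepCom-style replacement: the target label written by the
# JavaExtractor is the example id (taken from the .java file name), so the
# comment is looked up by id rather than by line position. Dropped examples
# therefore cannot shift the alignment.
with open("align.txt", encoding="utf-8") as f:
    id_to_comment = {str(i): line.strip() for i, line in enumerate(f, start=1)}

with open("train.raw.txt", encoding="utf-8") as raw_f, \
     open("train.labeled.txt", "w", encoding="utf-8") as out_f:
    for raw_line in raw_f:
        example_id, _, contexts = raw_line.partition(" ")
        comment = id_to_comment.get(example_id)
        if comment is None:
            continue  # no comment for this id; skip the example
        target = "|".join(comment.split())
        out_f.write(f"{target} {contexts}")
```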

I hope it helps. For exact details, see the gist scripts that are linked from #45 .

Best, Uri

aishwariyarao217 commented 3 years ago

Ah yes, sorry, I got confused after reading the other post. I thought the target labels were being generated by the extractor rather than extracted from the training set. Thanks for all your help.