tech-srl / code2seq

Code for the model presented in the paper: "code2seq: Generating Sequences from Structured Representations of Code"
http://code2seq.org
MIT License

Preprocessing CODENN dataset #69

Closed shreyasingh closed 3 years ago

shreyasingh commented 4 years ago

Hi,

I'm using preprocess_csharp.sh to preprocess the code snippets from the CODENN dataset: https://github.com/sriniiyer/codenn/tree/master/data/stackoverflow/csharp. I extracted the code snippets from the train, test, and val files given at that link and put them in separate files to match the input format accepted by preprocess_csharp.sh. However, I realised that the first command, which calls extract.py and generates the raw files, doesn't process all of the code snippets. It only generates paths for snippets that are functions and produces zero results for small, incomplete code snippets. Is this behavior expected? For example, for initial testing I created a train folder and put 5 .cs files in it, but results were generated for only one of the five files, which differed from the others in that it was a function.

urialon commented 4 years ago

Hi @shreyasingh,

The preprocess_csharp.sh script is designed to preprocess C# files, but it is currently not adapted to read the format of the CodeNN dataset (although it can be adjusted to read CodeNN).

preprocess_csharp.sh is designed to create a new dataset from C# sources, not specifically the CodeNN dataset that we used in the paper.

In general, we found the CodeNN dataset to be very noisy and I don't recommend using it. See additional discussion here: #17

shreyasingh commented 4 years ago

Thanks for your prompt reply. I am working on an NCS prototype for the C# language during my internship and have to use StackOverflow data, hence I thought of using the CodeNN dataset, which has SO data as (q,a) pairs. Do you have any suggestions on how the CodeNN code snippets can be processed by adjusting the current codebase?

Another doubt I had: while eyeballing some of the outputs generated by extract.py in the raw files, I see a lot of arbitrary integers. For example, for this snippet:

```csharp
void OnBtnClick(...) {
    List ctrls = new List(Controls);
    Controls.Clear();
    foreach (Control ctrl in ctrls)
        ctrl.Dispose();
    Controls.Add(new yourContrl());
}
```

The output produced is this: on|btn|click void,-1754499423,METHOD_NAME control,-1121172649,list control,-1121172649,list control,-1783023667,controls control,-82843025,controls control,-82843025,controls control,-1559550682,ctrls control,455031522,ctrls control,1954938487,ctrls control,1767276292,clear control,-2069786047,ctrl control,1767276292,add control,1239719708,METHOD_NAME control,-1804999543,METHOD_NAME list,1038708574,controls list,-1074746772,ctrls list,-476475925,ctrls list,-1290997779,METHOD_NAME controls,357506982,ctrls controls,-961345789,ctrls controls,-961345789,ctrls controls,196749552,clear controls,1682134351,ctrl controls,1682134351,ctrl controls,196749552,add controls,459391427,your|contrl controls,-1615195496,METHOD_NAME controls,-1615195496,METHOD_NAME ctrls,1767276292,clear ctrls,-682017568,ctrl ctrls,-2069786047,ctrl ctrls,2107730483,dispose ctrls,1767276292,add ctrls,-670475966,METHOD_NAME ctrls,-1804999543,METHOD_NAME clear,-1923331196,ctrl clear,-14470946,METHOD_NAME ctrl,196749552,dispose ctrl,-701014142,dispose ctrl,-1708090184,add ctrl,1817118294,METHOD_NAME ctrl,577676812,METHOD_NAME dispose,-303184305,METHOD_NAME add,-1909679729,your|contrl add,-14470946,METHOD_NAME list,1841930479,list list,743915356,list ctrl,-809580137,ctrl ctrl,-1523702025,ctrl

Could you explain why this is happening? Thanks!

urialon commented 4 years ago

Oh I see. Using CodeNN for code search makes more sense than for code captioning.

Modifying the CSharpExtractor to work on CodeNN instead of extracting method names is easy:

Regarding the hashed paths - there is a no_hash flag, which is false by default here, i.e., paths are hashed by default.

I just committed this change that makes this flag true by default, so if you pull now, you should no longer see this hashing.

Best, Uri

shreyasingh commented 4 years ago

Thanks Uri for your reply. The no_hash flag helped!

For now, I have written a small preprocessing script which pulls code snippets out of the CodeNN dataset into separate files, so that I don't have to change the code and can continue to work with a 'list of files'. I will change this eventually once I get a better hang of C#.

I still have two doubts lingering:

  1. At this line, I tried printing the pathString for a sample file containing: System.Web.Script.Serialization.JavaScriptSerializer oSerializer = new System.Web.Script.Serialization.JavaScriptSerializer(); string sJSON = oSerializer.Serialize(YOUR CLASS HERE);

But it didn't output anything. Is there a limitation that only specific types of code snippets will be processed?

  2. Also, just to confirm, I'll have to replace subtokensMethodName at this line with the NL query from the CodeNN dataset, right?

Thanks!

urialon commented 4 years ago

Hi @shreyasingh ,

  1. The pathString object holds the string that you eventually see printed, like on|btn|click void,-1754499423,METHOD_NAME (without the hashing). I did not understand your code - why do you need that JavaScriptSerializer? pathString is just a string.

  2. Exactly! subtokensMethodName is the natural language "label" that is associated with the code snippet. The words are separated by | for no special reason; later, in the Python code, I split the words on this character.

By the way, instead of printing, why don't you just run this in Visual Studio in debug mode?

shreyasingh commented 4 years ago

Hi Uri,

I think I didn't explain doubt 1 correctly. What I meant was: I had a sample .cs source file containing the following code snippet: System.Web.Script.Serialization.JavaScriptSerializer oSerializer = new System.Web.Script.Serialization.JavaScriptSerializer(); string sJSON = oSerializer.Serialize(YOUR CLASS HERE);

JavaScriptSerializer is just part of the C# code snippet that has to be processed, nothing else. When I preprocessed this file, I didn't get any paths for this snippet. So, to check what's going on, I printed out pathString at this line, and it was null. So I wanted to know why no paths are being generated for such snippets. As a next step, I wrapped the above snippet inside a dummy function with a dummy function name, and it produced some paths. I just wanted to know if there's any other setting/hack through which we can get paths for such incomplete/small snippets.

Thanks for your suggestion, I'll use the debug mode!

urialon commented 4 years ago

Oh OK, sorry. Check whether the loop is even entered, i.e., whether there are any paths to loop over.

Yes, it is a good idea to wrap it in a dummy function (and maybe even a dummy class), as early as where you parse the file here. Check the type of the tree node to see whether it is a method/class, and if not, wrap it with a dummy method+class. Sorry for forgetting to tell you about this hack; I don't think there are others.

urialon commented 4 years ago

Hi @shreyasingh, did you manage to preprocess CODENN using our CSharpExtractor?

shreyasingh commented 4 years ago

Hi Uri,

Yes, I was able to preprocess CODENN. I'm including a short snippet of the final my_dataset.val.c2s file - if you could confirm whether its format is correct, that would be great. Essentially, I replaced the first token (the target) with a tokenized NL query ('|'-delimited). There might be multiple lines with the same first token (NL query), since a code snippet returns multiple paths and I replace the first token (the target, which can be a function name/method invocation) with the NL query. [screenshot of my_dataset.val.c2s attached]
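For reference, this is roughly how I do the replacement - a minimal sketch with assumed file names, assuming one extractor output line per snippet and a line-aligned file of NL queries (not the exact code I can share):

```python
# Sketch: replace the first (target) token of each extractor output line with
# the '|'-joined natural-language query from CodeNN.
def retarget_line(extractor_line: str, nl_query: str) -> str:
    # An extractor line looks like "<target> <ctx1> <ctx2> ...", where each
    # context is "token,path,token" (see the example output earlier in this thread).
    _, _, contexts = extractor_line.partition(" ")
    target = "|".join(nl_query.lower().split())  # e.g. "convert|list|to|json"
    return f"{target} {contexts}"

# Assumed files: raw extractor output and the matching CodeNN queries, line-aligned.
with open("raw.val.txt") as raw, open("queries.val.txt") as queries, \
        open("my_dataset.val.c2s", "w") as out:
    for line, query in zip(raw, queries):
        out.write(retarget_line(line.rstrip("\n"), query.strip()) + "\n")
```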

Just confirming - as a next step I would run train.sh, pointing it to the correct input files, right? I hope to reproduce results similar to the code captioning task you presented in the paper (BLEU score of 23.08) and will go from there to turn it into a code search approach. Is there anything else I need to do before training?

Another small thing - I read in one of the closed issues that the code captioning task runs for about 31 epochs. Could you let me know how much time each epoch takes, what machine you ran it on (CPU/GPU), and its specs, if you remember? I have the configs for the code captioning task, which I found in one of the closed issues.

Thanks for your help and advice!

shreyasingh commented 4 years ago

One more doubt: for computing the BLEU score for the code captioning task, this line has to be uncommented, right? And I see it calls this function. So to run it, there would be some other dependencies that need to be installed to execute this command: ["perl", "scripts/multi-bleu.perl", ref_file_name]
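From what I understand, the invocation would look roughly like the sketch below (the file names are my assumptions, and I'm assuming the script follows the usual Moses convention of taking the reference file as an argument and reading predictions from stdin, so Perl itself is the main extra dependency):

```python
import subprocess

ref_file_name = "ref.txt"    # assumed: reference file written by the evaluate step
pred_file_name = "pred.txt"  # assumed: predictions file written by the evaluate step

# Run the OpenNMT/Moses-style multi-bleu.perl script, piping predictions via stdin.
with open(pred_file_name) as pred_file:
    result = subprocess.run(
        ["perl", "scripts/multi-bleu.perl", ref_file_name],
        stdin=pred_file, capture_output=True, text=True)
print(result.stdout)
```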

urialon commented 4 years ago

Hi,

This is good to hear. The data looks good, except that some tokens were replaced with METHOD_NAME. The reason is that in method name prediction we don't want the ground truth method name to "leak" into the input. In your case, if the original method names are informative, this masking may hurt your results. On the other hand, if all examples are "wrapped dummy functions", then the actual method name is a dummy name, so it doesn't matter.

This masking of the original method names is performed here: https://github.com/tech-srl/code2seq/blob/master/CSharpExtractor/CSharpExtractor/Extractor/Variable.cs#L82

Regarding training - the epochs are quite fast on a GPU, because the dataset is small. See also this closed issue https://github.com/tech-srl/code2seq/issues/17 for additional specs.

Regarding BLEU - I don't remember why, but this implementation uses the BLEU script from OpenNMT. If you want the same BLEU computation as in CodeNN (as we did in the paper), use import bleu and then bleu_score = bleu.calc_bleu(TEST_REF_PATH, log_file_name). This might require pip install bleu. Notice that the reference file (TEST_REF_PATH) from the CodeNN dataset should have 3 times more lines than the model's predictions file, because they manually annotated each example 3 times (and the BLEU equations are suitable for using multiple references, multiple "truths", for each example).
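Roughly, the CodeNN-style computation would look like this sketch (the paths are placeholders, and calc_bleu's exact signature should be checked against the bleu package you install):

```python
import bleu  # may require: pip install bleu

# Placeholder paths: CodeNN's ground-truth annotations and the model's predictions.
TEST_REF_PATH = "codenn_ref.txt"
log_file_name = "pred.txt"

bleu_score = bleu.calc_bleu(TEST_REF_PATH, log_file_name)
print(bleu_score)
```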

Sorry about this. We didn't want to release the CodeNN setup because we eventually realized that it is not a good benchmark dataset. There is a constant string such that, if you program your model to always return it as the prediction, the model performs on par with most baselines (around 21 BLEU). I mean - there is a constant string like "what do of a string" such that a dummy model that always returns it (like def model(x): return 'what do of a string') gets around 21 BLEU.

shreyasingh commented 4 years ago

Thanks Uri.

I'm still not clear on how to calculate BLEU. In bleu_score = bleu.calc_bleu(TEST_REF_PATH, log_file_name), could you explain what TEST_REF_PATH and log_file_name are? I trained the model for about 10 epochs, and in the evaluate function three files get generated: log.txt, pred.txt and ref.txt. Are TEST_REF_PATH and log_file_name among these?

For my model, I am passing my_dataset.train.c2s and my_dataset.val.c2s for training and validation, respectively. These files have the same format as the snippet I included in my message above, and are generated from train.txt and valid.txt, which are available in the CodeNN repo here. So I'm a bit unclear on which files I should pass as TEST_REF_PATH and log_file_name. Another possible problem in my case is that, since I have multiple lines in my train and valid files with the same NL target (see lines 16 and 17 in the image in my previous message), there will be multiple predictions for the same ground truth. I'm attaching a snippet from log.txt to clarify this. Any tips on how to calculate BLEU for this would be helpful 👍 [screenshot of log.txt attached]

urialon commented 4 years ago

TEST_REF_PATH is CodeNN's "ground truth" file for the validation examples. log_file_name should be the pred.txt file.

For the test set, TEST_REF_PATH is supposed to have 3x more lines than the pred.txt file, because every code snippet has 3 possible ground truth annotations in CodeNN. But in the validation set, I think every example has only a single annotation. Anyway, this is just a sanity check; bleu.calc_bleu should take care of it.

If different, independent examples have the same label, that's fine - I don't think it's a problem for the computation of BLEU. But it is weird; are you sure there is no bug in your matching between examples and their labels? Does this duplication also appear in the original CodeNN data files?

bhavyagera10 commented 4 years ago

@shreyasingh Can you please share how you went about preprocessing the CODENN dataset? Maybe the notebook for it? Or just the changes you made to suit CODENN? Thanks a lot, I would be grateful.

shreyasingh commented 4 years ago

Hi @bhavyagera10. I won't be able to share the code, as I worked on it as part of my internship and can't open-source it. But I can give you a few pointers:

  1. I first modified files like Program.cs to read in the CodeNN train file as a whole. The current implementation takes in a list of files, each of which contains a C# function, but for CodeNN we don't need that, since each line of the file contains a code snippet.
  2. Some of the code snippets will not produce any paths, as they are not inside a function body. For those, I would recommend first wrapping them inside a function - just add the strings void testfunction(){ and } around the code snippet. I.e., preprocess the CodeNN dataset first, wrapping any code snippet that is not in a function body, before passing it through the extractor (see the sketch after this list).
  3. I would also recommend going through the files extract.py, Program.cs, Extractor.cs, and preprocess.py to understand the code. It is well written, and you can run parts of it to test.
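A rough sketch of the wrapping idea from points 1-2 (the file names, the assumption that each CodeNN line holds one snippet in its last tab-separated field, and the method-detection heuristic are all mine, not code from this repo):

```python
import os
import re

# Crude check for something that already looks like a method declaration;
# a proper check would parse the snippet (e.g. with Roslyn) and inspect the
# tree node type, as suggested earlier in this thread.
METHOD_RE = re.compile(r"^\s*(public|private|protected|internal|static|void|\w+)\s+\w+\s*\(")

os.makedirs("codenn_train", exist_ok=True)
with open("train.txt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        snippet = line.rstrip("\n").split("\t")[-1]  # assumed column layout
        if not METHOD_RE.match(snippet):
            # Wrap bare statements in a dummy function so the extractor emits paths.
            snippet = "void testfunction(){ " + snippet + " }"
        with open(f"codenn_train/snippet_{i}.cs", "w", encoding="utf-8") as out:
            out.write(snippet)
```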

Ping me on this thread if you need more help. Thanks.

bhavyagera10 commented 4 years ago

Thank you @shreyasingh. I have used this in the preprocess_csharp.sh file and am able to get the .c2s files generated. But when I train the model, I find that the dataset contains PreFetchDatabase objects and not tensor objects. Did you also face this issue? Otherwise, maybe my preprocessing is not right, although I see that preprocess_csharp.sh is working correctly. I attach my code for preprocess_csharp.sh. [screenshots of preprocess_csharp.sh attached]