Open breandan opened 1 year ago
Hi George, just a quick update in case you were working on the anonymized dataset. I was able to partially reproduce the Seq2Parse results on an alternate dataset from Wong et al. (2019); however, the source code predictions are a little tricky to compare due to the aforementioned issue with mapping abstract sequences back to character sequences. Although I wasn't sure how to obtain the Precision@{10,20,50} over concrete source code, I was able to run the seq2parse.py script. Based on a Top-1 analysis of ~400 broken/fixed pairs from the StackOverflow dataset containing <3 abstract token edits, roughly ~86% of the Seq2Parse repairs were syntactically valid, ~35% matched the abstract tokens of the human fixes, and ~0.5% matched the human fixes at the character level. Are those numbers drastically out of line with what we should expect? Also, FYI, the web demo now seems to be unavailable. Thank you again.
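For reference, the Top-1 comparison was along these lines; a minimal sketch in which the tokenize-based abstraction and the repair_fn hook are simplifications of my own rather than Seq2Parse's actual pipeline:

```python
import io
import tokenize

def abstract(src):
    """Coarse token-kind abstraction (a stand-in for the grammar-based
    abstraction, used only for comparing repairs to human fixes)."""
    try:
        toks = tokenize.generate_tokens(io.StringIO(src).readline)
        return [tokenize.tok_name[t.type] for t in toks]
    except (tokenize.TokenError, SyntaxError):
        return None

def is_valid(src):
    """True if the repaired program is syntactically valid Python."""
    try:
        compile(src, "<repair>", "exec")
        return True
    except SyntaxError:
        return False

def top1_report(pairs, repair_fn):
    """pairs: list of (broken, human_fixed) source strings;
    repair_fn: broken source -> the tool's top-1 repair string."""
    valid = abstract_match = exact_match = 0
    for broken, fixed in pairs:
        repair = repair_fn(broken)
        valid += is_valid(repair)
        abstract_match += (abstract(repair) is not None
                           and abstract(repair) == abstract(fixed))
        exact_match += (repair == fixed)
    n = len(pairs)
    return valid / n, abstract_match / n, exact_match / n
```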
Hi @breandan,
Sorry for the late reply, it's been quite busy. Of course I remember our talk back in December, and nice to hear from you again!
You can check create_ecpp_dataset_full.py for how the training and test sets were created, using ecpp_individual_grammar.py (i.e. the Earley parser) to abstract the programs. _ENDMARKER_ just signifies the end of a program in our grammar python-grammar.txt. There is no repairing happening here, just predicting the error rules.
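As a rough illustration of what an abstracted program looks like, here is a sketch using Python's own tokenize module as a stand-in for the grammar-based abstraction in ecpp_individual_grammar.py, so the terminal names will not match python-grammar.txt exactly:

```python
import io
import tokenize

# Rough stand-in for the grammar-based abstraction: every program becomes a
# sequence of terminal names, and the sequence always ends with _ENDMARKER_.
def abstract_program(src):
    toks = tokenize.generate_tokens(io.StringIO(src).readline)
    return ["_%s_" % tokenize.tok_name[t.type] for t in toks]

print(abstract_program("x = 1\n"))
# ['_NAME_', '_OP_', '_NUMBER_', '_NEWLINE_', '_ENDMARKER_']
```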
The actual_tokens are kept reversed as a stack in order to fill in the information that hasn't changed. The fix_seq_operations tell us if we have a token insertion, deletion or replacement, and we use the stack accordingly to avoid using the wrong tokens. For example, you will see in the first 2 cases, for '<<+' and '<<$', that when we are inserting or replacing a token, we use new dummy ones such as "simple_name". Unfortunately, the way we developed the error-correcting parser back then didn't allow us to avoid some cosmetic changes and preserve formatting.
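To make the stack idea concrete, here is a minimal sketch; the operation markers and the function below are simplified assumptions, not the exact fix_seq_operations format:

```python
def fill_in_tokens(fix_seq_ops, actual_tokens):
    """fix_seq_ops: one marker per output position ('<<+' insert, '<<$' replace,
    'del' delete, anything else keep); actual_tokens: the original concrete tokens.
    The marker names other than '<<+' and '<<$' are assumptions for this sketch."""
    stack = list(reversed(actual_tokens))    # unchanged info is consumed as a stack
    repaired = []
    for op in fix_seq_ops:
        if op == '<<+':                      # insertion: no original token to reuse,
            repaired.append('simple_name')   # so fill in a dummy token
        elif op == '<<$':                    # replacement: drop the original token
            stack.pop()                      # and fill in a dummy one instead
            repaired.append('simple_name')
        elif op == 'del':                    # deletion: just drop the original token
            stack.pop()
        else:                                # unchanged position: reuse the original
            repaired.append(stack.pop())
    return repaired
```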
seq2parse.py kind of tries to undo those cosmetic changes with the diff, but I think it might not be too accurate. Let me know if you have any more questions.
Hi @gsakkas, I hope you are doing well. I am not sure if you recall, but we met briefly after your talk in New Zealand last December. I am working on reproducing the results on the 15k ERule and HumanEval dataset and had a few questions about the abstract sequences used in sections 7.1-7.4 of the paper. Any suggestions or advice you could provide would be greatly appreciated.
Is the test set the one located under src/human_study, or is there another test set of source code snippets? I see the test entries carry fields such as tokns, tok_chgs, dur, and popular. predict_eccp_classifier_partials.py compares the classifier prediction y_pred with the tok_chgs using the labels file erule_labels-partials-probs.json; however, I am not quite sure how to obtain the ground truth abstract user fix from this information.

For example, considering one of the test entries: I understand its tok_chgs is Err_Literals -> H Literals <++> InsertErr -> is, which refers to [105, 323], but it is not yet clear to me how the tokns are altered in the ground truth fix. Does the suffix after _ENDMARKER_ identify a unique abstract sequence fix?
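For concreteness, this is the kind of lookup and comparison I have in mind; a minimal sketch where the structure of erule_labels-partials-probs.json (a flat mapping from error-rule strings to label indices) is my assumption rather than the documented format:

```python
import json

# Assumed structure, not the documented one:
# {"Err_Literals -> H Literals": 105, "InsertErr -> is": 323, ...}
with open("erule_labels-partials-probs.json") as f:
    erule_labels = json.load(f)

def label_indices(tok_chgs):
    """Split a tok_chgs entry like
    'Err_Literals -> H Literals <++> InsertErr -> is'
    into individual error rules and look up their label indices."""
    rules = [r.strip() for r in tok_chgs.split("<++>")]
    return [erule_labels[r] for r in rules]

def hits_at_k(y_pred_ranked, tok_chgs, k):
    """True if every ground-truth error-rule label appears among the
    top-k predicted labels (one way to read Precision@k over rules)."""
    top_k = set(y_pred_ranked[:k])
    return all(lbl in top_k for lbl in label_indices(tok_chgs))
```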
When a _NAME_ or whitespace token is substituted, inserted or deleted in the abstract token sequence, this can introduce cosmetic changes to parts of the input which are lexically identical in the abstract token sequence. Is there a way to map the tokenwise edits back to the exact character subsequence in the concrete source code while preserving the original formatting?

It is also possible I am mistaken or misunderstanding an important detail. If so, any clarification would be welcome. Thank you!
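To illustrate the kind of mapping I am after, here is a rough sketch based on Python's tokenize character offsets; it only shows the idea of rewriting a single token's span without touching the surrounding formatting, and is not taken from the repo:

```python
import io
import tokenize

def token_spans(src):
    """Map every lexical token to its exact character span in src, so an
    edit at abstract-token position i can be localized in the concrete code."""
    line_starts = [0]
    for line in src.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))
    spans = []
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        (srow, scol), (erow, ecol) = tok.start, tok.end
        spans.append((tok.string,
                      line_starts[srow - 1] + scol,
                      line_starts[erow - 1] + ecol))
    return spans

def replace_token(src, i, new_text):
    """Rewrite only the i-th token's characters; all other characters
    (whitespace, comments, formatting) are left untouched."""
    _, start, end = token_spans(src)[i]
    return src[:start] + new_text + src[end:]
```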
cc: @jin-guo @xujiesi