kwz219 / NPR4J


Instruction on running test cases #7

Closed thanhlecongg closed 1 year ago

thanhlecongg commented 1 year ago

Hi,

First, thank you very much for your useful framework and interesting paper.

I'm trying to use your framework to run APR techniques. However, I'm struggling to recover the original Defects4J code from the predictions in order to run the test cases. In particular, I do not know how to map the current (bug, fix) pairs to the original Defects4J bugs. Could you provide more instructions and metadata for this purpose? Many thanks.

kwz219 commented 1 year ago

Hi, @thanhlecongg ! Do you mean matching ids in our data to the original Defects4J bugs? If so, please refer to the meta information in this folder: https://drive.google.com/drive/folders/1LulqZWVftmFevh-DeCtjA3HLKarXje7q

Binfo_d4j.json contains concrete bug information, and Minfo_d4j.json contains meta information about the buggy function.
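
For reference, here is a minimal sketch of how that mapping could be recovered from Binfo_d4j.json, assuming (as the snippets later in this thread suggest) that each entry has an "_id" with an "$oid" hash and a "parent_id" path whose file name starts with the Defects4J project and bug number; field names may differ in other versions of the files:

import json
import codecs

# Sketch only: build a lookup from hunk hash id to Defects4J bug id.
line_bug_info = json.load(codecs.open("meta_info/Binfo_d4j.json", "r", encoding="utf8"))

id_to_bug = {}
for entry in line_bug_info:
    hash_id = entry["_id"]["$oid"]
    # parent_id looks like "...\\BF_Rename/Lang_4_LookupTranslator.buggy@...";
    # the file name starts with "<Project>_<BugNumber>_".
    name = entry["parent_id"].split("\\")[-1].split("/")[-1]
    project, bug_number = name.split("_")[0:2]
    id_to_bug[hash_id] = project + "-" + bug_number

With such a dict, each prediction id can be traced back to its original Defects4J bug.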

thanhlecongg commented 1 year ago

Thanks for your quick reply. This is what I want. Btw, I have one more question: the paper mentions that NPR4J uses 260 bugs for evaluation, but after preprocessing we obtain ~400 bugs (394 bugs for SeqR and 400 bugs for CocoNUT). Could you provide the ids of these 260 bugs? Thanks.

kwz219 commented 1 year ago

Please download EvaluationBenchmarks.zip from https://drive.google.com/file/d/1dSVsGaU9z1Q3a1AU-KKPmqp6HO_xvUOs/view?usp=share_link (EvaluationBenchmarks/d4j.ids contains the ids we used). To ease the experiment, we only evaluate NPR models on one-line replacement bugs and on bugs composed of one-line replacements. So we filtered the 835 bugs of Defects4J V1+V2 and got 260 bugs (nearly 220 are one-line bugs and the others are composed of multiple one-line changes).

thanhlecongg commented 1 year ago

Yes, I have downloaded the file, but I obtained 400 ids instead of 260. Please see the attached screenshot.

kwz219 commented 1 year ago

Each id represents only a "one-line replacement" code change. Some bugs are composed of multiple "one-line replacements" (i.e., consist of several ids). At each time, NPR systems only focus on generating patch code for one id. When evaluating, if a bug is composed of multiple ids, we sequentially apply id-patches to each id-position. For example, if bug A has 3 hunks (represented by 3 ids) and we generate 2 candidates for each hunk (id), then the patch-apply sequence will be: (1,1,1), (1,1,2), (1,2,1), (1,2,2), (2,1,1) ......
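
As an illustration only (not NPR4J's actual code), that enumeration order is exactly what itertools.product gives, assuming each hunk has the same number of ranked candidates:

from itertools import product

# Sketch of the patch-apply sequence described above: for a 3-hunk bug with
# 2 candidates per hunk, enumerate (1,1,1), (1,1,2), (1,2,1), (1,2,2), (2,1,1), ...
num_hunks, candidates_per_hunk = 3, 2
for combo in product(range(1, candidates_per_hunk + 1), repeat=num_hunks):
    print(combo)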

thanhlecongg commented 1 year ago

Thanks, I got it. But, as NPR4J produces top-100 predictions, wouldn't the number of patches that need to be tested for a bug with 3 hunks be very large (100^3)?

kwz219 commented 1 year ago

Yes, so we set a maximum number of total evaluations for each bug. For example, to limit the total evaluations to 100 on a 3-hunk bug, we first calculate the maximum X that satisfies X^3 <= 100, then the sequence will be (from 1 to X, from 1 to X, from 1 to X).
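
In other words, with a per-bug budget B and H hunks, X is the largest integer with X^H <= B. A small sketch of that calculation (illustrative, not the framework's actual code):

from itertools import product

def capped_sequences(num_hunks, budget):
    # Largest per-hunk depth X with X ** num_hunks <= budget.
    x = 1
    while (x + 1) ** num_hunks <= budget:
        x += 1
    return x, list(product(range(1, x + 1), repeat=num_hunks))

x, seqs = capped_sequences(num_hunks=3, budget=100)
print(x, len(seqs))  # 4 64, since 4**3 = 64 <= 100 < 5**3 = 125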

thanhlecongg commented 1 year ago

Everything makes sense to me now. Many thanks for your kind explanations.

thanhlecongg commented 1 year ago

Hi,

I followed your instructions to run the experiments. We found that, from the hunk ids in d4j.ids.new, only 191 out of the 260 Defects4J bugs have all of their modified hunks in Binfo_d4j.json, while the other 69 bugs contain at least one hunk but not all of them. I wonder why some hunks from these 69 bugs were not considered in your evaluation. Is this due to "Step 3: Purifying and enriching evaluation resources" in your dataset construction? And why don't you only consider the bugs containing all hunks, since APR cannot completely fix a bug for which hunks are missing?

Many thanks. Have a nice day.

thanhlecongg commented 1 year ago

These are the ids of the 69 bugs I mentioned: ['Closure-165', 'JacksonDatabind-103', 'Closure-155', 'Closure-134', 'Closure-34', 'JxPath-13', 'JacksonDatabind-38', 'Cli-39', 'Jsoup-87', 'JacksonDatabind-10', 'Closure-90', 'JacksonCore-12', 'Math-18', 'Closure-147', 'Closure-169', 'JacksonDatabind-15', 'Mockito-17', 'Closure-157', 'JacksonDatabind-52', 'Closure-27', 'Compress-47', 'JxPath-16', 'Closure-108', 'Closure-148', 'Mockito-14', 'Closure-72', 'Mockito-11', 'Closure-144', 'Jsoup-92', 'Closure-37', 'Chart-18', 'Mockito-23', 'JxPath-20', 'Time-26', 'Gson-4', 'JacksonDatabind-55', 'Math-83', 'Closure-149', 'Closure-163', 'Lang-32', 'Closure-167', 'Math-100', 'Cli-31', 'Closure-89', 'JacksonDatabind-31', 'JacksonDatabind-95', 'Closure-30', 'Math-81', 'JacksonCore-17', 'Closure-100', 'Chart-22', 'JacksonCore-24', 'Math-47', 'Cli-1', 'Cli-13', 'JacksonDatabind-53', 'JacksonDatabind-65', 'JacksonDatabind-73', 'Closure-75', 'JacksonDatabind-14', 'Lang-36', 'Lang-15', 'Cli-33', 'Closure-9', 'Mockito-4', 'Cli-18', 'Math-65', 'JacksonDatabind-108', 'Math-62']

This is my code to count these bugs.

import json
import codecs
def load_info(line_bug_info, method_bug_info):
    # Build a nested dict: bug_id -> hunk hash id -> hunk metadata.
    bug_info = {}
    for idx in range(len(line_bug_info)):
        # Each line-level entry in Binfo_d4j.json points to its parent method entry in Minfo_d4j.json.
        assert line_bug_info[idx]["parent_id"] == method_bug_info[idx]["_id"]
        hash_id = line_bug_info[idx]["_id"]["$oid"]
        # parent_id looks like "...\\BF_Rename/Lang_4_LookupTranslator.buggy@...";
        # the file name encodes the project name, bug number, and class name.
        tmp = line_bug_info[idx]["parent_id"].split("\\")[-1].split("/")[1].split(".")[0].split("_")
        bug_id = "-".join(tmp[0:2])
        bug_class = tmp[2]
        start_line = method_bug_info[idx]["BLine_buggy"]
        end_line = method_bug_info[idx]["ELine_buggy"]
        bug_method = method_bug_info[idx]["methodname"]

        if bug_id not in bug_info:
            bug_info[bug_id] = {}

        bug_info[bug_id][hash_id] = {"bug_class": bug_class,
                                     "bug_method": bug_method,
                                     "start_line": start_line,
                                     "end_line": end_line}
    return bug_info

def main():
    line_bug_info = json.load(codecs.open("meta_info/Binfo_d4j.json", 'r', encoding='utf8'))
    method_bug_info = json.load(codecs.open("meta_info/Minfo_d4j.json", 'r', encoding='utf8'))
    bug_info = load_info(line_bug_info, method_bug_info)

    # Each line of d4j.ids.new is split on "_" and the second field is taken as the hunk hash id.
    d4j_ids = []
    with open("d4j.ids.new", "r") as f:
        for line in f:
            d4j_ids.append(line.strip().split("_")[1])

    # Count bugs that have at least one hunk in d4j.ids.new but not all of them.
    cnt = 0
    al = []
    for bug_id, info in bug_info.items():
        is_valid = True   # all hunks of this bug appear in d4j.ids.new
        at_least = False  # at least one hunk of this bug appears in d4j.ids.new
        for hunk_id in info.keys():
            if hunk_id not in d4j_ids:
                is_valid = False
            else:
                at_least = True

        if not is_valid and at_least:
            al.append(bug_id)
            cnt += 1
    print(al)

if __name__ == "__main__":
    main()

kwz219 commented 1 year ago

Hi, @thanhlecongg, thank you for pointing that out. I remember that when constructing the d4j ids, I only collected the ids with type "replace", so hunks with other types were excluded. For convenience, I didn't exclude ids that are insufficient on their own to fix a bug and just translated them all. When applying patches to the d4j projects, I first check whether the bug can be fixed (all of its id hunks have generated patches); if the bug can't be fixed, I just skip the validation for this bug. Yes, the number of actually validated bugs should be 191, not 260 (but I forgot to check this). And some multi-hunk bugs such as Cli_31 and Gson_4 that were counted as fixed should only be partially fixed, and I will correct the data. Very much appreciate it!
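
For completeness, a minimal sketch of that check (names are illustrative, not the framework's API): a bug only goes to test-case validation when every one of its hunk ids has a generated patch.

def can_validate(bug_hunk_ids, patched_hunk_ids):
    # Skip validation unless every hunk of the bug has at least one candidate patch.
    return all(hunk_id in patched_hunk_ids for hunk_id in bug_hunk_ids)

print(can_validate({"id1", "id2", "id3"}, {"id1", "id3"}))  # False -> validation skipped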

kwz219 commented 1 year ago

@thanhlecongg , by the way, my way of processing Defects4J data is not ideal, since some NPR systems can also handle other types of bugs beyond "one-line replacement" bugs. So now I have found a better way to parse d4j: prepare block-level hunks rather than line-level hunks. Then nearly all bugs of Defects4J can be parsed into forms that can be translated by NPR systems. I shared the block-level file in https://drive.google.com/drive/folders/14sPk9WM2oHEklS7Na4G_PYSeKNEGqqxc. Hope this can help you.

thanhlecongg commented 1 year ago

Thank you for your kind explanation and for sharing. They help me a lot.

thanhlecongg commented 1 year ago

Hi, I just found a problem with your data for the bug 'JacksonDatabind-60'. In your dataset, there is only one changed hunk, i.e. 61a8cca58009e7c4a5d3d60b. However, when I manually compared the fixed version to the buggy version, I found that there is much more added code than that hunk. As a result, when I patch the buggy version using your ground truth, the program cannot even be compiled, as TypeSerializerRerouter, which appears in the fixed code, does not exist in the buggy version. Similar to this case, I found the following bugs with the same problem: ['JacksonDatabind-60', 'Compress-34', 'Compress-42', 'Mockito-30', 'Compress-43', 'Mockito-21', 'Closure-97', 'Mockito-31', 'Closure-16', 'JacksonDatabind-110', 'Mockito-10', 'Math-66', 'Gson-3', 'JxPath-11', 'Closure-64', 'Lang-46', 'Cli-10', 'Closure-3', 'Compress-39', 'Mockito-32', 'Math-15', 'Closure-127', 'JacksonDatabind-75', 'Mockito-19', 'Time-10']. Could you please kindly check these cases and advise? Many thanks.

thanhlecongg commented 1 year ago

Besides, we also found another problem when testing the bug Lang-4. Particularly, in your metadata, I can find only one changed hunk, i.e.,

  "_id": "D:\\DDPR_DATA\\Defects4j\\BF_Rename/Lang_4_LookupTranslator.buggy@public int translate(final CharSequence input, final int index, final Writer out) throws IOException",
  "methodname": "translate",
  "commitID": "defects4j_Lang_4_LookupTranslator",
  "BLine_buggy": 68,
  "Bline_fix": 68,
  "ELine_buggy": 84,
  "Eline_fix": 84,
  "buggy_file": "D:\\DDPR_DATA\\Defects4j\\BF_Rename/Lang_4_LookupTranslator.buggy",
  "fix_file": "D:\\DDPR_DATA\\Defects4j\\BF_Rename/Lang_4_LookupTranslator.fix"

However, following defects4j-dissection, Lang-4 should be fixed with three hunks. As a result, when I patch the buggy version using your ground truth, the program still fails on the test cases. Note that all three hunks are one-line replacement fixes, so I think the missing hunks should not have been removed by your preprocessing. I wonder whether your metadata is missing something or I'm misunderstanding something. Could you also check this case? Many thanks.

kwz219 commented 1 year ago

Hi @thanhlecongg, many thanks for your questions! You mention two problems:

(1) Patches of some multi-hunk bugs are not correct. Yes, you're right. We found we made a mistake when processing the data, and as a result we wrongly recognized some multi-hunk bugs as one-hunk bugs. When evaluating patches, we first check whether a patch is identical to the developer patch (if so, we do not run the test cases for that bug), so some patches were wrongly labeled as correct. The current version of evaluate_results.zip contains some mistakes. The good news is that, after this problem was found, we re-ran and re-evaluated the experiments considering more NPR systems and more candidates (up to 300); the latest manually checked results can be found here: https://docs.google.com/spreadsheets/d/11oUYyEiMnDfHRONSrB9hY1smXcrroJSN/edit?usp=sharing&ouid=116802316915888919937&rtpof=true&sd=true The results in this sheet should be more accurate.

(2) Missing meta-data. To be honest, I'm not sure whether there is a bug in my d4j pre-processing method (which uses JavaParser); I need to check my code further, so the meta file may also contain some errors. For a more accurate meta file, please use https://drive.google.com/file/d/1DLvu8NCdhzUHNWvUOG2ywlne3rqPOlkB/view?usp=sharing (we use a block-level instead of a line-level parse).

Hope my answers help. Again, thanks for your questions. I believe they are important for helping me improve this framework and refine the evaluation results. So please feel free to contact me again if you find any other problems!

thanhlecongg commented 1 year ago

Many thanks for your kind explanation and advice. I really appreciate it.