rjust / defects4j

A Database of Real Faults and an Experimental Infrastructure to Enable Controlled Experiments in Software Engineering Research
MIT License
704 stars 299 forks source link

Are comments giving away bug location/fix hints? #588

Closed furunkel closed 2 weeks ago

furunkel commented 1 month ago

I'm not sure if these patch files are the ones actually used. If so, I've found that in some of them comments that I think were introduced together with the fix also appear in the "buggy version". Note that I'm absolutely not sure this is really the case, it just seems like it.

As an example, in Lang#7 (below) I think the comment between lines 20 and 23 is part of the context, while I think it should be part of the patch. https://github.com/rjust/defects4j/blob/e85036c044320f11df22d3f23f248fdd4e99a41b/framework/projects/Lang/patches/7.src.patch#L19-L24

Should these comments be in the original version (again, I might be mistaken) then repair and localization models might use these comments as a shortcut?

furunkel commented 1 month ago

There seem to be quite a few other cases that I find strange (for now just looked at the Time project). I tried doing a checkout with the D4J scripts for a couple of them and the buggy version indeed showed indentation and/or comments that hint at the bug location. E.g., the in Time#3, the setMillis call is indented to a level that suggests it should be wrapped by an if. I guess LLMs can easily pick these things up?

Similarly, in other cases, such as Time#12, there appear comments (with or without suggestive indentation) that seem to belong to the insertion. This is how my buggy checkout of Time#12 looks, the //handle years in era BC comment seems out of place.

    @SuppressWarnings("deprecation")
    public static LocalDateTime fromDateFields(Date date) {
        if (date == null) {
            throw new IllegalArgumentException("The date must not be null");
        }
            // handle years in era BC
        return new LocalDateTime(
            date.getYear() + 1900,
            date.getMonth() + 1,
            date.getDate(),
            date.getHours(),
            date.getMinutes(),
            date.getSeconds(),
            (((int) (date.getTime() % 1000)) + 1000) % 1000
        );
    }

https://github.com/rjust/defects4j/blob/e85036c044320f11df22d3f23f248fdd4e99a41b/framework/projects/Time/patches/3.src.patch#L38-L42

https://github.com/rjust/defects4j/blob/e85036c044320f11df22d3f23f248fdd4e99a41b/framework/projects/Time/patches/27.src.patch#L8-L14

https://github.com/rjust/defects4j/blob/e85036c044320f11df22d3f23f248fdd4e99a41b/framework/projects/Time/patches/21.src.patch#L25-L42

I guess the bug comment (// bug [3264409]) should be part of the insertion? https://github.com/rjust/defects4j/blob/e85036c044320f11df22d3f23f248fdd4e99a41b/framework/projects/Time/patches/22.src.patch#L9-L17

https://github.com/rjust/defects4j/blob/e85036c044320f11df22d3f23f248fdd4e99a41b/framework/projects/Time/patches/17.src.patch#L30-L40

https://github.com/rjust/defects4j/blob/e85036c044320f11df22d3f23f248fdd4e99a41b/framework/projects/Time/patches/12.src.patch#L50-L55

https://github.com/rjust/defects4j/blob/e85036c044320f11df22d3f23f248fdd4e99a41b/framework/projects/Time/patches/14.src.patch#L9-L14

mernst commented 1 month ago

You are right that comments associated with the bug fix may already appear in the code. And you are right that this might unfairly aid repair and localization models that pay attention to comments or indentation.

The bug description says:

Defects4J was designed with code-focused techniques in mind, and so it isn't perfect for all tasks.

To address the issues you raised, you could reformat the code and/or remove the comments that were added in the commit, or all comments. These are all automatable. If you do so, we would accept a pull request with the changes and/or a tool that other people could use.

furunkel commented 2 weeks ago

Thanks for clarifying!