mozilla / bugbug

Platform for Machine Learning projects on Software Engineering
Mozilla Public License 2.0
504 stars 311 forks source link

[code_review] Use past review comments for finetuning #4387

Open marco-c opened 3 months ago

marco-c commented 3 months ago

Past review comments are sometimes very short and might lack enough context to be good examples for finetuning a LLM.

We could prompt GPT-4 with the patch and the provided comment, asking it to explain how it got there, and then finetune a smaller LLM to learn what GPT-4 did.

For example, I prompted GPT-3.5 with:

You are an expert reviewer for the Mozilla Firefox source code. You reviewed the given patch and provided a constructive code review comment for the hunk.

### Patch:
diff --git a/docshell/base/nsDocShell.cpp b/docshell/base/nsDocShell.cpp
--- a/docshell/base/nsDocShell.cpp
+++ b/docshell/base/nsDocShell.cpp
@@ -8264,13 +8264,24 @@
     return NS_OK;
   }

   // Check if the caller is from the same origin as this docshell,
   // or any of its ancestors.
-  nsCOMPtr<nsIDocShellTreeItem> item(this);
+  RefPtr<BrowsingContext> bc(mBrowsingContext);
   do {
-    nsCOMPtr<nsIScriptGlobalObject> sgo = do_GetInterface(item);
+    // If the given BrowsingContex is not in process, then it
+    // is true by construction that its principal will not
+    // subsume the current docshell principal.
+    if (!bc->IsInProcess()) {
+      bc = bc->GetParent();
+      continue;
+    }
+
+    nsCOMPtr<nsIDocShell> docShell(bc->GetDocShell());
+    MOZ_ASSERT(docShell);
+
+    nsCOMPtr<nsIScriptGlobalObject> sgo = docShell->GetScriptGlobalObject();
     nsCOMPtr<nsIScriptObjectPrincipal> sop(do_QueryInterface(sgo));

     nsIPrincipal* p;
     if (!sop || !(p = sop->GetPrincipal())) {
       return NS_ERROR_UNEXPECTED;
@@ -8279,14 +8290,12 @@
     if (nsContentUtils::SubjectPrincipal()->Subsumes(p)) {
       // Same origin, permit load
       return NS_OK;
     }

-    nsCOMPtr<nsIDocShellTreeItem> tmp;
-    item->GetInProcessSameTypeParent(getter_AddRefs(tmp));
-    item.swap(tmp);
-  } while (item);
+    bc = bc->GetParent();
+  } while (bc);

   return NS_ERROR_DOM_PROP_ACCESS_DENIED;
 }

 //*****************************************************************************

### Hunk:
@@ -8264,13 +8264,24 @@
     return NS_OK;
   }

   // Check if the caller is from the same origin as this docshell,
   // or any of its ancestors.
-  nsCOMPtr<nsIDocShellTreeItem> item(this);
+  RefPtr<BrowsingContext> bc(mBrowsingContext);
   do {
-    nsCOMPtr<nsIScriptGlobalObject> sgo = do_GetInterface(item);
+    // If the given BrowsingContex is not in process, then it
+    // is true by construction that its principal will not
+    // subsume the current docshell principal.
+    if (!bc->IsInProcess()) {
+      bc = bc->GetParent();
+      continue;
+    }
+
+    nsCOMPtr<nsIDocShell> docShell(bc->GetDocShell());
+    MOZ_ASSERT(docShell);
+
+    nsCOMPtr<nsIScriptGlobalObject> sgo = docShell->GetScriptGlobalObject();
     nsCOMPtr<nsIScriptObjectPrincipal> sop(do_QueryInterface(sgo));

     nsIPrincipal* p;
     if (!sop || !(p = sop->GetPrincipal())) {
       return NS_ERROR_UNEXPECTED;

### Comment:
This can become `for (RefPtr<BrowsingContext> bc = mBrowsingContext; bc; bc = bc->GetParent())`

In addition to the patch, what information did you look for in order to provide this comment?
Mention anything you had to do, for example expanding the context of the patch, looking for a function definition or implementation, looking for the documentation of a class, looking for other usages of the same functions in other parts of the code base, and so on.

And it replied with:

To provide the constructive code review comment for the given hunk, I examined the context around the code changes and identified an opportunity for improvement. Here's a breakdown of the information I looked for:

1. **Understanding the Code Context:**
   - I analyzed the provided patch to understand the changes made to the `nsDocShell.cpp` file.
   - I identified the loop structure checking the origin and ancestors of the docshell using a `do-while` loop.

2. **Understanding BrowsingContext Usage:**
   - I noticed the introduction of a `RefPtr<BrowsingContext> bc(mBrowsingContext);` to reference the `BrowsingContext` associated with the current docshell.

3. **Identifying Redundant Code:**
   - I observed that the loop was using `bc = bc->GetParent();` and then retrieving the `nsIDocShell` associated with the current `BrowsingContext`. This process repeats until there is no parent (`bc`) left.

4. **Suggesting Code Improvement:**
   - I suggested a more concise and readable way to express the loop using a `for` loop. The suggested code uses a single line to initialize, check the condition, and update the `BrowsingContext`.

5. **Ensuring Correctness:**
   - I made sure that the suggested modification retains the intended logic and behavior of the original code.

In summary, the provided comment aims to improve code readability and maintainability by suggesting a more concise loop structure using a `for` loop. The suggested change does not alter the logic of the code but enhances its clarity.

It isn't perfect, but it was just a quick test. WIth GPT-4 and with more structure we could have something better.

marco-c commented 2 weeks ago

Using the accepted examples (+ human comments? + synthetic comments from bug fixes?) as input for the finetuning. Use preference optimization to also learn from the rejected examples? We should find out what is the cost of hosting the finetuned models.