Hi, first of all, thank you very much for your wonderful work!
I'm currently trying to train a code review model using the CodeRefinement subset of the CodeReviewer dataset, but I keep running into a problem with old_hunk in the CodeRefinement data.
Here is an example of an old_hunk from the dataset:
@@ -35,19 +35,28 @@ type metricsCollector struct {
func (m metricsCollector) ProgramParsed(location common.Location, duration time.Duration) {
if m.MetricsCollector != nil {
- m.parsed += duration
+ // only capture tx parsing time
+ if _, ok := location.(common.TransactionLocation); ok {
According to the unified diff syntax used by git, +35,28 in the hunk header means that this hunk covers 28 lines of oldf, starting at line 35. But this hunk apparently contains only 5 lines from oldf.
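The mismatch can be checked mechanically. The following is a minimal sketch, not part of the dataset tooling: the function name `count_hunk_lines` is hypothetical, and it assumes that every body line not starting with `+` or `-` is a context line (the context lines in the example above carry no leading space). It compares the line counts declared in the `@@` header with the lines actually present in the hunk body.

```python
import re

# Unified diff hunk header: @@ -old_start[,old_count] +new_start[,new_count] @@
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def count_hunk_lines(hunk: str) -> dict:
    """Compare the counts declared in a hunk header with the hunk body.

    Assumption: body lines that start with neither '+' nor '-' are context.
    """
    lines = hunk.splitlines()
    match = HUNK_HEADER.match(lines[0])
    if match is None:
        raise ValueError("first line is not a unified diff hunk header")

    # A missing count in the header defaults to 1.
    old_declared = int(match.group(2) or 1)
    new_declared = int(match.group(4) or 1)

    # Skip '\ No newline at end of file' markers, which are not file lines.
    body = [line for line in lines[1:] if not line.startswith("\\")]

    # The '-' side sees context and removed lines.
    old_actual = sum(1 for line in body if not line.startswith("+"))
    # The '+' side sees context and added lines.
    new_actual = sum(1 for line in body if not line.startswith("-"))

    return {
        "old_declared": old_declared,
        "old_actual": old_actual,
        "new_declared": new_declared,
        "new_actual": new_actual,
    }


# The old_hunk quoted above, reproduced as a string.
example_hunk = "\n".join([
    "@@ -35,19 +35,28 @@ type metricsCollector struct {",
    "func (m metricsCollector) ProgramParsed(location common.Location, duration time.Duration) {",
    "if m.MetricsCollector != nil {",
    "- m.parsed += duration",
    "+ // only capture tx parsing time",
    "+ if _, ok := location.(common.TransactionLocation); ok {",
])

result = count_hunk_lines(example_hunk)
print(result)
```

For this example the header declares 19 and 28 lines, while the body holds far fewer on either side, so the hunk looks truncated rather than miscounted.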
I would really like to understand the reason for this. Was the data intentionally processed this way? I look forward to your reply. Thank you!