microsoft / CodeBERT

MIT License

Issues in CodeRefinement dataset #294

Open damengdameng opened 10 months ago

damengdameng commented 10 months ago

Hi, first of all, thank you very much for your wonderful work! I'm currently trying to train a code review model using the CodeRefinement subset of the CodeReviewer dataset, but I keep running into the same problem with the old_hunk field in the CodeRefinement data. Here is an example of an old_hunk entry from the dataset:

@@ -35,19 +35,28 @@ type metricsCollector struct {

 func (m metricsCollector) ProgramParsed(location common.Location, duration time.Duration) {
    if m.MetricsCollector != nil {
-       m.parsed += duration
+       // only capture tx parsing time
+       if _, ok := location.(common.TransactionLocation); ok {

According to the unified diff syntax used by git, the header @@ -35,19 +35,28 @@ means that this hunk spans 19 lines of the old file and 28 lines of the new file, both starting at line 35. But the hunk apparently contains only 5 lines. What is the reason for this? Was the data intentionally processed this way? I'm very much looking forward to your reply. Thank you!
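To make the mismatch concrete, here is a minimal Python sketch that compares the line counts declared in a hunk header with the lines actually present in the hunk body. The function names (parse_hunk_header, count_hunk_lines, is_truncated) are hypothetical helpers written for this illustration and are not part of the CodeReviewer code. The sketch assumes each old_hunk is a single unified diff hunk stored as a string, and that context lines keep their leading space.

```python
import re

# Unified diff hunk header: @@ -old_start,old_len +new_start,new_len @@
# The ",len" part is optional and defaults to 1 when omitted.
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line):
    """Return the start lines and declared lengths from a hunk header."""
    match = HUNK_HEADER.match(line)
    if match is None:
        raise ValueError(f"not a hunk header: {line!r}")
    old_start, old_len, new_start, new_len = match.groups()
    return {
        "old_start": int(old_start),
        "old_len": int(old_len) if old_len is not None else 1,
        "new_start": int(new_start),
        "new_len": int(new_len) if new_len is not None else 1,
    }


def count_hunk_lines(hunk):
    """Count the old-file and new-file lines actually present in the body.

    Lines starting with "-" belong to the old file only, lines starting
    with "+" to the new file only, and lines starting with a space are
    context shared by both. Completely empty lines are skipped because
    it is ambiguous whether they are context or extraction residue.
    """
    old_count = 0
    new_count = 0
    for line in hunk.splitlines()[1:]:  # skip the header line
        if line.startswith("-"):
            old_count += 1
        elif line.startswith("+"):
            new_count += 1
        elif line.startswith(" "):
            old_count += 1
            new_count += 1
    return old_count, new_count


def is_truncated(hunk):
    """True if the body has fewer lines than the header declares."""
    header = parse_hunk_header(hunk.splitlines()[0])
    old_count, new_count = count_hunk_lines(hunk)
    return old_count < header["old_len"] or new_count < header["new_len"]


# The hunk quoted in this issue.
example_hunk = "\n".join([
    "@@ -35,19 +35,28 @@ type metricsCollector struct {",
    " func (m metricsCollector) ProgramParsed(location common.Location, duration time.Duration) {",
    "    if m.MetricsCollector != nil {",
    "-       m.parsed += duration",
    "+       // only capture tx parsing time",
    "+       if _, ok := location.(common.TransactionLocation); ok {",
])

print(parse_hunk_header(example_hunk.splitlines()[0]))
print(count_hunk_lines(example_hunk))
print(is_truncated(example_hunk))
```

For the example above, the header declares 19 old-file lines and 28 new-file lines, while the body holds only 3 old-file lines and 4 new-file lines, so the check reports the hunk as truncated. Running the same check over the whole subset would show how many old_hunk entries are cut short in this way.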