sailuh / kaiaulu

An R package for mining software repositories
http://itm0.shidler.hawaii.edu/kaiaulu
Mozilla Public License 2.0
18 stars 12 forks source link

When a JIRA issue has more than one component, file_churn < file_bug_churn + file_non_bug_churn (also affects file_bug_frequency and non_bug_frequency) #244

Closed carlosparadis closed 11 months ago

carlosparadis commented 11 months ago

This was observed on Camel 1.6 in a number of instances. I will use CAMEL-68 as an example. The issue actually lies in parse_jira, rather than the metric itself, so it took some time to identify the source of the problem:

https://github.com/sailuh/kaiaulu/blob/7566f4ef50a0cd55eff47eeade3d12f186d143f0/R/parser.R#L859

One issue may have one or more components. While this line of code accounts for multiple components, it does so returning an array of strings. One step here is missing: The array of strings has to be collapsed into a single string (e.g. using `stringi::stri_c(variable,collapse=";")) so the vector is coalesced into a single value.

Because this does not occur, we end up with a table where one of its cell can have 2 or more values. The R data.table default behavior to this is not to throw an error, but rather duplicate the rows.

So for example, if we were to request only CAMEL-68 issue, our resulting issue table, as it has two components (camel-core, and camel-spring), would duplicate the rows only modifying said value as follows (let's call it the jira_issues table obtained by parse_jira()):

issue_key issue_summary issue_type issue_status issue_components issue_description issue_created_datetimetz issue_updated_datetimetz issue_resolution_datetimetz issue_creator_id issue_creator_name issue_creator_timezone issue_assignee_id issue_assignee_name issue_assignee_timezone issue_reporter_id issue_reporter_name issue_reporter_timezone issue_resolution issue_tz
CAMEL-68 add a CamelContextAware interface to expose the desire to be injected with a CamelContext (like the ApplicationContextAware in Spring) New Feature Closed camel-core   2007-07-08T06:07:06.000+0000 2008-05-12T08:01:39Z 2007-08-13T19:12:33.000+0000 jstrachan James Strachan Etc/UTC       jstrachan James Strachan Etc/UTC Fixed 0
CAMEL-68 add a CamelContextAware interface to expose the desire to be injected with a CamelContext (like the ApplicationContextAware in Spring) New Feature Closed camel-spring   2007-07-08T06:07:06.000+0000 2008-05-12T08:01:39Z 2007-08-13T19:12:33.000+0000 jstrachan James Strachan Etc/UTC       jstrachan James Strachan Etc/UTC Fixed 0

When the metric module takes this type of table as input, it assumes every row is one unique issue. As such, to calculate metrics associated to issues, the project_git table (which every row is a commit with a potentially parsed ISSUE-ID) is left joined to the table above. If the assumption of 1 unique issue held, then the operation would only add the associated extra columns information to every commit. For example, we need the issue_status == CLOSED and the issue_type == BUG for the file_bug_churn. However, since we can have as many rows as there are components assigned to it in an issue, then for those cases, the commit rows are repeated to as many components as were reported in the issue.

This obviously will inflate the number of commits, and as a consequence the churn associated to them.

The file_churn metric does not rely on issue information (it merely looks at the lines_added and lines_removed from project_git. Hence, it is not affected by this. To continue on the example, on Camel 1.6 (which can be re-created using Kaiaulu project configuration of Camel and using the branch field as 1.6), we can observe the problem:

file_pathname file_churn file_bug_churn file_non_bug_churn
camel-core/src/main/java/org/apache/camel/CamelContextAware.java 33 0 66

Here, the actual file churn obtained by file_churn function (which does not rely on issue data) was 33. The only issue this file relates to is CAMEL-68, which is not a bug and s closed. Hence, the file_bug_churn is correct to be 0, as the issue involved is not a bug. However, note file_non_bug_churn ends up duplicated, because of the double row on the table above for CAMEL-68.

This leads to the condition of file_churn < file_bug_churn + file_non_bug_churn

Since the commits are being duplicated here, that means metrics such as file_bug_frequency are also compromised, as they rely on the count of commits table. We never implemented a bug_count metric equivalent to file_churn so the situation was never observed there due to that.

TL;DR

Currently, all 4 metrics file_bug_churn, file_non_bug_churn, file_bug_frequency and file_non_bug_frequency are inflated based on the number of components the issue had.

I am working on a set of unit tests for these functions, including for this case to prevent this to happen in the future. The #228 will greatly help prevent cases like this in the future, as this is fundamentally a unit test for a parser function that requires raw data to be evaluated.

carlosparadis commented 11 months ago

Bug is fixed on commit 1048efc with unit testing that captured the problem also passing :)

 > file_churn <- metric_file_churn(project_git)

> file_churn[file_pathname == "camel-core/src/main/java/org/apache/camel/CamelContextAware.java"]
                                                      file_pathname file_churn
1: camel-core/src/main/java/org/apache/camel/CamelContextAware.java         33

> file_bug_churn <- metric_file_bug_churn(project_git,jira_issues)
> file_non_bug_churn <- metric_file_non_bug_churn(project_git,jira_issues)
> file_bug_churn[file_pathname == "camel-core/src/main/java/org/apache/camel/CamelContextAware.java"]

Empty data.table (0 rows and 2 cols): file_pathname,file_bug_churn

> file_non_bug_churn[file_pathname == "camel-core/src/main/java/org/apache/camel/CamelContextAware.java"]
                                                      file_pathname file_non_bug_churn
1: camel-core/src/main/java/org/apache/camel/CamelContextAware.java                 33

Previously, the file above reported 66 due to the duplication of the issue so churn was counted twice.