When a JIRA issue has more than one component, file_churn < file_bug_churn + file_non_bug_churn (also affects file_bug_frequency and file_non_bug_frequency)

This was observed on Camel 1.6 in a number of instances. I will use CAMEL-68 as an example. The issue actually lies in parse_jira, rather than the metric itself, so it took some time to identify the source of the problem:

https://github.com/sailuh/kaiaulu/blob/7566f4ef50a0cd55eff47eeade3d12f186d143f0/R/parser.R#L859

One issue may have one or more components. While this line of code accounts for multiple components, it does so by returning a vector of strings. One step is missing: the vector has to be collapsed into a single string (e.g. using `stringi::stri_c(variable, collapse=";")`) so that it is coalesced into a single value.

Because this does not happen, we end up with a table in which one cell can hold two or more values. The default behavior of R's data.table in this case is not to throw an error, but rather to duplicate the row.
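The recycling behavior can be reproduced in isolation. This is a minimal sketch, not the actual parse_jira code: the `components` vector is a hypothetical stand-in for what the parser extracts from one issue, and only the two relevant columns are shown.

```r
library(data.table)

# Hypothetical stand-in for the components parsed from a single issue
components <- c("camel-core", "camel-spring")

# The length-1 issue_key is recycled against the length-2 column: two rows
uncollapsed <- data.table(issue_key = "CAMEL-68",
                          issue_components = components)

# Collapsing first yields a single string, hence a single row
collapsed <- data.table(issue_key = "CAMEL-68",
                        issue_components = stringi::stri_c(components, collapse = ";"))

nrow(uncollapsed)            # 2
nrow(collapsed)              # 1
collapsed$issue_components   # "camel-core;camel-spring"
```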

For example, if we were to request only issue CAMEL-68, which has two components (camel-core and camel-spring), the resulting issue table would contain two rows that differ only in that value. Let's call this the jira_issues table obtained by parse_jira():

| issue_key | issue_summary | issue_type | issue_status | issue_components | issue_description | issue_created_datetimetz | issue_updated_datetimetz | issue_resolution_datetimetz | issue_creator_id | issue_creator_name | issue_creator_timezone | issue_assignee_id | issue_assignee_name | issue_assignee_timezone | issue_reporter_id | issue_reporter_name | issue_reporter_timezone | issue_resolution | issue_tz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CAMEL-68 | add a CamelContextAware interface to expose the desire to be injected with a CamelContext (like the ApplicationContextAware in Spring) | New Feature | Closed | camel-core | | 2007-07-08T06:07:06.000+0000 | 2008-05-12T08:01:39Z | 2007-08-13T19:12:33.000+0000 | jstrachan | James Strachan | Etc/UTC | | | | jstrachan | James Strachan | Etc/UTC | Fixed | 0 |
| CAMEL-68 | add a CamelContextAware interface to expose the desire to be injected with a CamelContext (like the ApplicationContextAware in Spring) | New Feature | Closed | camel-spring | | 2007-07-08T06:07:06.000+0000 | 2008-05-12T08:01:39Z | 2007-08-13T19:12:33.000+0000 | jstrachan | James Strachan | Etc/UTC | | | | jstrachan | James Strachan | Etc/UTC | Fixed | 0 |

When the metric module takes this type of table as input, it assumes that every row is one unique issue. To calculate issue-related metrics, the project_git table (in which every row is a commit with a potentially parsed ISSUE-ID) is left joined to the table above. If the one-row-per-issue assumption held, the join would only add the extra issue columns to every commit. For example, we need issue_status == CLOSED and issue_type == BUG for file_bug_churn. However, since an issue can have as many rows as it has components, each commit row linked to such an issue is repeated once per component.

This obviously will inflate the number of commits, and as a consequence the churn associated to them.

The file_churn metric does not rely on issue information (it merely looks at lines_added and lines_removed from project_git). Hence, it is not affected by this. To continue the example, on Camel 1.6 (which can be re-created using the Kaiaulu project configuration for Camel with the branch field set to 1.6), we can observe the problem:

| file_pathname | file_churn | file_bug_churn | file_non_bug_churn |
|---|---|---|---|
| camel-core/src/main/java/org/apache/camel/CamelContextAware.java | 33 | 0 | 66 |

Here, the actual file churn obtained by the file_churn function (which does not rely on issue data) was 33. The only issue this file relates to is CAMEL-68, which is closed and is not a bug. Hence, file_bug_churn is correctly 0. However, note that file_non_bug_churn is doubled (66 instead of 33) because of the duplicated CAMEL-68 row in the table above.

This leads to the condition of file_churn < file_bug_churn + file_non_bug_churn
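The mechanism can be reproduced on toy data. The sketch below is not the Kaiaulu metric code: it assumes churn is the sum of lines_added and lines_removed, the `commit_hash` column and its value are hypothetical, and the 30/3 split of the 33 lines is made up for illustration.

```r
library(data.table)

# Toy project_git: one commit touching one file (30 + 3 is an illustrative split of 33)
project_git <- data.table(
  commit_hash   = "abc123",  # hypothetical
  file_pathname = "camel-core/src/main/java/org/apache/camel/CamelContextAware.java",
  issue_key     = "CAMEL-68",
  lines_added   = 30L,
  lines_removed = 3L
)

# jira_issues as currently parsed: one row per component
jira_issues <- data.table(
  issue_key        = "CAMEL-68",
  issue_type       = "New Feature",
  issue_status     = "Closed",
  issue_components = c("camel-core", "camel-spring")
)

# Left join of commits to issues: the single commit now appears twice
joined <- merge(project_git, jira_issues, by = "issue_key", all.x = TRUE)

# Churn computed without issue data vs. churn computed on the joined table
file_churn <- project_git[, sum(lines_added + lines_removed)]
file_non_bug_churn <- joined[issue_type != "Bug" & issue_status == "Closed",
                             sum(lines_added + lines_removed)]

nrow(joined)         # 2
file_churn           # 33
file_non_bug_churn   # 66
```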

Since the commits are duplicated, metrics such as file_bug_frequency are also compromised, as they rely on commit counts. We never implemented a bug_count metric equivalent to file_churn, which is why the problem was never observed there.

TL;DR

Currently, all 4 metrics (file_bug_churn, file_non_bug_churn, file_bug_frequency and file_non_bug_frequency) are inflated in proportion to the number of components the issue has.

I am working on a set of unit tests for these functions, including one for this case, to prevent this from happening again. #228 will greatly help prevent cases like this, as this is fundamentally a unit test for a parser function, which requires raw data to be evaluated.
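As a rough idea of what such a test could assert, the sketch below checks the one-row-per-issue assumption directly. `has_unique_issue_keys` is a hypothetical helper, not an existing Kaiaulu function, and the tables are toy inputs rather than real parser output.

```r
library(data.table)

# Hypothetical helper: TRUE when every row of the table is a distinct issue
has_unique_issue_keys <- function(jira_issues) {
  anyDuplicated(jira_issues$issue_key) == 0L
}

# Table as currently produced for a two-component issue: fails the check
duplicated_issues <- data.table(issue_key = "CAMEL-68",
                                issue_components = c("camel-core", "camel-spring"))

# Table with components collapsed into one string: passes the check
collapsed_issues <- data.table(issue_key = "CAMEL-68",
                               issue_components = "camel-core;camel-spring")

has_unique_issue_keys(duplicated_issues)  # FALSE
has_unique_issue_keys(collapsed_issues)   # TRUE
```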
