carlosparadis closed this issue 11 months ago
Bug is fixed on commit 1048efc, with the unit tests that captured the problem also passing :)
```r
> file_churn <- metric_file_churn(project_git)
> file_churn[file_pathname == "camel-core/src/main/java/org/apache/camel/CamelContextAware.java"]
                                                      file_pathname file_churn
1: camel-core/src/main/java/org/apache/camel/CamelContextAware.java         33
> file_bug_churn <- metric_file_bug_churn(project_git, jira_issues)
> file_non_bug_churn <- metric_file_non_bug_churn(project_git, jira_issues)
> file_bug_churn[file_pathname == "camel-core/src/main/java/org/apache/camel/CamelContextAware.java"]
Empty data.table (0 rows and 2 cols): file_pathname,file_bug_churn
> file_non_bug_churn[file_pathname == "camel-core/src/main/java/org/apache/camel/CamelContextAware.java"]
                                                      file_pathname file_non_bug_churn
1: camel-core/src/main/java/org/apache/camel/CamelContextAware.java                 33
```
Previously, the file above reported 66 due to the duplication of the issue, so its churn was counted twice.
This was observed on Camel 1.6 in a number of instances. I will use CAMEL-68 as an example. The issue actually lies in `parse_jira`, rather than in the metric itself, so it took some time to identify the source of the problem: https://github.com/sailuh/kaiaulu/blob/7566f4ef50a0cd55eff47eeade3d12f186d143f0/R/parser.R#L859
One issue may have one or more components. While this line of code accounts for multiple components, it does so by returning an array of strings. One step is missing here: the array of strings has to be collapsed into a single string (e.g. using `stringi::stri_c(variable, collapse = ";")`) so the vector is coalesced into a single value.
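A minimal sketch of the missing collapse step (using base `paste()` here to keep the snippet dependency-free; `stringi::stri_c(x, collapse = ";")` behaves the same, and the `components` vector is a made-up stand-in for what the parser extracts):

```r
# Hypothetical components vector, as extracted for one issue:
components <- c("camel-core", "camel-spring")

# Collapse the vector into one scalar before it is placed in the
# issue table's cell; stringi::stri_c(components, collapse = ";")
# is equivalent:
components_flat <- paste(components, collapse = ";")
components_flat
# [1] "camel-core;camel-spring"
```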
Because this collapse does not occur, we end up with a table where a single cell can hold two or more values. data.table's default behavior in this case is not to throw an error, but to duplicate the rows.
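A dependency-free sketch of this behavior (base `data.frame` recycles length-1 values against longer columns the same way `data.table` does here):

```r
# One issue key (length 1) combined with two components (length 2):
issue <- data.frame(
  issue_key = "CAMEL-68",
  component = c("camel-core", "camel-spring")
)
nrow(issue)
# [1] 2  -- the single issue now occupies two rows
```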
So, for example, if we were to request only the CAMEL-68 issue, the resulting issue table (let's call it the `jira_issues` table obtained by `parse_jira()`), because the issue has two components (camel-core and camel-spring), would contain two rows that differ only in the component value.

When the `metric` module takes this type of table as input, it assumes every row is one unique issue. As such, to calculate metrics associated with issues, the `project_git` table (in which every row is a commit with a potentially parsed ISSUE-ID) is left joined to the table above. If the assumption of one unique issue per row held, the operation would only add the associated extra columns to every commit. For example, we need `issue_status == CLOSED` and `issue_type == BUG` for `file_bug_churn`. However, since an issue can yield as many rows as it has components assigned, the commit rows are repeated once per component reported in the issue. This obviously inflates the number of commits and, as a consequence, the churn associated with them.
The `file_churn` metric does not rely on issue information (it merely looks at `lines_added` and `lines_removed` from `project_git`), hence it is not affected by this. To continue with the example, on Camel 1.6 (which can be re-created using Kaiaulu's project configuration of Camel, setting the branch field to 1.6), we can observe the problem in the console output at the top of this comment.

There, the actual file churn obtained by the `file_churn` function (which does not rely on issue data) was 33. The only issue this file relates to is CAMEL-68, which is not a bug and is closed. Hence, `file_bug_churn` is correct to be 0, as the issue involved is not a bug. However, note that `file_non_bug_churn` used to end up duplicated (66, as noted above) because of the double row for CAMEL-68 in the `jira_issues` table. This leads to the condition `file_churn` < `file_bug_churn` + `file_non_bug_churn`.
Since the commits are being duplicated here, metrics such as `file_bug_frequency` are also compromised, as they rely on the count of rows in the commits table. We never implemented a `bug_count` metric equivalent to `file_churn` (one that does not rely on issue data), so the situation was never observed there for lack of a baseline to compare against.

**TL;DR**
Currently, all 4 metrics, `file_bug_churn`, `file_non_bug_churn`, `file_bug_frequency`, and `file_non_bug_frequency`, are inflated based on the number of components the issue had.

I am working on a set of unit tests for these functions, including one for this case, to prevent this from happening in the future. #228 will also greatly help prevent cases like this, as this is fundamentally a unit test for a `parser` function that requires raw data to be evaluated.