sailuh / kaiaulu

An R package for mining software repositories
http://itm0.shidler.hawaii.edu/kaiaulu
Mozilla Public License 2.0

Mapping and Renaming of 3rd Party Tools column names to Kaiaulu column names #243

Open carlosparadis opened 1 year ago

carlosparadis commented 1 year ago

When Kaiaulu was first being written, the parser_* functions tried to preserve the original tool column names, so that anyone wanting to know more about a column could simply look up its definition in the tool's documentation. Now that Kaiaulu interfaces with multiple tools, some of which may even collect the same data under different definitions, this has become more complicated. In addition, some column names are simply unclear or do not follow SE literature conventions (e.g. the SCC tool calls the metric LOC as lines).

Because of that, we should come up with a consistent nomenclature for the data we care about. Eventually, I hope this can be documented in a database schema such as a .mwb with all the relationships, but the Kaiaulu GitHub wiki should be helpful for iterating on suggestions before moving changes to the API and Notebooks.

Also, the following suggestions were requested by @rnkazman in an e-mail titled "[SEWORLD] CFP: Information and Software Technology Special Issue on Application of causal modeling and inference methods in software engineering: Approaches, Challenges, State-of-the-Art and Prospects" as an initial step towards the goal of this issue:

SCC

code -> loc

I will create a wiki page listing the other column names obtained from SCC, so we can reach a final decision on what those columns should be called from now on as well.
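To make the renaming idea concrete, here is a minimal sketch of a per-tool mapping from original column names to Kaiaulu names. It is written in Python purely for illustration and is not Kaiaulu's R API: `COLUMN_RENAMES`, `rename_columns`, and the sample row are hypothetical. Only the `code -> loc` entry comes from this issue; the remaining SCC columns are still undecided, so they pass through unchanged.

```python
# Hypothetical per-tool mapping: original tool column name -> Kaiaulu column name.
# Only "code" -> "loc" is taken from this issue; other SCC columns are undecided.
COLUMN_RENAMES = {
    "scc": {"code": "loc"},
}


def rename_columns(rows, tool, renames=COLUMN_RENAMES):
    """Return copies of the rows with the tool's columns renamed.

    Columns without a mapping entry keep their original tool name.
    """
    mapping = renames.get(tool, {})
    return [
        {mapping.get(name, name): value for name, value in row.items()}
        for row in rows
    ]


# Made-up sample row; the column names other than "code" are placeholders.
scc_rows = [{"file": "R/metric.R", "code": 120, "complexity": 7}]
renamed = rename_columns(scc_rows, "scc")
```

Keeping the mapping in one table per tool would let the wiki discussion and the code share a single list of agreed names, while unmapped columns remain traceable to the tool's own documentation.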

Motif Metrics

anti_motif_square -> anti_square_motif

Outcome Metrics

file_bug_frequency -> file_bug_changes (considering calling it instead file_bug_commit_count)

file_non_bug_frequency -> file_nonbug_changes (considering calling it instead file_non_bug_commit_count)

No changes were requested for these two, but for consistency with the renaming above I was considering the following:

file_bug_churn -> file_bug_commit_churn

file_non_bug_churn -> file_non_bug_commit_churn

All four of the bug and non-bug metrics above use a table of commits. They differ in a) whether the commit links to a bug or not, and b) how we aggregate the commits: either by counting the number of rows or by summing the churn column. Hence, I feel that names which explicitly include the word commit and the aggregation statistic (i.e. count or churn) are clearer.

The docs should also make clearer that the code only considers the counts of issues that contain an issue-id label. That is, the function docs should state in words what this line of code does:

https://github.com/sailuh/kaiaulu/blob/7566f4ef50a0cd55eff47eeade3d12f186d143f0/R/metric.R#L87

After all, we can only determine whether a commit is or is not a bug if we know which issue the commit refers to. If we don't, we filter it out.
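The aggregation described for the four outcome metrics can be sketched as follows. This is a Python illustration under stated assumptions, not the code in R/metric.R: the row keys `file`, `churn`, `issue_id`, and `is_bug` and the function `file_outcome_metrics` are hypothetical, and the output names are the candidate names from this issue. Commits without an issue id are dropped first, then each remaining commit contributes one row to the count and its churn to the sum.

```python
from collections import defaultdict


def file_outcome_metrics(commits):
    """Aggregate per-file bug and non-bug commit counts and churn.

    Each commit is a dict with hypothetical keys: "file", "churn",
    "issue_id" (None when the commit has no issue id) and "is_bug".
    """
    metrics = defaultdict(lambda: {
        "file_bug_commit_count": 0,
        "file_non_bug_commit_count": 0,
        "file_bug_commit_churn": 0,
        "file_non_bug_commit_churn": 0,
    })
    for commit in commits:
        # Without an issue id we cannot tell bug from non-bug, so skip the commit.
        if commit["issue_id"] is None:
            continue
        kind = "bug" if commit["is_bug"] else "non_bug"
        row = metrics[commit["file"]]
        row[f"file_{kind}_commit_count"] += 1                 # count of rows
        row[f"file_{kind}_commit_churn"] += commit["churn"]   # sum of churn
    return dict(metrics)


# Made-up commits for illustration only.
commits = [
    {"file": "a.R", "churn": 10, "issue_id": "PROJ-1", "is_bug": True},
    {"file": "a.R", "churn": 5, "issue_id": "PROJ-2", "is_bug": False},
    {"file": "a.R", "churn": 7, "issue_id": None, "is_bug": False},
    {"file": "b.R", "churn": 3, "issue_id": "PROJ-1", "is_bug": True},
]
result = file_outcome_metrics(commits)
```

In this sketch the third commit is ignored entirely, so it is counted as neither bug nor non-bug, which is the behavior the function docs would need to spell out.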

rnkazman commented 1 year ago

Seems reasonable to me.