sailuh / kaiaulu

An R package for mining software repositories
http://itm0.shidler.hawaii.edu/kaiaulu
Mozilla Public License 2.0

Kaiaulu Module Major Refactoring #241

Open carlosparadis opened 1 year ago

carlosparadis commented 1 year ago

Currently, download.R, parser.R, and network.R keep growing without bound as more downloaders, parsers, and network transformations are added. Moreover, some semantic conflicts can already be identified. For instance, dv8.R contains parser functions, since they pertain to interfacing with DV8, but they could also reasonably be expected to be found in parser.R. We also have a git.R module, but its gitlog parser lives in parser.R.

I am considering refactoring these 3 files based on the overall type of data they download, parse, and represent as networks, and keeping the distinction of function purpose in the function name prefix only, as is already done (i.e. download_*, parse_*, and transform_*). The following describes the tentative new organization:

git.R

Functions that rely on the existence of a .git folder (not GitHub).

mail.R

All mailing list archives end up as .mbox files and offer little interface beyond downloading their data.

Issue Trackers

Originally, I considered that an issue.R would make more sense. However, the github.R file is fairly extensive, and has seen at least three use cases in Kaiaulu: a) analysis of events, b) analysis of bug count, c) analysis of social smells. The JIRA API also offers the same level of access for analysis, and Bugzilla offers at least the potential for bug and communication analysis. Therefore, I believe issue.R is better reserved for functions that abstract these types of operations behind a common interface, pulling from these more specialized modules, to be future-proof.

bugzilla.R

Bugzilla offers a quite extensive API. Although we do not currently use it extensively, I believe it makes sense to separate it into its own module.

jira.R

Kaiaulu's downloader for JIRA is currently an R package, so the Notebook provides guidance on how to use it to download data, but it is not defined as part of the API.

github.R

This already exists.

src.R

This should contain functions that examine source files and do not depend on the existence of a .git folder.

dv8.R

This already exists, but some parser.R functions should be moved to dv8.R.

vulnerabilities.R

Functionality associated with the analysis of CVE, CWE, and CAPEC.

identity.R, graph.R, motif.R, gof.R, text.R

No changes to these modules. Their interfaces are fairly self-contained.

metric.R

This already exists, but I believe the contents of smells.R would be better placed here.

interval.R

This file would likely be better renamed to series.R, and could support time series analysis in Kaiaulu. Currently, it is not very clear how this could be done.

example.R

This should be fine as its own file for now, although the name can be misleading, as example files are normally example code for users to try the tool. It may need a better name in the future.

Homeless Functions

Currently, the parse_reply and transform_reply_to_bipartite_network functions would be homeless, as they serve as an abstraction over mail.R and jira.R. A reply.R file may make sense for a communication-related API in Kaiaulu, or maybe this level of abstraction should not exist in the API and users should refer to the Notebook for it. I will give this more thought. Another homeless function is parse_java_code_refactoring_json.
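
A minimal sketch of what such a reply.R abstraction could look like, assuming it only dispatches to parsers owned by the specialized modules and returns one common table schema. Every function name, argument, and column below is hypothetical and chosen for illustration; none of them is Kaiaulu's actual API.

```r
# Hypothetical stand-in for a parser that would live in mail.R.
parse_mbox_sketch <- function(raw) {
  data.frame(reply_from = raw$from,
             reply_subject = raw$subject,
             stringsAsFactors = FALSE)
}

# Hypothetical stand-in for a parser that would live in jira.R.
parse_jira_comments_sketch <- function(raw) {
  data.frame(reply_from = raw$author,
             reply_subject = raw$issue_key,
             stringsAsFactors = FALSE)
}

# Hypothetical reply.R entry point: one function, one common schema,
# with the source-specific logic left in the specialized modules.
parse_reply_sketch <- function(raw, source = c("mail", "jira")) {
  source <- match.arg(source)
  switch(source,
         mail = parse_mbox_sketch(raw),
         jira = parse_jira_comments_sketch(raw))
}

mail_replies <- parse_reply_sketch(
  list(from = "ana@example.org", subject = "Re: build"), "mail")
jira_replies <- parse_reply_sketch(
  list(author = "bob", issue_key = "PROJ-1"), "jira")
```

Both calls return tables with the same columns, which is what would let a downstream transform_ function stay agnostic of where the replies came from.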

carlosparadis commented 1 year ago

Scenario 1: Adding /exec scripts.

In the longer term, we would like to have /exec scripts that offer a CLI to the most-used features of Kaiaulu's API, so they can be used server-side. The most obvious use case is running Kaiaulu's downloaders from a cron job. In the current architecture, most scripts will pull from download.R, and possibly from parser.R. In the new architecture, each script will only need its respective module. For example, a script to download JIRA issues relies only on jira.R.
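
As a rough sketch of such a script, assuming a hypothetical file exec/download_jira_issues.R that takes a project key and a save folder as positional arguments. The downloader below is a placeholder that only builds the target path; in the new architecture the real function would come from jira.R alone. The script name, arguments, and function names are all assumptions made for illustration.

```r
#!/usr/bin/env Rscript
# Hypothetical exec/download_jira_issues.R.
# Example cron entry (hypothetical paths), running daily at 02:00:
#   0 2 * * * Rscript exec/download_jira_issues.R PROJ /data/jira

# Validate the positional command line arguments.
parse_cli_args <- function(args) {
  if (length(args) != 2) {
    stop("Usage: download_jira_issues.R <project_key> <save_folder>")
  }
  list(project_key = args[1], save_folder = args[2])
}

# Placeholder for the downloader that would live in jira.R.
# It performs no network access and only returns the target file path.
download_jira_issues_sketch <- function(project_key, save_folder) {
  file.path(save_folder, paste0(project_key, "_issues.json"))
}

main <- function(args = commandArgs(trailingOnly = TRUE)) {
  opts <- parse_cli_args(args)
  target <- download_jira_issues_sketch(opts$project_key, opts$save_folder)
  message("Would save issues to: ", target)
  invisible(target)
}

# Only run when arguments are supplied, so sourcing the file has no side effects.
if (!interactive() && length(commandArgs(trailingOnly = TRUE)) > 0) {
  main()
}
```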

carlosparadis commented 1 year ago

Scenario 2: Adding optional DB capability

In the current architecture, a db.R file would need to be defined. The module would be similar to the parser.R functions, in that it would ingest raw data, but rather than output a single table, it would generate several tables in a normalized manner. In the new architecture, the db_ functions would exist within every module. E.g. there would be a db_ function in jira.R, another in mail.R, etc. The logic of how to transform raw data for insertion into an optional database would be self-contained within the module for the tool it is in charge of.

Since the parser.R module would continue to grow indefinitely in the current architecture, the db.R module would suffer from the same issue. In the new architecture, however, this logic would be distributed across the modules whose data is expected to be available in the database.
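
To make the idea of a per-module db_ function concrete, here is a minimal sketch for jira.R, assuming the raw data has already been read into a flat table with one row per issue. The function name, the input columns, and the two output tables are hypothetical; the point is only that one input yields several normalized tables linked by keys.

```r
# Hypothetical db_ function that would live in jira.R.
# Splits a flat issue table into an author table and an issue table
# that refers to authors by id instead of repeating their names.
db_jira_sketch <- function(issues) {
  author_names <- unique(issues$author_name)
  author <- data.frame(author_id = seq_along(author_names),
                       author_name = author_names,
                       stringsAsFactors = FALSE)
  issue <- data.frame(issue_key = issues$issue_key,
                      author_id = match(issues$author_name, author$author_name),
                      stringsAsFactors = FALSE)
  list(author = author, issue = issue)
}

# Made-up input, for illustration only.
raw_issues <- data.frame(issue_key = c("PROJ-1", "PROJ-2", "PROJ-3"),
                         author_name = c("ana", "bob", "ana"),
                         stringsAsFactors = FALSE)
tables <- db_jira_sketch(raw_issues)
```

Each element of the returned list maps to one database table, so the insertion step itself could stay generic while the normalization logic stays in the module that knows the data.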

carlosparadis commented 12 months ago

Scenario 3: Unit Tests

By convention, testthat test files are named test- followed by the name of the file under test. Much like parser.R is getting bloated with parser functions, so is test-parser.R as a consequence. Comparing the two architectures, future unit tests would therefore be cleaner and saner to navigate if they were split into test-git.R, test-jira.R, test-mail.R, test-bugzilla.R, etc., rather than all combined in test-parser.R as they currently are.
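
A small sketch of how a per-module test file could pair with its module, assuming a hypothetical helper in jira.R. The helper and its behavior are made up for illustration; only test_that and expect_equal are real testthat functions, and the test is guarded so the snippet still runs where testthat is not installed.

```r
# --- would live in R/jira.R (hypothetical function, for illustration) ---
parse_issue_keys_sketch <- function(raw_keys) {
  # Keep only well-formed keys such as "PROJ-12",
  # dropping surrounding whitespace, blanks, and duplicates.
  unique(grep("^[A-Z]+-[0-9]+$", trimws(raw_keys), value = TRUE))
}

# --- would live in tests/testthat/test-jira.R ---
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("malformed and duplicate issue keys are dropped", {
    keys <- parse_issue_keys_sketch(c("PROJ-1", " PROJ-1 ", "", "not a key"))
    testthat::expect_equal(keys, "PROJ-1")
  })
}
```

With one test file per module, a failing JIRA test points directly at jira.R rather than at a shared parser file.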