sailuh / kaiaulu

An R package for mining software repositories
http://itm0.shidler.hawaii.edu/kaiaulu
Mozilla Public License 2.0

Guidance on creating exec scripts #304

Open nicolehoess opened 2 months ago

nicolehoess commented 2 months ago

As we expanded the Kaiaulu CLI, we came up with several questions that might be of interest to other developers working on the CLI as well. Therefore, we will summarize them here for further discussion. :)

  1. Should we favor interfaces that perform larger parts of an analysis (e.g. the social smells example), or small interfaces with limited functionality (e.g. the mailing list downloader)?

  2. For a parallelized analysis that uses multiple cores to analyse multiple time windows of history, for example, should we have an additional interface without parallelization (to avoid potential device-specific parallelization issues)?

  3. When analysing a project's history over several time windows, we can specify them explicitly or specify a window size and derive the windows automatically. Can one option be taken as default, e.g. taking the window size into account only if ranges are not provided?

  4. If there are multiple similar functions, e.g. for constructing temporal file-based and temporal entity-based collaboration networks, should we have a single interface with one or more parameters or create separate interfaces (including some duplicate code)? If we opt for an interface with several parameters, should we pass them via a configuration file?

  5. The outputs of some functions could be exported as tables, graphs, networks, etc. Which representation should we choose for the CLI?

  6. When updating existing configuration files, e.g. to demonstrate a new CLI, should the existing parameter choices be overwritten or should we create a new configuration file?

carlosparadis commented 1 month ago

Thank you for moving our Discord exchange here. Adding my responses here too for future reference :)


(1) Should we favor interfaces that perform larger parts of an analysis (e.g. the social smells example), or small interfaces with limited functionality (e.g. the mailing list downloader)?

The rule of thumb I used for those follows a concept I saw in SNAP (the graph library from Stanford). Basically, an exec script does enough steps to do something useful for the user, rather than offering a CLI for building a pipeline. This will, of course, mean a lot of copying and pasting from the main code into exec, but seeing the repetition there will help me refine the Kaiaulu API to reduce it (or maybe create a second layer of API functions for simplified use, like ggplot has qplot).

(2) For a parallelized analysis that uses multiple cores to analyse multiple time windows of history, for example, should we have an additional interface without parallelization (to avoid potential device-specific parallelization issues)?

This is very reasonable!
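To make the idea concrete, here is a minimal Python sketch of a shared core that can run with or without a worker pool. Everything in it is hypothetical: `analyze_window`, `analyze_history`, and the `workers` parameter are illustrative names, not Kaiaulu functions, and the thread pool is only a stand-in for whatever parallel backend the real exec script uses. Whether the sequential path ends up behind a flag or in a separate exec script, the point is that both paths return the same result.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze_window(window):
    """Placeholder per-window analysis: returns the window length."""
    start, end = window
    return end - start


def analyze_history(windows, workers=1):
    """Analyze every window; workers=1 runs sequentially, without any pool."""
    if workers <= 1:
        # Sequential fallback: no pool is created, which avoids
        # device-specific parallelization issues.
        return [analyze_window(w) for w in windows]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results match the sequential path.
        return list(pool.map(analyze_window, windows))


# Hypothetical windows expressed as (start_day, end_day) offsets.
windows = [(0, 90), (90, 180), (180, 270)]
sequential = analyze_history(windows)
parallel = analyze_history(windows, workers=2)
```

A user who runs into trouble with the parallel path can then fall back to `workers=1` and still get identical output.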

(3) When analysing a project's history over several time windows, we can specify them explicitly or specify a window size and derive the windows automatically. Can one option be taken as default, e.g. taking the window size into account only if ranges are not provided?

The window size in the config could probably use more thought, so there is no issue there. You may just want to note that behavior in the docs of the exec (which you can do using the CLI library used in the other execs, so it shows when the user types --help); otherwise they would need to open the script to see it. In the future, I will probably try to improve that region of the config file and map some ideas from how Codeface does things. The downloaders are now more or less "per file" anyway, so the notion of time windows can probably be handled in a more structured manner.
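As a rough illustration of that default and of documenting it in `--help`, here is a Python sketch. The flag names `--ranges` and `--window-size`, the 90-day default, and the reading of ranges as a list of window boundaries are all assumptions made for the example, not the actual Kaiaulu config or CLI.

```python
import argparse
from datetime import date, timedelta


def derive_windows(start, end, size_days):
    """Split [start, end) into consecutive windows; the last may be shorter."""
    windows = []
    cursor = start
    step = timedelta(days=size_days)
    while cursor < end:
        upper = min(cursor + step, end)
        windows.append((cursor, upper))
        cursor = upper
    return windows


def resolve_windows(ranges, start, end, size_days):
    """Explicit ranges win; the window size is used only without ranges."""
    if ranges:
        # Treat the ranges as boundaries: [d0, d1, d2] -> (d0, d1), (d1, d2).
        return list(zip(ranges[:-1], ranges[1:]))
    return derive_windows(start, end, size_days)


def build_parser():
    """Hypothetical CLI; the help strings document the precedence rule."""
    parser = argparse.ArgumentParser(description="Hypothetical windowed analysis")
    parser.add_argument(
        "--ranges", nargs="*", type=date.fromisoformat, default=[],
        help="explicit window boundaries as ISO dates; takes precedence over --window-size")
    parser.add_argument(
        "--window-size", type=int, default=90,
        help="window length in days; used only if --ranges is not given")
    return parser
```

With the rule spelled out in the help text, a user can see which option wins without opening the script.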

(4) If there are multiple similar functions, e.g. for constructing temporal file-based and temporal entity-based collaboration networks, should we have a single interface with one or more parameters or create separate interfaces (including some duplicate code)? If we opt for an interface with several parameters, should we pass them via a configuration file?

For now, go with what is easier for you. I am sure that, after all is said and done, there will be lessons learned from the code redundancy across files that will help inform the API and, in turn, the refactorings.

You have two options here, I guess. One is making the choice an explicit parameter that the user specifies on the CLI, as there is nowhere in the project configuration file to put these decisions (author-file, commit-file, etc.).

The second is, if you want to experiment, you could try making a second config file, let's call it _analysis.conf. In it you would include all the choices particular to your exec scripts in some sane manner. Maybe you could have an analysis_1, analysis_2, etc., and the CLI would have some function for fishing the parameters from there. That would keep everything reproducible if in the future you want to remember which output corresponds to which analysis (you would just need to match the analysis name to the output file).
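A minimal Python sketch of what such an analysis config and the "fishing" function might look like follows. The INI layout, the keys `network` and `output`, and the helpers `load_analysis` and `output_path` are assumptions made up for illustration; the real file could just as well use another format.

```python
import configparser

# Hypothetical contents of an _analysis.conf file: one section per analysis.
EXAMPLE_CONF = """
[analysis_1]
network = author-file
output = table

[analysis_2]
network = commit-file
output = graph
"""


def load_analysis(conf_text, name):
    """Fish the parameters of one named analysis out of the config text."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    if name not in parser:
        raise KeyError(f"no analysis named {name!r}")
    return dict(parser[name])


def output_path(project, name):
    """Name the output after the project and the analysis so they can be matched later."""
    return f"{project}_{name}.csv"


params = load_analysis(EXAMPLE_CONF, "analysis_2")
path = output_path("kaiaulu", "analysis_2")
```

Because the analysis name appears in both the config section and the output file name, it stays possible to trace an output back to the choices that produced it.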

(5) The outputs of some functions could be exported as tables, graphs, networks, etc. Which representation should we choose for the CLI?

Up to you; think about what would be the most helpful output! The execs should be outputting what is useful anyway!

(6) When updating existing configuration files, e.g. to demonstrate a new CLI, should the existing parameter choices be overwritten or should we create a new configuration file?

Go ahead and name them after the project and give them another suffix. These will be excellent examples for identifying where I should separate an analysis config from the project config in the future.