sailuh / kaiaulu

An R package for mining software repositories
http://itm0.shidler.hawaii.edu/kaiaulu
Mozilla Public License 2.0

Filtering Git History #306

Closed BenjyNStrauss closed 3 months ago

BenjyNStrauss commented 3 months ago

Apologies if this is in the docs, but how does one use Kaiaulu to filter the git history? (and how does Kaiaulu do this?) I'm trying to figure out which commits were due to bugs specifically.

Thanks in advance, -BNS

carlosparadis commented 3 months ago

I believe there is still a misconception of what Kaiaulu is: Kaiaulu has a function called parse_gitlog() that gives you a table you can manipulate in R. Filtering a git history is then the same as subsetting any other table in R with the data.table library. I'd suggest you check the data.table docs on table subsetting. You could also just save the table using fwrite and manipulate it in Excel or another language. The Notebooks explain how to use the Kaiaulu R functions related to git analysis.

Kaiaulu, as an R package, just provides you a set of convenience functions that gives you tables in R. You will still have to use R to manipulate the data and build your own pipelines.
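To make the "export the table and subset it elsewhere" route concrete, here is a minimal sketch in Python. It assumes the parse_gitlog() table was already saved to CSV with fwrite; the inline sample rows and the column names (`commit_hash`, `author_datetimetz`, `commit_message`, `file_pathname`) are assumptions for illustration, so check them against the header of your own file.

```python
import csv
import io

# Stand-in for a CSV written by fwrite(); rows and column names are
# illustrative assumptions, not real Maven data.
SAMPLE_CSV = """commit_hash,author_datetimetz,commit_message,file_pathname
a1b2c3,2020-07-01 10:00:00,[MNG-6956] Fix NPE in resolver,core/Resolver.java
d4e5f6,2020-07-02 11:30:00,Update README,README.md
0a9b8c,2020-07-03 09:15:00,[MNG-7001] Add CLI flag,cli/Main.java
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per table row."""
    return list(csv.DictReader(io.StringIO(text)))


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


rows = load_rows(SAMPLE_CSV)

# Example subset: only rows that touch Java source files.
java_rows = filter_rows(rows, lambda r: r["file_pathname"].endswith(".java"))

for row in java_rows:
    print(row["commit_hash"], row["file_pathname"])
```

With a real export you would replace `io.StringIO(text)` with an opened file; the subsetting step stays the same.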

BenjyNStrauss commented 3 months ago

I see. I was told by Dr. Kazman that Kaiaulu had this feature, unless you or I misunderstood what he was asking. His words were "You can determine whether a commit is related to a bug by looking at the issue type in the associated issue. FYI, Kaiaulu will do this for you."

Where would I find the issue number? Is that in the commit message? And if so, is it the commit message? I don't see a column for it in the table we worked to generate. Sorry I'm a bit confused.

carlosparadis commented 3 months ago

@rnkazman is right that "kaiaulu will do this for you", but not in one click or function call. It provides a toolbox of functions to help you assemble your pipeline; it does not hand you a pre-made one.

There is actually a lot that you need to work out from Rick's statement alone. For instance:

The rest of the pipeline you have to build on top of the tables. Where Kaiaulu helps is in getting the data into tables for you. The notebooks give you examples of how to weave the tables together. However, for some specifics, you will have to write your own pipeline on top of the functions.
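One piece of such a pipeline is finding the issue number inside a commit message. The sketch below is not Kaiaulu's implementation; it assumes the project writes issue keys in the JIRA style `MNG-<number>` (the form seen in Maven's tracker URLs), and the sample messages are made up.

```python
import re

# Assumed convention: the issue key appears somewhere in the commit
# message as "MNG-" followed by digits.
ISSUE_KEY = re.compile(r"\bMNG-\d+\b")


def issue_keys(message):
    """Return every issue key mentioned in a commit message, in order."""
    return ISSUE_KEY.findall(message)


# Hypothetical commit messages, for illustration only.
messages = [
    "[MNG-6956] Fix NPE in resolver",
    "Update README",
    "MNG-7001, MNG-7002: refactor CLI parsing",
]

for message in messages:
    print(message, "->", issue_keys(message))
```

Commits whose messages yield no key cannot be linked to an issue this way, which is one reason the regex has to be checked against each project's conventions.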

BenjyNStrauss commented 3 months ago

I think I understand what you're saying. Thanks for the explanation. Judging by your response, I also may be looking for the Kaiaulu docs in the wrong place.

Maven seems to have its own issue tracker.

Given my unfamiliarity with R (and by extension Kaiaulu), I'm going to have to play around with this a bit.

carlosparadis commented 3 months ago

@BenjyNStrauss

The workflow I recommend you follow is:

  1. Start on the Maven website to find out which systems the project currently uses. For example, the Maven website shows they use JIRA: https://maven.apache.org/ref/3.5.0-alpha-1/issue-tracking.html

An open-source project does not always spend its whole life in one issue tracker or version control system. Depending on the analysis you are doing, this may break your assumptions (e.g. the issue id regex).

Then I would recommend writing a project configuration file to document your findings (and also for reproducibility's sake). Looking at the conf folder I don't see one for Maven: https://github.com/sailuh/kaiaulu/tree/master/conf

I'd recommend you PR one to Kaiaulu, and then I can review whether it is correct and accept it into the repo.

  2. You know you need Git and JIRA. You also want to compute a Metric (bug count). Look for those keywords in the API docs: http://itm0.shidler.hawaii.edu/kaiaulu/reference/index.html

There you will find functions and Notebooks about them. For what you asked specifically, there is even a Notebook called Bug Count that explains how to link the messages:

http://itm0.shidler.hawaii.edu/kaiaulu/articles/bug_count.html

Kaiaulu Notebooks were written with the intent of explaining how things are done in SE, too. So it may be helpful to treat them as study material. Even when they are not related to your task at hand, they may help you get a better understanding of how things are done and how the tool supports you.
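As a rough picture of what "linking the messages" to compute a bug count can look like, here is a minimal sketch. It is not taken from the Bug Count Notebook and may differ from it; the commit rows, the issue keys, the issue types, and the column names are all made up for illustration.

```python
import re
from collections import Counter

ISSUE_KEY = re.compile(r"\bMNG-\d+\b")  # assumed key convention

# Hypothetical git table: one row per (commit, file) pair.
commits = [
    {"commit_hash": "a1", "commit_message": "[MNG-1] fix crash", "file_pathname": "A.java"},
    {"commit_hash": "a1", "commit_message": "[MNG-1] fix crash", "file_pathname": "B.java"},
    {"commit_hash": "b2", "commit_message": "[MNG-2] tidy logging", "file_pathname": "A.java"},
    {"commit_hash": "c3", "commit_message": "MNG-3: handle null", "file_pathname": "A.java"},
    {"commit_hash": "d4", "commit_message": "Update README", "file_pathname": "README.md"},
]

# Hypothetical issue table reduced to key -> issue type.
issue_types = {"MNG-1": "Bug", "MNG-2": "Improvement", "MNG-3": "Bug"}


def bug_count_per_file(commits, issue_types, key_pattern):
    """Count, per file, the distinct commits linked to a 'Bug' issue."""
    counts = Counter()
    seen = set()
    for row in commits:
        keys = key_pattern.findall(row["commit_message"])
        if not any(issue_types.get(key) == "Bug" for key in keys):
            continue
        pair = (row["commit_hash"], row["file_pathname"])
        if pair not in seen:  # guard against duplicated rows
            seen.add(pair)
            counts[row["file_pathname"]] += 1
    return counts


bug_counts = bug_count_per_file(commits, issue_types, ISSUE_KEY)
for path, count in sorted(bug_counts.items()):
    print(path, count)
```

The two inputs correspond to the two data sources in the workflow above: the git table supplies messages and files, and the issue tracker supplies the issue type.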

BenjyNStrauss commented 3 months ago

I'm sorry, I'm not entirely sure what you're recommending. I'm not familiar at all with Jira, in fact I've never even heard of it before today.

Maven seems to use Jira; its tracker is "https://issues.apache.org/jira/projects/MNG/issues/MNG-6956?filter=allopenissues". It's too big to download as a .csv file, since there are over 6K issues and the max a CSV export can handle is 1K.

What do you mean by "PR one to Kaiaulu"?

Can Kaiaulu download the csv from Maven (bypassing the limit)?

carlosparadis commented 3 months ago

@BenjyNStrauss

I recommend you start by looking up what JIRA is, or more broadly what issue trackers are. Try the JIRA Notebook in Kaiaulu, and get comfortable with how to download data using it. You are correct that you can't download everything from the page directly; this is why there is a JIRA interface in Kaiaulu.
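For intuition on how a per-request cap is usually worked around, here is a minimal pagination sketch. It is not Kaiaulu's JIRA interface: `fetch_page` is a hypothetical callable standing in for one request to JIRA's REST search endpoint, which accepts `startAt` and `maxResults` and reports a `total`. A fake tracker is used so the example runs without network access.

```python
def fetch_all_issues(fetch_page, page_size=100):
    """Collect all issues by requesting consecutive pages.

    fetch_page(start_at, max_results) is assumed to return a dict with
    "issues" (a list) and "total" (the overall number of matches).
    """
    issues = []
    start_at = 0
    while True:
        page = fetch_page(start_at, page_size)
        batch = page["issues"]
        issues.extend(batch)
        start_at += len(batch)
        # Stop when the server has nothing more to give.
        if not batch or start_at >= page["total"]:
            break
    return issues


def make_fake_tracker(n_issues):
    """Build a stand-in fetch_page over n_issues made-up issue keys."""
    all_issues = [{"key": f"DEMO-{i}"} for i in range(1, n_issues + 1)]

    def fetch_page(start_at, max_results):
        return {
            "total": n_issues,
            "issues": all_issues[start_at:start_at + max_results],
        }

    return fetch_page


issues = fetch_all_issues(make_fake_tracker(250), page_size=100)
print(len(issues), issues[0]["key"], issues[-1]["key"])
```

A real `fetch_page` would issue the HTTP request and decode the JSON response; the loop around it would not need to change.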

As for the "PR one to Kaiaulu", I meant sending a pull request of Maven's project configuration file.

But to stay within the scope of this question: why don't you try the Git notebook first and see whether you understand what it is doing to subset a git history? Moving on to other data sources before doing git subsetting will likely be too much if you are just getting started on learning about the open source software development ecosystem.

If you would like to ask other questions about JIRA or other data sources, I'd appreciate it if you could create a separate discussion question.

BenjyNStrauss commented 3 months ago

I will try looking at the notebook. (I didn't realize it existed) Thank you for your time, and for helping while I (figuratively) stumble around in the dark.