sailuh / kaiaulu

An R package for mining software repositories
http://itm0.shidler.hawaii.edu/kaiaulu
Mozilla Public License 2.0

Create a fake data generator for jira issues and comments #228

Closed: carlosparadis closed this issue 11 months ago

carlosparadis commented 1 year ago

Let's use this issue for a fake JIRA issue data generator. We first need to agree on the endpoint that will serve as our example data. The prior issue was to create a JIRA issue crawler, but I will recycle the issue since the endpoint needs are still the same.

Edit: Here's some more guidance.

The JIRA API

At the time of this issue, JIRA offers a v2 and a v3 (beta) REST API. Testing some v3 endpoints against Apache JIRA, which is where most papers get their data from, v2 works and v3 doesn't. So we will stick to v2.

The v2 API docs are: https://developer.atlassian.com/cloud/jira/platform/rest/v2/intro/#about

Endpoints of interest

After choosing the API version, we need to select which endpoints we care about for the fake data. This has to match what the future JIRA parser should implement. Let's first consider what Kaiaulu currently uses, which is the JirAgileR package.

The JirAgileR Package Issue and Comment Downloader

This is the call Kaiaulu makes to the package in order to download issue data:

https://github.com/sailuh/kaiaulu/blob/cb2f4098d6aeedb1cf3721a4de75803fe97145f1/vignettes/download_jira_data.Rmd#L74-L88

In turn, this is the portion of JirAgileR code that takes those parameters:

https://github.com/matbmeijer/JirAgileR/blob/7626a419f8f9e19aa6d73bb65e7a5c1c7c4da26e/R/exports.R#L540-L546

  url<-httr::modify_url(url = url,
                        scheme = if(is.null(url$scheme)){"https"},
                        path = adapt_list(url$path, c("rest", "api", "latest", "search")),
                        query=list(jql=jql_query,
                                   fields=conc(fields),
                                   startAt = "0",
                                   maxResults = maxResults))

What we care about here is the line:

path = adapt_list(url$path, c("rest", "api", "latest", "search")),

This means the library Kaiaulu relies on uses the /search endpoint. In the docs, that corresponds to this endpoint:

https://developer.atlassian.com/cloud/jira/platform/rest/v2/api-group-issue-search/#api-rest-api-2-search-get

In theory, this means we could use the JirAgileR package to obtain sample data. However, a deficiency of the package for our use case is that it does not store the raw data, but parses it into a table. In Kaiaulu, raw data and parsed data are separate by design: if a new field is needed in the future and we have all the raw data stored, we do not need to re-download it. Because of that, and since we have no JIRA crawler of our own to substitute it yet, we will use the browser to obtain some sample data from Apache.
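
For context, here is a minimal sketch (not Kaiaulu code) of what storing the raw /search response verbatim could look like with httr; the function name and output path are hypothetical, and for this issue we will simply use the browser instead:

library(httr)

# Hypothetical helper: fetch the raw /search JSON and store the body verbatim,
# so any field can be re-parsed later without re-downloading.
download_jira_search_raw <- function(base_url, jql, out_path, max_results = 50) {
  response <- httr::GET(
    url = paste0(base_url, "/rest/api/2/search"),
    query = list(jql = jql, maxResults = max_results),
    httr::accept_json()
  )
  httr::stop_for_status(response)
  writeLines(httr::content(response, as = "text", encoding = "UTF-8"), out_path)
  invisible(out_path)
}

# e.g. download_jira_search_raw("https://issues.apache.org/jira", "project=SPARK", "spark_raw.json")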

Generating Sample Data from Apache

If we look at the docs, we will see the URL format is:

curl --request GET \
  --url 'https://your-domain.atlassian.net/rest/api/2/search?jql=project%20%3D%20HSP' \
  --user 'email@example.com:<api_token>' \
  --header 'Accept: application/json'

What we need here is the first portion: https://your-domain.atlassian.net/rest/api/2/search?jql=project%20%3D%20HSP

And to get sample data, let's consider Apache Spark. A quick Google search lands us on its JIRA project page: https://issues.apache.org/jira/projects/SPARK/issues

Thus we can construct the API URL as: https://issues.apache.org/jira/rest/api/2/search?jql=project=SPARK (I recommend opening this in Firefox, as it automatically formats the returned JSON nicely):

[Screenshot: JSON response for the SPARK query, showing the total and maxResults fields]

We can see that the total number of issues SPARK has is over 45k; however, the maxResults parameter indicates only 50 were returned. This is to preserve the Apache server's bandwidth. With a crawler implementation, we would thus need to do "pagination", i.e. request one page at a time. But since we are only interested in creating fake data here, the JSON returned by this URL is what we are interested in. Let's consider one more step to get the final data we want.
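
For reference only, since the fake data generator does not need it, here is a sketch of how a crawler could paginate with startAt and maxResults (the parameter names come from the API docs; the function name is hypothetical):

library(httr)
library(jsonlite)

# Hypothetical pagination loop: keep requesting pages until startAt reaches
# the "total" reported by the /search endpoint.
fetch_all_pages <- function(base_url, jql, page_size = 50) {
  start_at <- 0
  pages <- list()
  repeat {
    response <- httr::GET(
      url = paste0(base_url, "/rest/api/2/search"),
      query = list(jql = jql, startAt = start_at, maxResults = page_size)
    )
    httr::stop_for_status(response)
    page <- jsonlite::fromJSON(
      httr::content(response, as = "text", encoding = "UTF-8"),
      simplifyVector = FALSE
    )
    pages[[length(pages) + 1]] <- page
    start_at <- start_at + page_size
    if (start_at >= page$total) break
  }
  pages
}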

Specifying Time Ranges

I mentioned at the start of this semester that we would like to eventually be able to leave the code running on the server to keep downloading data. This means our data request has to somehow know of some "checkpoint" of the data we already have, so we don't have to re-download all the issues every time we want to update. A reasonable way to do so is to simply ask the JIRA API to "only download issues after date X". A quick Google search led me to this Stackoverflow question:

https://stackoverflow.com/questions/36450428/how-to-query-through-the-date-range-by-jira-rest-api

Using the example above for Apache Spark we then get:

(*) https://issues.apache.org/jira/rest/api/2/search?jql=project=SPARK AND created >= 2021-01-01 AND created <= 2021-01-02

In this query, we see maxResults is still the default of 50, but our total is now just 3 issues. We have thus succeeded in getting a sample JSON file we can use for our fake data generator.

Specifying Issue Ranges

An alternative to specifying date ranges is to specify issue ranges, which are more precise:

https://support.atlassian.com/jira-software-cloud/docs/jql-fields/#Issue-key

The time range query above returned SPARK-33953 through SPARK-33955.

The equivalent query using Issue Key is thus: https://issues.apache.org/jira/rest/api/2/search?jql=project=SPARK AND issueKey >= SPARK-33953 AND issueKey <= SPARK-33955

Let's use the issueKey query to construct our example, since we have more control over what we are getting. In particular, we should construct one query that returns no issues and one that returns a single issue. With that, you have sufficient examples of the format of the returned data for 0, 1 and "n" issues.
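
As a sketch, the JQL strings for the 1 and "n" cases could look like the ones below (the empty case still has to be confirmed by hand against Apache's JIRA, e.g. by picking a date window with no issues):

# Hypothetical JQL strings; URL-encode them and append to .../rest/api/2/search?jql=
jql_one  <- "project=SPARK AND issueKey >= SPARK-33953 AND issueKey <= SPARK-33953"
jql_many <- "project=SPARK AND issueKey >= SPARK-33953 AND issueKey <= SPARK-33955"
url_one  <- paste0("https://issues.apache.org/jira/rest/api/2/search?jql=",
                   utils::URLencode(jql_one, reserved = TRUE))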

I'd also recommend you keep the browser interface of the data you are getting on hand. It makes it easier to reason about the data, after all. For example, contrast the issue data from the query above to:

https://issues.apache.org/jira/browse/SPARK-33953?filter=-4

Parameters of Interest

Next, we use this sample data to construct a function that can create example files. The approach is similar to how Codeface does it: https://github.com/lfd/codeface/blob/e6640c931f76e82719982318a5cd6facf1f3df48/codeface/test/integration/gitproject.py#L213-L242

You simply copy the JSON portion associated with 0 and/or 1 issue, specify the parameters, and then encode in the function logic how to create more than one issue.

Which fields do we care about being able to edit? The ones Codeface edits are a good starting point, as are the ones used in our Git fake data: the issue creation date, the issue type, the author name and e-mail, etc.

Once the json string is formatted, you can use jsonlite::write_json to write it to disk.
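
As a minimal sketch (the function name and the field subset below are hypothetical; the real generator should mirror the full JSON structure copied from the sample /search response):

library(jsonlite)

# Hypothetical fake issue generator: builds an R list shaped like the
# /search response and writes it to disk as JSON.
make_fake_jira_issues <- function(n_issues,
                                  project_key = "FAKE",
                                  issue_type = "Bug",
                                  author_name = "Fake Author",
                                  author_email = "fake@example.com",
                                  created = "2021-01-01T00:00:00.000+0000") {
  issues <- lapply(seq_len(n_issues), function(i) {
    list(
      key = paste0(project_key, "-", i),
      fields = list(
        issuetype = list(name = issue_type),
        created = created,
        creator = list(displayName = author_name, emailAddress = author_email)
      )
    )
  })
  list(startAt = 0, maxResults = n_issues, total = n_issues, issues = issues)
}

fake <- make_fake_jira_issues(2)
jsonlite::write_json(fake, "fake_issues.json", auto_unbox = TRUE, pretty = TRUE)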

Issue Comments

The second part of this task is to do the same as above, but to obtain comments data.

If you look at Kaiaulu Notebook:

https://github.com/sailuh/kaiaulu/blob/cb2f4098d6aeedb1cf3721a4de75803fe97145f1/vignettes/download_jira_data.Rmd#L109-L122

You will see the comment field has to be explicitly specified so it is included. You can check the list of all fields used in Apache Spark using this:

Moreover, you can see the same function from JirAgileR is being used. Indeed, fields is one of the parameters of the /search endpoint:

[Screenshot: the fields parameter in the /search endpoint documentation]

So if we format our query accordingly:

https://issues.apache.org/jira/rest/api/2/search?fields=comment&jql=project=SPARK AND issueKey >= SPARK-33953 AND issueKey <= SPARK-33955

Note what I just added there: fields=comment&. With it, we can get the sample data with comments. Once more, it is useful to check the actual browser interface to make sense of the comment field:

https://issues.apache.org/jira/browse/SPARK-33954

Indeed, in this case you will see it is even a bot posting the comment on the author's behalf.

We will need a separate function that creates fake data including the comments field, and that is able to modify the information above not only for issues but also for comments.
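
A possible sketch of the comment half, reusing the hypothetical generator above (the comment sub-fields here are a trimmed-down guess at the fields=comment response and should be copied from the actual sample data):

# Hypothetical fake comment builder.
make_fake_jira_comment <- function(body = "Fake comment body.",
                                   author_name = "Fake Commenter",
                                   created = "2021-01-02T00:00:00.000+0000") {
  list(
    author = list(displayName = author_name),
    body = body,
    created = created
  )
}

# A fake issue's fields could then gain a comment block, e.g.:
# issue$fields$comment <- list(comments = list(make_fake_jira_comment()), total = 1)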

waylonho commented 1 year ago

Some reference links I was looking at: https://developer.atlassian.com/server/jira/platform/rest-apis/ https://developer.atlassian.com/cloud/jira/platform/rest/v3/intro/#experimental

Still quite unfamiliar with all the steps for making the crawler but so far my general thinking is:

I will develop more specific explanations as time goes on and the project progresses.

waylonho commented 1 year ago

Sorry, these are last minute. After a while of going through download.R:

Does the current code account for an error case? Also, I see that the GET request is specifically by date ("creation_time"), should this be the same for the JIRA crawler?

https://github.com/sailuh/kaiaulu/blob/cb2f4098d6aeedb1cf3721a4de75803fe97145f1/R/download.R#L45-L46

Also, I'm not sure I fully understand the perceval functions. I get what it's doing (I think), but how is this functionally different from the other functions (the non perceval ones)?

https://github.com/sailuh/kaiaulu/blob/cb2f4098d6aeedb1cf3721a4de75803fe97145f1/R/download.R#L133C1-L142C2

carlosparadis commented 1 year ago

Does the current code account for an error case?

I do not believe the prior group added much in the way of checking for errors. It is a nice addition to have, but not necessary as a minimum deliverable.

Also, I see that the GET request is specifically by date ("creation_time"), should this be the same for the JIRA crawler?

What parameters you can pass to the function depends on what endpoints the API offers. If the JIRA API does not offer a "request issues by date range" option, then you can't write a function requesting it. I suggest you study the available endpoints for obtaining issues and comments and list them here as a proposal for functions like the one you cited. For example, can we obtain said data by date? Is there any other information we can pass to obtain them (e.g. an issue id range, etc.)?

Bugzilla had a particularly annoying limitation of not being able to fully specify the time range, if memory serves me right.

Also, I'm not sure I fully understand the perceval functions. I get what it's doing (I think), but how is this functionally different from the other functions (the non perceval ones)?

Thank you for noting down the confusion. This means the Bugzilla notebook and/or the R functions need more documentation :) There were 3 crawler functions requested from the group: two interfaced with Perceval, and one was built into Kaiaulu. If you see the Perceval docs:

https://github.com/chaoss/grimoirelab-perceval#bugzilla

You will see why it has two endpoints. It used to be the case that Bugzilla, the issue tracker, did not offer an API. Meaning, you had to literally download the HTML pages, parse the HTML, and get the information out. Some OSS projects still use an old version of Bugzilla, so Perceval (the MSR tool) will do that for you.

The second Perceval option is for the more current Bugzilla, the one that offers the API. So, different interfaces, different commands to obtain the data.

Then we have the third option in Kaiaulu, which is a built-in Bugzilla crawler. Now a fair question you may have is "Why reinvent the wheel?". The answer is simple: flexibility in modifying a crawler that is ours. For example, at the time, the group found out you had to manually specify a parameter for the Perceval Bugzilla crawler, which was very hard to figure out. So in their function they made it so that, based on the first downloaded data, the function could figure it out as long as the user gave an upper bound:

https://github.com/sailuh/kaiaulu/blob/cb2f4098d6aeedb1cf3721a4de75803fe97145f1/R/download.R#L13-L16

Specifying it wrongly could lead to incomplete data being downloaded. Having the local Bugzilla implementation allowed us to write a bit of logic to figure that out and no longer ask the user.

This is also why Milestone 2 exists here. We currently rely on a JIRA crawler R package. Unfortunately, it has some flaws, and we would be better off building our own.

This also leads into Milestone 3, which is modifying some of our downloaders to be able to add to existing data, rather than treat every download request independently. We need a "refresh" capability.

carlosparadis commented 1 year ago

@waylonho I've updated my first comment with a comprehensive discussion of how to use the JIRA API, so that your task is more or less equivalent to Ruben's, i.e. get the data and write the fake generator.

With the fake generator in place, you will do the same as you did for Git: write some example functions in example.R so you can call parse_jira for unit testing. Of course, you currently do not have a parse_jira function to call in the unit test. You should nonetheless write the test functions, and create a mock-up parse_jira function that just returns TRUE for now. If I have a chance, I will write the function before you get to the point of needing it.
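
For illustration, a sketch of what the placeholder test could look like with testthat; parse_jira below is the temporary mock that just returns TRUE, and make_fake_jira_issues refers to the hypothetical generator sketched earlier:

library(testthat)
library(jsonlite)

# Temporary mock until the real parser exists.
parse_jira <- function(json_path) TRUE

test_that("fake jira issue file can be parsed", {
  json_path <- tempfile(fileext = ".json")
  fake <- make_fake_jira_issues(1)  # hypothetical fake data helper
  jsonlite::write_json(fake, json_path, auto_unbox = TRUE, pretty = TRUE)
  expect_true(parse_jira(json_path))
})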

carlosparadis commented 1 year ago

So I don't forget: I added an example starting point here you can refactor (much as you did with git_create_sample_log) into a set of functions, etc.

https://github.com/sailuh/kaiaulu/blob/master/R/jira.R

We also already have a parse_jira() and the ability to save JSONs using the library, so this should be much easier to do and to test.

carlosparadis commented 11 months ago

Since we now have a functional JIRA fake issue generator which can test parse_jira, I am closing this issue. I am making a new one with some conclusions from the final meeting findings of this effort, but that is obviously future work.

Thanks for the PR!