sailuh / kaiaulu

An R package for mining software repositories
http://itm0.shidler.hawaii.edu/kaiaulu
Mozilla Public License 2.0

Refresh Capabilities for Bugzilla Issue Downloader (Milestone 2) #285

Open anthonyjlau opened 8 months ago

anthonyjlau commented 8 months ago

1. Purpose

The purpose of this issue is to create a refresh capability for the Bugzilla downloader and parser. This means updating the download_bugzilla_rest_issue_comments and parse_bugzilla_rest_issues_comments functions. These will be used in a refresh_bugzilla_issues_comments function so that Bugzilla issues can be kept up to date by a cron job.

2. Process

I will mostly base my changes on existing code. I am updating download_bugzilla_rest_issue_comments by adding a comments parameter, which lets the user download issues with or without comments, and a verbose parameter, which prints details on the execution status. I also separated the formation of the query and the API call into two different functions.

The download_bugzilla_rest_issue_comments function now takes a parameter called query, a REST API query that it uses to form an API call and download the data into a .json file.

Then, I created a new function called download_bugzilla_rest_issue_comments_by_date, which takes a start_timestamp parameter and forms a REST API query with that timestamp. It then calls download_bugzilla_rest_issue_comments in a loop until a page returns no more issues.

To make the refresher, I will use the function refresh_bugzilla_issues_comments, which checks the save folder path for existing files. If there are files, it finds the most recent issue using the parse_bugzilla_latest_date function, then downloads the issues created between that issue's creation date and today. If there are no files in the save folder path, it downloads ALL issues for the Bugzilla site.
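A minimal sketch of that control flow, using the function names above (the argument list and the fallback timestamp are my assumptions, not the final signature):

refresh_bugzilla_issues_comments <- function(bugzilla_site, save_folder_path,
                                             comments = TRUE, verbose = FALSE) {
  existing_files <- list.files(save_folder_path)
  if (length(existing_files) > 0) {
    # Resume from the creation date of the most recent issue already on disk
    start_timestamp <- parse_bugzilla_latest_date(save_folder_path)
  } else {
    # Empty folder: download the site's full issue history from an early date
    start_timestamp <- "1970-01-01T00:00:00Z"
  }
  download_bugzilla_rest_issue_comments_by_date(bugzilla_site, start_timestamp,
                                                save_folder_path,
                                                comments = comments,
                                                verbose = verbose)
}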

3. Endpoints

The endpoint that is being used is /rest/bug. The query also includes creation_time, limit, and offset.

The creation_time query is specific to the second. This means it is possible to download duplicate issues if they were created in the same second. However, this is extremely unlikely, so it is acceptable.

If the user wants to download issues and comments (comments = TRUE), then the query will also include include_fields=_default,comments. For more information on the include_fields query, see https://bugzilla.readthedocs.io/en/latest/api/core/v1/general.html#useful-parameters.
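As a sketch, the query described above could be assembled like this (build_bugzilla_query is a hypothetical helper for illustration, not a function in the package):

build_bugzilla_query <- function(start_timestamp, limit, offset, comments = FALSE) {
  fields <- if (comments) "&include_fields=_default,comments" else ""
  paste0("rest/bug?creation_time=", start_timestamp,
         "&limit=", limit, "&offset=", offset, fields)
}

build_bugzilla_query("2024-01-01T00:00:00Z", limit = 20, offset = 0, comments = TRUE)
# "rest/bug?creation_time=2024-01-01T00:00:00Z&limit=20&offset=0&include_fields=_default,comments"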

4. Task List

5. Unfinished Parts

There are a few important parts that I was unable to get to. The parser for issues with comments does not work with sites other than Red Hat. More information about this is found on issue #299.

Another unfinished task was moving away from the offset query in favor of changing the start_timestamp query. More information about this is found on issue #300.

Relevant information can also be found here.

anthonyjlau commented 7 months ago

As discussed on the call, I will combine the two functions that download issues and issues with comments into one function. I will add a parameter called "comments" to download_bugzilla_rest_issue_comments: when set to TRUE, it downloads issues with comments; when set to FALSE, it downloads issues only. I will also add a "verbose" parameter: when set to TRUE, additional details about the execution are printed; when set to FALSE, they are not.
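Roughly, the combined signature would look like the sketch below (only comments and verbose are confirmed; the other arguments and the body are placeholders):

download_bugzilla_rest_issue_comments <- function(bugzilla_site, query, save_folder_path,
                                                  comments = FALSE, verbose = FALSE) {
  if (verbose) {
    message("Downloading issues from: ", bugzilla_site)
  }
  # When comments = TRUE, the query carries include_fields=_default,comments,
  # so the same API call returns issues together with their comments.
  # ... make the API call and write the .json file to save_folder_path ...
}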

anthonyjlau commented 7 months ago

API endpoints: https://bugzilla.readthedocs.io/en/latest/api/core/v1/bug.html#search-bugs

Look through this documentation or elsewhere to find the comments endpoint. Old pull requests and issues may also reference it.

Ssunoo2 commented 7 months ago

Refer to this comment for the downloader logic overlap: https://github.com/sailuh/kaiaulu/issues/290#issuecomment-2067516774

anthonyjlau commented 7 months ago

I found some information about how include_fields is used: https://bugzilla.readthedocs.io/en/latest/api/core/v1/general.html#useful-parameters

Here is an example API call for reference: rest/bug?creation_time=2024-01-01T00:00:00Z&include_fields=_default,comments&limit=0&offset=0

It seems that the include_fields parameter can be used with the /bug endpoint, even though the documentation shows it with the /user endpoint. In our example, include_fields takes the _default value, which returns the default fields of an API call. The default return value from the /bug endpoint is the list of bugs, with all of the information about each bug.

include_fields also seems to accept other resources. This means we can retrieve data from other endpoints just by adding them to the include_fields parameter. In our example, include_fields=comments is effectively the same as making an API call to /rest/bug/comments. The comments are added to each bug as a list of information about that bug's comments.

The API call above combines _default and comments which means it returns the list of bugs and their comments all in one go.
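A small illustration of that combined call using the httr and jsonlite packages (the Mozilla host is just an example; this is a sketch, not the package's downloader):

library(httr)
library(jsonlite)

url <- paste0("https://bugzilla.mozilla.org/rest/bug",
              "?creation_time=2024-01-01T00:00:00Z",
              "&include_fields=_default,comments",
              "&limit=20&offset=0")
response <- GET(url)
result <- fromJSON(content(response, as = "text"), simplifyVector = FALSE)

# Each element of result$bugs carries its own comments list, as if
# the comments resource had been queried separately for every bug
first_bug <- result[["bugs"]][[1]]
length(first_bug[["comments"]])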

I could not find any of this in the documentation, so I don't know how the other group found it out. I think they are wizards or something... :)

carlosparadis commented 7 months ago

Thank you for digging this out. It's amazing how hard it is to find information and how things interact in strange ways!

anthonyjlau commented 7 months ago

@carlosparadis I am working on the generic API call function that takes a query parameter (download_bugzilla_rest_issues_comments). However, there is a problem with making the query a parameter: in JIRA and GitHub, the API query does not change once it is made, but in Bugzilla, the limit and offset parameters change on every API call to page through the results.

Example API calls:

/rest/bug?creation_time=2024-01-01T00:00:00Z&limit=20&offset=0
/rest/bug?creation_time=2024-01-01T00:00:00Z&limit=20&offset=20
/rest/bug?creation_time=2024-01-01T00:00:00Z&limit=20&offset=40
...

To fix this, I am going to update the query in a loop, but I was wondering where the loop should go. There are two ways to do it.

One way:

download_bugzilla_rest_issues_comments <- function(query, ...) {
  repeat {
    # update the query (advance limit and offset)
    # make the API call
    # download, name, and write the files
  }
}

download_bugzilla_rest_issues_comments_by_date <- function(start_timestamp, ...) {
  # set up the query
  # call download_bugzilla_rest_issues_comments once
}

For this one, the loop is in the generic function.

Another way:

download_bugzilla_rest_issues_comments <- function(query, ...) {
  # make the API call
  # download, name, and write the files
}

download_bugzilla_rest_issues_comments_by_date <- function(start_timestamp, ...) {
  repeat {
    # change the query (advance limit and offset)
    # call download_bugzilla_rest_issues_comments
  }
}

Here, the loop is in the by_date function.

I was wondering which way is best. Let me know if you need me to go into more detail about this.

carlosparadis commented 7 months ago

Just so I am clear, the offset is already handled by the existing Bugzilla function, right? The question is where you will place the loop?

From what you said, the more future-proof way is the second option, where the query is modified in download_bugzilla_rest_issues_comments_by_date.

If the rationale I provided sounds contradictory to the change you will make, I likely misunderstood the options, so feel free to iterate further with me if you feel it is needed or you see a pro/con I did not enumerate.

anthonyjlau commented 7 months ago

I see, so I will use the second option then. Also, yes, the function handles advancing the offset parameter.

anthonyjlau commented 6 months ago

I recently found out that when you run the downloader on Bugzilla sites other than Red Hat, it returns a differently formatted .json file. The biggest difference is that Red Hat's json file has 'limit', 'offset', and 'total_matches' fields, whereas other project sites' files do not. For example, when you run this API call for Red Hat, the three fields mentioned above are listed at the bottom of the json file. Compare with this API call for GCC, where the three fields are not in the json file. The fields are also missing from this Yocto Project API call and this Mozilla API call.

Another difference between Red Hat and other Bugzilla sites is the inconsistency of comments. The format of the comments field in the Red Hat API call differs from the format in the Mozilla API call. For example, the Red Hat call has a 'creator_id' field in its comments, whereas the Mozilla call does not.

The first problem is the motivation for issue #300. Since offset is not returned in all of the json files, it is confusing that it is used in the API call. To make the API calls more intuitive and to match how other API calls are made (e.g., GitHub), we want to change the API call to modify the start_timestamp instead of the offset.

The second problem is the reason issue #299 exists. Since different API calls return different comment fields, the Bugzilla issues comments parser does not work for all json formats. More research on what the comments field looks like on different Bugzilla sites would help figure out a solution.

anthonyjlau commented 6 months ago

Due to the missing limit, offset, and total_matches fields, I changed the way the downloader does its checks: how it knows how many issues were downloaded, and how the limit is dynamically changed.

Before, the downloader used the total_matches field from the Red Hat json file on every page to check how many issues were on that page. Since projects other than Red Hat do not return that field in their json files, I changed it to check the length of the bugs list obtained from the json file. If the length of the list is 0, there are no issues on the page and the loop ends. The length of bugs and total_matches are interchangeable here because, once the offset is high enough, total_matches will be 0. For example, if there are 70 total issues and I make the API call:

/rest/bug?creation_time=2024-01-01T00:00:00Z&limit=20&offset=80

When the offset is above the total number of issues, there are no issues to return, so total_matches is 0, and the returned bugs list has length 0. This is why using the length of the bugs list is the same as using total_matches.

To change the limit, the downloader used to check whether the limit was equal to the limit_upperbound parameter. Since there is no limit field now, the downloader uses the length of the bugs list to adjust the limit. These are equivalent because the maximum number of issues you can download per page is the number of issues actually returned, so the effective limit must equal the length of the bugs list.
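Putting both checks together, the paging loop looks roughly like this (download_bugzilla_page is a hypothetical helper standing in for one API call; variable names are illustrative):

offset <- 0
limit <- limit_upperbound
repeat {
  page <- download_bugzilla_page(bugzilla_site, start_timestamp, limit, offset)
  bugs <- page[["bugs"]]
  if (length(bugs) == 0) {
    break  # empty page: equivalent to total_matches == 0 on Red Hat
  }
  # ... name and write the page's .json file ...
  limit <- length(bugs)            # the server caps page size at what it returned
  offset <- offset + length(bugs)  # advance past the issues just downloaded
}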

carlosparadis commented 6 months ago

We could also not use the limit and offset, and instead just rely on the timestamp, which may be saner. See #300.
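A sketch of what timestamp-based paging could look like (illustrative only; assumes each page can be ordered by creation_time, and the boundary issue may be re-downloaded and de-duplicated by id, mirroring the same-second caveat above):

start_timestamp <- "2024-01-01T00:00:00Z"
repeat {
  page <- download_bugzilla_page(bugzilla_site, start_timestamp, limit)  # hypothetical helper
  bugs <- page[["bugs"]]
  if (length(bugs) == 0) break
  # ... write the page to disk ...
  # Resume from the newest creation date seen instead of advancing an offset
  creation_times <- sapply(bugs, function(bug) bug[["creation_time"]])
  start_timestamp <- max(creation_times)
}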