anthonyjlau opened 8 months ago
As discussed on call, I will be combining the two functions that download issues and issues with comments into one function. I will add a parameter called comments to download_bugzilla_rest_issue_comments: when it is set to TRUE, the function downloads issues with comments; when it is set to FALSE, it downloads only issues. I will also add a verbose parameter to print out execution details: when it is set to TRUE, more information about the execution is printed; when it is set to FALSE, it is not.
API endpoints: https://bugzilla.readthedocs.io/en/latest/api/core/v1/bug.html#search-bugs
Look through this or elsewhere to find the documentation for the comments endpoint. Old pull requests and issues may also reference it.
Refer to this comment for the downloader logic overlap: https://github.com/sailuh/kaiaulu/issues/290#issuecomment-2067516774
I found some information about how include_fields is used: https://bugzilla.readthedocs.io/en/latest/api/core/v1/general.html#useful-parameters
Here is an example API call for reference:
rest/bug?creation_time=2024-01-01T00:00:00Z&include_fields=_default,comments&limit=0&offset=0
It seems that the include_fields parameter can be used with the /bug endpoint, even though the website documents it under the /user endpoint. In our example, include_fields takes the _default value, which means the call returns its default fields. The default return value from the /bug endpoint is the list of bugs with all of the information about each bug.
include_fields also seems to accept other endpoints, which means we can pull data from different endpoints just by adding them to the include_fields parameter. So in our example, include_fields=comments is the same as also making an API call to /rest/bug/comments: the comments are added to each bug as a list of information about its comments.
The API call above combines _default and comments, which means it returns the list of bugs and their comments all in one go.
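As a concrete illustration, here is a rough sketch of issuing the example call above from R with httr and jsonlite (the Red Hat base URL and the field values are taken from the example; none of this is kaiaulu code):

library(httr)
library(jsonlite)

base_url <- "https://bugzilla.redhat.com/rest/bug"
query <- list(
  creation_time = "2024-01-01T00:00:00Z",
  include_fields = "_default,comments",
  limit = 0,
  offset = 0
)
response <- GET(base_url, query = query)
parsed <- fromJSON(content(response, as = "text", encoding = "UTF-8"),
                   simplifyVector = FALSE)
# parsed$bugs is the list of bugs; each bug carries its comments because of
# include_fields=_default,comments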
I could not find any of this on the documentation website so I don't know how the other group found this out. I think they are wizards or something... :)
Thank you for digging this out. It's amazing how hard it is to find information and how things interact in strange ways!
@carlosparadis I am working on making the generic API call function that takes a query parameter (download_bugzilla_rest_issues_comments). However, there is a problem with making the query a parameter. In JIRA and GitHub, the API query does not change once it is made. In Bugzilla, the limit and offset in the query change with every API call in order to page through the results.
Example API calls:
/rest/bug?created_date=2024-01-01T00:00:00Z&limit=20&offset=0
/rest/bug?created_date=2024-01-01T00:00:00Z&limit=20&offset=20
/rest/bug?created_date=2024-01-01T00:00:00Z&limit=20&offset=40
...
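For illustration, the paged query strings above could be generated like this (a sketch only; the field name mirrors the example calls):

limit <- 20
page <- 2  # third page
query <- paste0("created_date=2024-01-01T00:00:00Z",
                "&limit=", limit,
                "&offset=", page * limit)
# query is now "created_date=2024-01-01T00:00:00Z&limit=20&offset=40"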
To handle this, I am going to update the query in a loop, but I was wondering where the loop should go. There are two ways to do it.
One way:
download_bugzilla_rest_issues_comments <- function(...) {
  repeat {
    # update query
    # make API call
    # download, name, and write files
  }
}

download_bugzilla_rest_issues_comments_by_date <- function(...) {
  # set up query
  # call download_bugzilla_rest_issues_comments()
}
For this one, the loop is in the generic function.
Another way:
download_bugzilla_rest_issues_comments <- function(...) {
  # make API call
  # download, name, and write files
}

download_bugzilla_rest_issues_comments_by_date <- function(...) {
  repeat {
    # change query
    # call download_bugzilla_rest_issues_comments()
  }
}
Here, the loop is in the by_date function.
I was wondering which way is the best way to do it. Let me know if you need me to go into more detail about this.
Just so I am clear, the offset is already handled by the existing Bugzilla function, right? The question is where you will place it?
From what you said, the more future-proof way is the second option, modifying the query in download_bugzilla_rest_issues_comments_by_date(). This lets download_bugzilla_rest_issues_comments() truly be a query the user has full control over, rather than "something the user doesn't see tampering with their query". download_bugzilla_rest_issues_comments_by_date() is one way the user could create a query, and download_bugzilla_rest_issues_comments() is compatible with all of them: it just cares about taking the query and issuing the command. Meanwhile, every new function is self-contained too. Arguably there is a bit of redundancy in this approach, in that the offset logic ends up duplicated throughout any new function like by_date(), but if any function ever needs to modify that logic in the future, the flexibility is there. If the rationale I provided sounds contradictory to the change you will make, I likely misunderstood the options, so feel free to iterate further with me if needed or if you see a pro/con I did not enumerate.
I see, so I will be using the second option then. Also yes, the function handles moving the offset parameter.
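To make the chosen layout concrete, here is a rough sketch of option 2 in R, assuming httr and jsonlite for the request and parsing (the function names follow the discussion above, but the bodies are illustrative, not the actual kaiaulu implementation):

library(httr)
library(jsonlite)

download_bugzilla_rest_issues_comments <- function(bugzilla_site, query, save_file_path) {
  # Generic downloader: issues exactly the query it is given, saves the raw
  # json, and returns the parsed page so the caller can decide what to do next.
  response <- GET(paste0(bugzilla_site, "/rest/bug"), query = query)
  raw_json <- content(response, as = "text", encoding = "UTF-8")
  writeLines(raw_json, save_file_path)
  fromJSON(raw_json, simplifyVector = FALSE)
}

download_bugzilla_rest_issues_comments_by_date <- function(bugzilla_site, start_timestamp,
                                                           save_folder_path, limit = 20) {
  # The by_date function owns the query and the paging loop.
  offset <- 0
  repeat {
    query <- list(creation_time = start_timestamp, limit = limit, offset = offset)
    save_file_path <- file.path(save_folder_path,
                                paste0("bugzilla_issues_", offset, ".json"))
    page <- download_bugzilla_rest_issues_comments(bugzilla_site, query, save_file_path)
    if (length(page[["bugs"]]) == 0) break  # empty page: nothing left to download
    offset <- offset + limit
  }
}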
I recently found out that when you run the downloader on Bugzilla sites other than Red Hat, it returns a differently formatted .json file. The biggest difference is that Red Hat's json file has 'limit', 'offset', and 'total_matches' fields, whereas the files from other project sites do not. For example, when you run this API call for Red Hat, the three fields mentioned above are listed at the bottom of the json file. When compared with this API call for GCC, the three fields are not in the json file. The fields are also missing for this Yocto Project API call and this Mozilla API call.
Another difference between Red Hat and other Bugzilla websites is the inconsistency of the comments. The format of the comments field in the Red Hat API call is different from the format of the comments field in the Mozilla API call. For example, the Red Hat API call has a 'creator_id' field in its comments, whereas the Mozilla API call does not.
The first problem is the motivation for issue #300. Since offset is not returned in all of the json files, it is confusing that it is used in the API call. To make the API calls more intuitive and to match how other API calls are made (GitHub), we want to change the API call to advance the start_timestamp instead of the offset.
The second problem is the reason issue #299 exists. Since different API calls return different comment fields, the Bugzilla issues-comments parser does not work for all json formats. More research on what the comments field looks like on different Bugzilla websites would help figure out a solution.
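Until #299 and #300 are resolved, a defensive check along these lines (a sketch; the file name and variable names are illustrative) can tell the two formats apart after parsing a downloaded page with jsonlite:

library(jsonlite)

page <- fromJSON("redhat_issues_page.json", simplifyVector = FALSE)

# Red Hat pages carry pagination metadata; GCC, Yocto Project, and Mozilla pages do not.
has_pagination_fields <- !is.null(page[["total_matches"]]) &&
                         !is.null(page[["limit"]]) &&
                         !is.null(page[["offset"]])

# The bugs list itself is returned by every site mentioned above, so its length
# is the safer thing to rely on.
n_bugs <- length(page[["bugs"]])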
total_matches: the number of issues that match the query (e.g. all issues created at or after the given creation date).
limit: the page size of the API call; even if total_matches is 100 but the limit is 20, the page of json will only have 20 issues.
offset: where the page starts within the matched issues; if the offset is 10 and 20 issues come back, the returned issues start at the 11th.

Due to the missing limit, offset, and total_matches fields, I changed the way the downloader does its checks. This means I changed how the downloader knows how many issues were downloaded, and I changed how the limit is dynamically adjusted.
Before, the downloader used the total_matches field from the Red Hat json file on every page it downloaded to check how many issues were on the page. Since projects other than Red Hat do not return that field in their json files, I changed it to check the length of the bugs list obtained from the json file. If the length of the list is 0, there are no issues on the page and the loop ends. The length of bugs and total_matches agree because when the offset is high enough, total_matches will be 0. For example, if there are 70 total issues and I make the API call:
/rest/bug?created_date=2024-01-01T00:00:00Z&limit=20&offset=80
the offset is above the total number of issues, so there are no issues to return and total_matches will be 0. That also means the bugs list that is returned has length 0. This is why using the length of the bugs list is equivalent to using total_matches.
To change the limit, the downloader used to check whether limit was equal to the limit_upperbound parameter. Since there is no limit field now, the downloader uses the length of the bugs list to change the limit. The two are interchangeable because the maximum number of issues you are allowed to download is the number of issues that are returned, so the limit value must equal the length of the bugs list.
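The revised checks could be captured in small helpers like these (a sketch under the assumption that the page was parsed into a plain list; kaiaulu's actual internals may differ):

page_is_empty <- function(page) {
  # Replaces the old total_matches == 0 check, which only Red Hat supports:
  # an empty bugs list means the offset has moved past the last matching issue.
  length(page[["bugs"]]) == 0
}

issues_on_page <- function(page) {
  # Replaces the old comparison of the limit field against limit_upperbound:
  # the number of bugs actually returned is the effective page size.
  length(page[["bugs"]])
}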
We could also drop the limit and offset entirely and instead rely on the timestamp, which may be more sane. #300
1. Purpose
The purpose of this issue is to create a refresh capability for the Bugzilla downloader and parser. This means updating the download_bugzilla_rest_issue_comments and parse_bugzilla_rest_issues_comments functions. These will be used in a refresh_bugzilla_issues_comments function so that Bugzilla issues can be constantly updated by a cron job.

2. Process
I will be basing my changes mostly on existing code. I will update download_bugzilla_rest_issue_comments by adding a comments parameter, which allows the user to download issues with or without comments. I also added a verbose parameter for more details on the execution status. In addition, I separated the formation of the query and the API call into two different functions. download_bugzilla_rest_issue_comments now takes a parameter called query, which is a REST API query that it uses to form an API call and download the data into a json file. I then created a new function called download_bugzilla_rest_issue_comments_by_date, which takes a start_timestamp parameter and forms a REST API query with that timestamp. It calls download_bugzilla_rest_issue_comments in a loop until there are no more issues on the page.
To make the refresher, I will use the function refresh_bugzilla_issues_comments, which will check the save folder path for files. If there are files, it will find the most recent issue using the parse_bugzilla_latest_date function and then download the issues created between the most recent issue's creation date and today. If there are no files in the save folder path, it will download ALL issues for the Bugzilla page.
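A rough sketch of that refresh flow (parse_bugzilla_latest_date and the by_date downloader are the planned functions described above; their exact signatures here are assumptions):

refresh_bugzilla_issues_comments <- function(bugzilla_site, save_folder_path) {
  existing_files <- list.files(save_folder_path, pattern = "\\.json$")
  if (length(existing_files) == 0) {
    # No prior downloads: start from the epoch so that ALL issues are fetched
    # (the concrete "download everything" timestamp is an assumption).
    start_timestamp <- "1970-01-01T00:00:00Z"
  } else {
    # Resume from the creation date of the most recent issue already on disk.
    start_timestamp <- parse_bugzilla_latest_date(save_folder_path)
  }
  download_bugzilla_rest_issue_comments_by_date(bugzilla_site, start_timestamp,
                                                save_folder_path)
}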
3. Endpoints
The endpoint being used is /rest/bug. The query also includes creation_time, limit, and offset. The creation_time query is specific to the second, which means it is possible to download duplicate issues if they were created in the same second; however, this is extremely unlikely, so it is acceptable. If the user wants to download issues and comments (comments = TRUE), then the query will also include include_fields=_default,comments. For more information on the include_fields query, you can go here.
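For reference, the query described in this section could be assembled along these lines (a sketch; the helper name is made up for illustration):

make_bugzilla_query <- function(start_timestamp, limit, offset, comments = TRUE) {
  query <- list(creation_time = start_timestamp, limit = limit, offset = offset)
  if (comments) {
    # comments = TRUE: fetch comments alongside the default bug fields in one call.
    query[["include_fields"]] <- "_default,comments"
  }
  query
}

make_bugzilla_query("2024-01-01T00:00:00Z", limit = 20, offset = 0, comments = TRUE)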
4. Task List
5. Unfinished Parts
There are a few important parts that I was unable to get to. The issues-with-comments parser does not work on websites other than Red Hat; more information about this is found on issue #299.
Another unfinished task is moving away from the offset query and instead advancing the start_timestamp query; more information about this is found on issue #300.
Relevant information can also be found here.