JIRA Downloader (Milestone 1)

1. Purpose

The purpose of this issue (#275) is to implement a Jira issue downloader function that eliminates the dependency on JirAgileR, which is called in vignettes/download_jira_data.Rmd because JirAgileR is no longer maintained. The new downloader function removes the addition of offline metadata that adds overhead and demands more of unit testing. The new function also downloads issues into new files with a new naming convention.

We also implement pagination so that the files can be downloaded in chunks and not all at the same time. The purpose of this is that for very large projects, downloading all the issues may exceed the maximum allowed per hour, which will result in our API access to be blocked or unable to obtain the full dataset if always starting from the first issue (see https://github.com/sailuh/kaiaulu/issues/253).

Working with individual files with different naming conventions was also deemed an ideal way to implement this function in the future to avoid downloading already existing files as well as not downloading more than the current (5000) download per hour limit.

The downloader will also interface with our parser built in (#276) to acquire either date or issue range of the downloaded files so that a refresh capability can be added at a later time.

2. Endpoints

First, we want to determine which JIRA REST API endpoints we want to use. A comprehensive list of available endpoints can be found here: https://docs.atlassian.com/software/jira/docs/api/REST/7.6.1/

Since we are attempting to reproduce the capabilities of JirAgileR: (https://github.com/matbmeijer/JirAgileR/blob/7626a419f8f9e19aa6d73bb65e7a5c1c7c4da26e/R/exports.R#L540-L546), we can analyze their code and find that they use the 'search' endpoint as indicated by this line:

path = adapt_list(url$path, c("rest", "api", "latest", "search")),

This means that Kaiaulu relies on the search endpoint (rest/api/2/search), which we will use for this new function (https://developer.atlassian.com/cloud/jira/platform/rest/v2/api-group-issue-search/#api-rest-api-2-search-post).

This endpoint notably allows us to define parameters that only return results that fall within them.

For instance, to access issues only related to a particular project, one would need to use the project parameter:

https://support.atlassian.com/jira-software-cloud/docs/jql-fields/#Project

use of the project parameter in the jql field may look something like this: https://issues.apache.org/jira/rest/api/2/search?jql=project=GERONIMO

2.1 search by issue key range

One jql query of note is the ability to specify ranges of issue keys in the search endpoint. Searching for issues ranging from 100 to 200 in the GERONIMO project may look something like this:

https://issues.apache.org/jira/rest/api/2/search?jql=project=GERONIMO AND issueKey >= GERONIMO-100 AND issueKey <= GERONIMO-200

This returns only issues whose project keys are within the specified range

2.2 search by date created

The search endpoint allows us to specify fields to return, the most interesting of which is the 'created' field:

https://support.atlassian.com/jira-software-cloud/docs/jql-fields/#Created

These fields can be used to specify jql queries and when constructed, one may look like: https://issues.apache.org/jira/rest/api/2/search?jql=project=GERONIMO AND created >= 2021-01-01

This allows us to construct a query that returns issues from the project "GERONIMO" that were created on or after 2021-01-01. Specification can be as precise as a minute field, the format of which would look like: "yyyy-MM-dd HH:mm"

2.3 Using search by date created for a refresh functionality

The idea of the refresh functionality is to enable the ability to download data that has not already been downloaded. This is especially useful for very large data sets in which the total number of issues is larger than the amount allowed to be downloaded per hour (currently 5000). If data were to begin downloading from the beginning at each function run, then any issues above the limit will never be downloaded. If total number of issues are still below that cap, re-downloading previously downloaded data adds unnecessary overhead.

issue #276 aims to build a parser that will parse the data downloaded from this function. From the parsed function, a refresh capability may be added by iterating through the 'created' fields of the downloaded data and finding a max value (most recently created issue) and appending this data to the jql query when performing the API call. The max value can be found using an external function or from within the current function before making the call.

However, since the Jira API can only use the "yyyy-MM-dd HH:mm" format to specify date ranges, we run into the issue of either downloading duplicate data or skipping data. For instance, if the most recent created date was 2020-01-01 12:31 and we used 'AND created >= 2020-01-01 12:31', then we would download that same issue again. If we were to use 'AND created > 2020-01-01 12:31' instead, then we run the risk of skipping any issues that were create on the same day at the same time. Though this is a very unlikely occurrence, a better method of obtaining new issue data would be by using the issueKey

2.4 Using search by issueKey for a refresh functionality

Instead, we will use issueKey to implement a refresh capability. The parser built in #276 will instead return the filename with the greatest 'created' field by parsing the filenames (filename convention can be found in 3.1). This file should also therefore contain the issue with the greatest issueKey value. A new function will take this file and extract the largest issueKey then append it to the search query called by the downloader function.

3. Data Schema to support Refresh

3.1 Folder Organization

The filepath needed to be changed because previously all issues were downloaded into one single json file. This was okay because issues and comments were all saved in their own .json file. With potentially thousands of files now being downloaded, file for both issues and issue comments should be created.

I analyzed other downloaders in kaiaulu and attempted to created a filepath schema that was similar to theirs. For instance, download_github_comments.rmd saves both issue and pull request data along the path: ../../rawdata/github/kaiaulu. Folders are created inside the kaiaulu folder: "issue" and "pull_request" respectively so that each has their own folder inside a folder 'kaiaulu' that specifies which project. The bugzilla downloader also utilizes a filepath that uses a folder specifying 'bugzilla'.

Currently, the jira downloader filepath is:

../../rawdata/issue_tracker/geronimo_issues.json

For sanity and to adhere to existing filepath schema, we will likewise create new files for issues and issue_comments inside a folder specifying which project these data belong to (geronimo in this example).

The filepath for issues for a project named geronimo would be:

../../rawdata/issue_tracker/geronimo/issues/

The filepath for issue comments for a project named geronimo would be:

../../rawdata/issue_tracker/geronimo/issue_comments/

In the config file, this may look like:

issue_tracker:
  jira:
    # Obtained from the project's JIRA URL
    domain: https://issues.apache.org/jira
    project_key: GERONIMO
    # Download using `download_jira_data.Rmd`
    issues: ../../rawdata/issue_tracker/geronimo/issues/
    issue_comments: ../../rawdata/issue_tracker/geronimo/issue_comments/

it is important to note that 'geronimo' in this example should be replaced with the project name in the config file you are using

3.2 Naming convention

The files are saved into different files because for each pagination loop, the file name will be changed. The current naming convention is the [projectkey]_issues_[UNIXTIME of first issue]\[UNIXTIME of last issue].json for issues and [projectkey]_issue_comments_[UNIXTIME of first issue]\[UNIXTIME of last issue].json for issues with comments where UNIXTIME is the UNIXTIME of the 'created' field of each issue and [project_key] is the project's key from the config file

For example, if the save path passed to the function is ../../rawdata/issue_tracker/geronimo/issues and the first and last UNIXTIME are 4 and 5 respectively (though not realistic), the file will be saved in the directory ../../rawdata/issue_tracker/geronimo/issues and the filename will be GERONIMO_issues_4_5.json.

The distinction of the appended descriptor is important in the creation of the parser in #276 as well being identified for their different contents by filename, particularly iterating through filenames and returning the 'key' field in the file with the greatest [UNIXTIME of last issue]. One functionality of the parser will be to return the maximum 'key' field of all issues. This parser function can be called from within the refresh function before the api calls to append the Created jql query to the api call a la

file_name_with_greatest_issueKey <- "PROJECT_issue_comments_1121646814_1121719175.json"
...
jql_query = paste0(
    "project='", issue_tracker_project_key, "' AND issueKey > ", file_name_with_greatest_issueKey)

In theory, this should return only issues that have key values greater.

3.3 Fields for issues and issues_comments

Fields retrieved can be edited in download_jira_issues.rmd function calls. Currently, the issue downloader has not changed any fields that were called by its predecessor.

Current fields for issues:

issues
description
creator
assignee
reporter
issuetype
status
resolution
components
created
updated
resolutiondate

The fields on the jira UI may look something like this: <img width="910" alt="Screenshot 2024-02-29 at 11 39 54 AM"

Current fields for issue_comments

issue comments contain the same field but also download issue comments: <img width="588" alt="Screenshot 2024-02-29 at 11 40 02 AM"

4. Task List

Step 1

R/Jira.R/download_jira_issues (domain, username, password, jql_query, fields, save_path_issue_tracker_issues, maxResults, verbose, maxDownloadsPerHour):

[x] This function takes the domain, username, password, jql_query, fields, save path, maxResults and verbose parameters and makes repeated calls to JIRA REST API via the rest/api/2/search endpoint to access issues and download them each with their unique naming convention into their own JSON file along the designated save path.
- Append API query to URL
- Authenticate user via username and password
- Request data and error check
- Naming schema for each issue
- Loop until all issues are downloaded are maxResultsPerHour is reached
[x] add a username/password input to the file. The password is an Atlassian token
[x] Add pagination to function. All issues downloaded per “page” should be saved to the same file
[x] Add naming convention: [projectkey]_issues_[UNIXTIME of first issue]\[UNIXTIME of last issue].json where UNIXTIME is the UNIXTIME of the 'created' field of each issue and [project_key] is the project key from the config file
[x] Add an upper bound to the number of issues to be downloaded: maxDownloads. Function will cease api calls before this is reached.
[x] add parameter searchQuery that is optional. This appends a query (eg AND created >= ‘2020-01-01’) to the jql query. This is important for other functions
[x] Avoid using jsonlite::fromJSON JirAgileR used the jsonlite::fromJSON method to retrieve data from the http GET call to the API and so I thought it was okay to use this call again. fromJSON, however, retrieves the data as an R object and then uses write_json to convert that R object to a json file saved locally. The goal is to download the raw data and not alter it in this way at all, so another way of retrieving the data must be found. writeLines() will be used instead .https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/writeLines
[x] Error check if api is returning requested (maxResults) amount of issues per call. length(content$fields$issues) < maxResults and < total. Sometimes if the total number of issues is less than your maxResults, it is because there are simply less issues available given the jql query but sometimes this is also an error.

Step 2

vignettes/download_jira_issues.rmd This notebook contains the function calls to download_and_save_jira_issues as well as instantiating global variables based on value in the designated config file. I will be using geronimo.yml in this notebook.

[x] Seed reset, required packages, instantiate variables from designated config file
[x] chunk to create filepaths if they do not exist already
[x] Call to download all Jira issues
[x] Call to download Jira issues filtered by created date
[x] Call to download Jira issues filtered by issueKey range
[x] Refresh capability of Jira issue donwloads
[x] Call to download all Jira issues with comments
[x] Call to download Jira issues with comments filtered by created date
[x] Call to download Jira issues with comments filtered by issueKey range
[x] Refresh capability of Jira issues with comments download

Step 3

/R/jira.R/download_jira_issues_by_date (issue_tracker_domain, credentials, jql_query, fields, save_path_issue_tracker_issues, maxResults, verbose, maxDownloads, date_lower_bound, date_upper_bound)

[x] takes in optional lower and upper bound parameters for the 'created' field to modify the API call and then use this query to call the main function download_and_save_jira_issues to return only issues in these bounds
```
created_query <- ""
if (!is.null(date_lower_bound)){
created_query <- paste0(created_query, "AND created >= '", date_lower_bound, "'")
}
if (!is.null(date_upper_bound)){
created_query <- paste0(created_query, " AND created <= '", date_upper_bound, "'")
}
```
Created_query is passed as search_query to download_and_save_jira_issues()

Step 4

/R/jira.R/download_jira_issues_by_issue_key (issue_tracker_domain, credentials, jql_query, fields, save_path_issue_tracker_issues, maxResults, verbose, maxDownloads, issueKey_lower_bound, issueKey_upper_bound)

[x] takes in optional lower and upper bound parameters for the 'issueKey' field to modify the API call and then use this query to call the main function download_and_save_jira_issues to return only issues in these bounds

created_query <- ""
if (!is.null(issueKey_lower_bound)){
created_query <- paste0(created_query, "AND issueKey >= ", issueKey_lower_bound)
}
if (!is.null(issueKey_upper_bound)){
created_query <- paste0(created_query, " AND issueKey <= ", issueKey_upper_bound)
}

Step 5

refresh_jira_issues <- function(domain, credentials, jql_query, fields, save_path_issue_tracker_issues, maxResults, verbose, maxDownloads, file_name, unaltered_file_path)

[x] Takes a filename and extracts the greatest value issueKey from the issues in that file. Constructs a query that looks like "AND > issueKey" and appends it to the jql query of download_and_save_jira_issues() function call to only download issues with issueKey values greater than that value.
[x] update download_jira_issues.rmd to set the value of the filename and call the function
[x] Use the parser built in #276 to set the filenames passed to function

Notes:

Use this httr version: https://github.com/sailuh/kaiaulu/blob/9294a9bb92506b76424b0fcc7c07f6b74d5ce53d/DESCRIPTION#L39
Get used to JIRA as an issue tracker using Kaiaulu JIRA Sandbox: https://github.com/sailuh/kaiaulu/issues/252
Try to use the Existing Download JIRA Notebook to download said data. You want to understand how the data you see downloaded relates to the data in the web interface. In addition, consider the data the JirAgileR library offers via Kaiaulu Notebook. You want to ensure you obtain minimally the same data and columns.
Make sure you understand how pagination works.
Want to make sure you use the appropriate API Endpoint in order to enable Refresh, so issue https://github.com/sailuh/kaiaulu/issues/253 is addressed.
For information on JIRA API and how it all works, with examples so you try on a browser, see: https://github.com/sailuh/kaiaulu/issues/228
You want to iterate early with me both in discussion and code so I can provide early feedback before you go too far. Follow the guidelines to send a PR here: https://github.com/sailuh/kaiaulu/blob/master/CONTRIBUTING.md
Also, have a look on how to organize your Issue (in particular the first post, so it is easier for everyone to refer to):https://github.com/sailuh/kaiaulu/blob/master/.github/PULL_REQUEST_TEMPLATE/student.md
Let's adopt Code revision discussion = PR. Issue clarifications = Issue. Separating is more accessible as the various code revisions can make finding information of the issue harder.

I may edit this message to add more information if needed unless I am replying to a subsequent comment. In that case, no e-mail will be sent as a notification.

I have written a new function that downloads the data based on the fields required. There are separate function calls to acquire the issue data and the comment data along the search api. I have recently implemented pagination:

https://github.com/Ssunoo2/kaiaulu/blob/a0b18712865a2e29605c09424f2f987f1ba27eb0/vignettes/Download_jira_data_without_JirAgileR_pagination_newfilesforeach.Rmd#L133-L144

The pagination works by constructing a new filename along the path designated in the config file. Currently it appends the issue id and the time downloaded, though these are subject to change in the future, especially if they become pivotal in implementing a refresh capability.

I am experimenting with ways to implement a refresh capability. My main lack of understanding right now is what field I should be checking, though I am leaning towards the "created" field so that I only download issues created after the most recent created date that exists in the files already downloaded. I was also toying with the idea of using the issue id field, though I'm not completely sure how these numbers are created, as for instance in the geronimo issues, there are ids between 0 and 782, then seemingly out of place issues 2665, 4273, 5184, etc, and I am unsure if they are necessarily correlated to their created date (eg larger ids are more recently created). If they are, then I can do a check for the largest issue id and then write something that only downloads ids larger than the current largest. I am referencing the information in #228 to help guide me.

A couple notes from https://github.com/sailuh/kaiaulu/issues/275#issue-2116131626:

add a username/password setting to a config file. The password is an Atlassian token

What do you mean? Are you suggesting the username and password will be added to the project config file?

The new downloader function removes the downloading of extraneous data

I believe you are referring to baseinfo and extinfo here? If so, then this is not quite accurate. This is not additional data downloaded by JirAgileR, it is downloading the issue data as we are from the endpoint, but it adds additional metadata offline. This just adds overhead and forces unit tests on our end to comply to its interface rather than the issue tracker. That's why we want to get away from it.

The purpose of eventual pagination is the future implementation of a refresh capability in which only issues that have not already been downloaded can be downloaded again.

This is not accurate. Pagination is its own separate thing than the refresh capability we also want to add. Pagination just means you are not just downloading the first page worth of issues, but iterate through multiple pages. You can have pagination without the refresh capability. But this is inefficient: Every time you run the function, it downloads everything again. The problem is introduced is reflected on https://github.com/sailuh/kaiaulu/issues/253 for GitHub which has an API key, but I can imagine for very large JIRA projects, an IP block would be issued if it just tried to download everything in one go. This is why the refresh is so important: So the data can be downloaded in chunks.

I will also add that the downloader should be able to request information from the parser to know the time range (and/or issue range) to download the new issues accordingly. The parser will not provide the downloader with a page number: This would introduce duplicates. Therefore, the first comment of this issue has to make it clear what end points will be used, and how it will communicate with the parser to obtain said information @ian-lastname). I.e. in this issue it should be clear what function from the parser specification it will obtain the needed information and how it will handle it (on your function specification).

vignettes/fetch_and_save_jira_issues.Rmd/fetch_and_save_jira_issues

This is not clear to me. The functions are defined in R/ and used in the vignettes. In your case, this would be R/jira.R.

Please also check and include on your first post the convention of function names used in the current JIRA downloader, as well as the GitHub and Bugzilla downloader. I do not believe fetch_and_save_jira_issues is consistent with them.

Make a call to fetch the issues with relevant fields data Make a call to fetch issue comments with relevant fields data

Please expand this accordingly to the template specification. Also include the discussion associated on how the refresher will work as noted above under the relevant functions.

I'm wondering if the download format is supposed to have multiple issues per page. The example one does here: https://issues.apache.org/jira/rest/api/2/search?jql=project=SPARK%20AND%20created%20%3E=%202021-01-01%20AND%20created%20%3C=%202021-01-02

And here is a screenshot of how my issues are downloading right now:

Initially I thought that each issue should download in its own file, but mirroring the template, I altered my function to download each issue for each pagination loop (eg if maxResults is set to 5, each chunk downloaded would download 5 files) to download into the same file like the image above.

@ian-lastname has told me that his parser function does work on both the template and this file which was downloaded by my function, but I just wanted to double check that this format is correct

It also looks like the other values are zero indexed:

This doesn't seem to be an issue for the parser so I wondering if you think this will cause any issues

Comment 1

@Ssunoo2 In regards to this comment (https://github.com/sailuh/kaiaulu/issues/275#issuecomment-1953553151), can you double check the raw JSON is also different using the pretty print tab? Just want to be sure we are not guessing things are different when the formatter of Firefox is the issue here. Would be helpful seeing how they compare at raw format too.

Comment 2

In regards to this issue requirements (https://github.com/sailuh/kaiaulu/issues/275#issue-2116131626), I edited your comment and formatted a bit. Enumerating sections and tasks makes easier for me to refer them to. On that note:

I added a Section 2, endpoints. This is a very important aspect of this downloader. I would like you to move some of issue's #228 discussion of endpoints here. In particular, I would like you to enumerate the information there that contains examples of endpoints, including how to query the parameters by date and issue since your downloader has to support that in order to support the refresh. You do not need to explain how it works, just what options are available. One of your tasks then has to more clearly specify how they are being implemented in your function, and how you use the information from #276 to have refresh.

I also added a Section 3, Data Saving Schema. Please elaborate on this section on the file organization. This is vital, and interfaces with @ian-lastname so it should be easier for him to find when working on his task.

Task 4.2 should be a single file, not two:

your_api_token <- scan("~/.ssh/atlassian_token",what="character",quiet=TRUE) username <- scan("~/atlassian_username",what="character",quiet=TRUE)

Task 4.3: You should double check what happens when maxResults is upper bounded by the website you are requesting too. For example, if you were to put 999999 I am fairly certain the website will force a lower number. If you do not handle this properly, you will lose issues. What the team that implemented the bugzilla downloader did was, they allow the user to set a maxResults upper bound, but after the very first download, they scan the downloaded JSON to see how many were actually downloaded. This is then used to update the maxResults thereafter to avoid skipping issues. Please check their code if needed and/or issues.

Task 4.4: The name convention of the file doesn't make sense to me, since you prefix the file name with the path. Please add an example.

Task 4.5: Please don't sleep the code. Nobody wants their session locked when running the downloader. The code should finish when the limit is reached. In the case of JIRA, that means at every download request, you need to check if you are receiving no data. If that is the case, your code should simply end and raise a warning (not an error) to the user. This is exactly why we need the refresher capability. So when the user calls the refresher, they can then resume.

Task 4.6: I am not sure why you keep referring the refresher is for a later date: The refresher is part of milestone 1 and should be available from the get go. Please make sure you consider how that will be accomodated on your function.

Task 4.7: Geronimo is fine, but bear in mind they have a lot of issues. You want to be careful you start testing your code downloading just a small range of issues using the endpoint, or you may get IP blocked! You should also test with Kaiaulu JIRA, see Kaiaulu config file.

Thanks for the feedback! Regarding

difference in json formats

from my downloader:

Original Reference:

It does look like the raw forms are slightly different for the expand, startAt, maxResults, and total than the template. @ian-lastname has said that both of these work for his files, however. I also looked at the downloaded files from the original downloader and they do not include these so I am inclined to think that the difference in formatting may be inconsequential for now if not more overhead.

Task 4.2

my understanding is that we need both a username and a password to authenticate. I am using this code:

if(!is.null(username) && !is.null(password)){
      auth <- httr::authenticate(as.character(username), as.character(password), "basic")
    } else {
      auth <- NULL
    }

What do you mean that it should only be one line? should one or the other just be written by a user manually into the function call?

Task 4.5:

I am running a loop for each page requested and I have this code here:

if (length(content$issues) < maxResults) {
      break
    } else {
      startAt <- startAt + maxResults
    }

The idea is that if less issues are downloaded per page than the number expected per page (maxResults), then there are no more issues to be downloaded and the function ceases api calls

I have also added another parameter to the function call maxDownloadsPerHour. This essentially adds the number of issues downloaded to a counter and if the counter equals or exceeds this number, it currently sleeps until the next hour. But I understand that no one wants their system sleeping so I suppose I will change this to just break out of the function and return a message that the max downloads were reached. I could also add a check and make sure that maxResults does not exceed this and returns an error if it does.

Refresh capability

My thought is to go through existing files (if there are any) in the save path and retrieve the most recent 'created' field and save this as a local variable, then append this to the jql query to retrieve issues only created after this date:

maxValue <- parseMaxValue(data)
...
jql_query = paste0(
    "project='", issue_tracker_project_key, "' AND created > '", maxValue, "'"

This is the basic idea of how I am currently intending to implement it. Since the parser is not completed yet, I will try to implement and test it with a dummy value in YYYY-MM-DD format

Regarding Format

The difference is important. The fake data generator got a lot of unnecessary code due to complications of very little formatting. Instead of comparing downloaders, please use the JIRA API docs, there should be an example. Could you paste it here?

Task 4.2

Not one line. I mean one file. Please use one file for the user to include username and password, not two.

Task 4.5

The idea is that if less issues are downloaded per page than the number expected per page (maxResults), then there are no more issues to be downloaded and the function ceases api calls

No, this is a risky assumption. I am saying that you may specify you want 9000 issues per page, but the project's API may return 50 instead. After you download the first time, you have to double check how many issues you actually got and adjust your maxResult accordingly, while issuing a warning to the user. See the bugzilla downloader and associated issue discussion.

I have also added another parameter to the function call maxDownloadsPerHour. This essentially adds the number of issues downloaded to a counter and if the counter equals or exceeds this number, it currently sleeps until the next hour. But I understand that no one wants their system sleeping so I suppose I will change this to just break out of the function and return a message that the max downloads were reached. I could also add a check and make sure that maxResults does not exceed this and returns an error if it does.

Yes, raise a warning so the user knows not all data is downloaded, and advise they should use the refresher at a later time. We should likely add an upper bound on the total number of issues the user wants to download after calling the function. If the project has 1 million issues, the user may wish to download a smaller set a day to prevent ip block. Do not lock the user session on sleep.

Refresh capability

I will go over this after I have a chance to re-read your first post. Thanks for updating!

Please update the first issue to reflect the agreed file schema and the list of functions before it gets lost to memory.

The files are saved into different files because for each pagination loop, the file name will be changed. The current naming convention is the [savepath][UNIXTIME of first issue][UNIXTIME of second issue].json where UNIXTIME is the UNIXTIME of the 'created' field of each issue. This makes the filenames less readable but easier to search for in the parser that is being built

This is not correct, I do not see the _ . In addition, please update the first issue to reflect the requested changes: I still see the token defined as two files instead of one.

Add sub-section enumeration, markdown is hard to tell what is sub-item from main item. The task is still too verbose and each bullet does not reflect a function. Please move details to sub-bullets. Formatting is a necessary step here given the limitations of markdown I'm afraid.

I've tidied up my first issue comment. Additionally, I've updated the correct authentication sample code and the naming convention. Turns out I had forgotten a backspace before the underscore so the underscores were not visible.

Api Request errors

you may specify you want 9000 issues per page, but the project's API may return 50 instead. After you download the first time, you have to double check how many issues you actually got and adjust your maxResult accordingly, while issuing a warning to the user. See the bugzilla downloader and associated issue discussion.

I have added a check on the first pagination request. If the number of issues returned (issue_count) is different from the number requested (maxResults) and the total issues that can be downloaded (total) is confirmed to be greater than maxResults, it will adjust the maxResults accordingly and print an informative message. I am going to work a little bit more on this because I feel it may need more work.

if ((downloadCount == 0) && (maxResults != issue_count) && (total >= maxResults)) {
      message("Total number of issues queried: ", total)
      message(". maxResults specified: ", maxResults)
      message(". Number of issues retrieved: ", issue_count)
      message(". Something went wrong with the API request. Changing maxResults to ", issue_count)
      maxResults <- issue_count
    }

Warning message when maxDownloads is reached

I've added a check to make sure that the function ceases API calls when the upper limit (maxDownloads) is reached

 # Check update our download count and see if it is approaching the limit
    if (downloadCount + maxResults > maxDownloads) {
      # erorr message
      time <- Sys.time()
      message("Cannot download as maxDownloads will be exceeeded. Reccommend running again at a later time. Downloads ended at ", time)
      break
    } # Resume donwloading

Could you please add a sub-section on Section 3 that mentions the folder organization I showed on last call for bugzilla, github, and how jira is organized to justify the folder naming organization. I don't see that here.

Issue comment now includes folder organization and screenshots of the jira UI with fields highlighted

I identified an issue in the refresher that results in duplicate issues being downloaded. Because the format of the API request is YYYY-MM-DD, if you were to identify 2020-01-01 as the most recent value of the created field and then appended ‘AND created >= 2020-01-01’ to the api request, you would download the issues created on that date again. Conversely, if you were to use ‘>’ rather than ‘>=’, you may skip any files that were created on 2020-01-01 but at a later time than the ‘most recent issue’. For this reason, I will adapt the function to to use issuekeys instead.

The parser created in #276 previously returned a date of the most recently created file in a filepath. It will now be altered to return the filename that holds the most recent issue. A new wrapper function will be created that will open this file and return the greatest issueKey value and append it to the JQL query so that only issues greater than this key will be downloaded. This should solve the issue of duplicate files being downloaded.

Follow-up from call: @Ssunoo2 confirmed issue date range can go down to minute level. Should add that to the function documentation and extend the date range function to account for that. The actual refresher will continue to use JIRA issue IDs, so nothing else should change.

I ran into a small issue where, when replacing camelCase variable names with not_camel_case, the downloader broke and would download the first page until the limit was reached. The problem was with this line here:

# Correct
query_params <- list(jql = jql_query, fields = paste(fields, collapse = ","), maxResults = max_results, startAt = start_at)

Since we're using these values as parameters, the api endpoint uses maxResults and startAt as values and so changing these to max_results and start_at :

# Incorrect
query_params <- list(jql = jql_query, fields = paste(fields, collapse = ","), max_results = max_results, start_at = start_at)

would set the maxResults to default (50) and startAt to default (1). It also broke updating the startat and so the startAt value would remain at 1 indefinitely and continue downloading the same page until a limit is reached.

I've edited back the naming convention on commit 9161bae to be PROJECTKEY_STARTUNIX_ENDUNIX.json as of the commit above.

If there is time, I will add a regex that checks for the presence of the comment field to specify if it is a request for issue, or issue with comments. However, considering a JQL query can virtually have everything the user wants field wise, I guess it will be less misleading if it only guarantees what is known for certain.

Now we know the identifier of github is also different (i.e. owner/reponame), it is fairly safe to say that PROJECTKEY tells apart JIRA files too for someone manually browsing (the code does not need said assumptions since each function handles a different project).

@Ssunoo2 well done! This issue specification (first comment) speaks lengths of how much thought and consideration was put into the downloader. :^)

I am closing this issue. With it, Milestone 1 is officially behind us.

I still have a open question on what time zone the date endpoint uses on request since there is no field for timezone. If you are aware, please feel free to comment here without re-opening the issue. Ultimately the refresher uses issue keys, so this is of lesser concern for the JIRA downloader, but it will be a bigger concern on #282 and #285.

sailuh / kaiaulu