pycontribs / jira

Python Jira library. Development chat available on https://matrix.to/#/#pycontribs:matrix.org
https://jira.readthedocs.io
BSD 2-Clause "Simplified" License

Provide Pagination Support #1010

Open CallMeBW opened 3 years ago

CallMeBW commented 3 years ago

Many of the Jira resources are paginated: the initial request returns only a chunk of the data, and the developer has to retrieve the remaining results by increasing the startAt parameter. Besides producing boilerplate code, this is a potential source of errors, as it requires keeping track of the last index returned, etc.

If the Jira wrapper came with pagination functionality, it would be more convenient to work with the API. For example, retrieving all sprints should be a one-liner. I would like to propose a class that takes a generic jira function and provides a method that accumulates all results.

"""
Module to help with pagination of jira wrapper functions.
"""
from typing import Any, Callable
from jira.client import ResultList

class Pager:
  """ A Pagination Handler for the jira API wrapper.
  Jira returns result lists with limited resources. To retrieve all resources,
  a number of requests have to be made. The query parameters 'startAt' and 'maxResults'
  influence the response. This Pager class allows to sequentially query the next chunk of
  results until all results are available. Note, that 'maxResults' is an upper limit, and
  may be undercut by a lower server-side limit.
  To use the pagination, pass it a function from a jira wrapper, like jira.sprints (to retrieve
  all sprints by board id). The arguments which are usually passed into that function will be
  passed to the Pager class as subsequent args and kwargs.
  Example: create a pager which queries all active sprints:
    pager = Pager(jira.sprints, board_id, state='active')
  In order to retrieve a list of all results, use the Pager.full_result() function.
  The kwargs cannot specify 'startAt', since the pager takes care of that.
  However, you can include 'maxResults' in the kwargs, if you want to restrict the number of
  results per page.
  """
  _callable: Callable[..., ResultList]
  _callable_args: tuple[Any, ...]
  _callable_kwargs: dict[str, Any]

  def __init__(self, jira_func: Callable[..., ResultList], *args, **kwargs) -> None:
      assert 'startAt' not in kwargs, "startAt is handled by pager"
      self._callable = jira_func
      self._callable_args = args
      self._callable_kwargs = kwargs

  def full_result(self) -> list[Any]:
      """
      Uses the initially provided function with the specified args and kwargs to query the
      jira api. This function collects the results by applying pagination, until it reaches the
      last page. The last page is identified by the .isLast property of the query's resultList.
      If the .isLast property is None, it assumes there are no pages, hence returning the initial
      result list.
      """
      all_results = []
      first = self._callable(*self._callable_args, **self._callable_kwargs, startAt=0)
      is_last = first.isLast # .isLast may be True, False or None
      all_results += first

      while is_last is False:  # None means no pagination, so stop after the first call
          new_results = self._callable(*self._callable_args,
                                        **self._callable_kwargs,
                                        startAt=len(all_results))
          assert new_results, "no new results were retrieved"
          all_results += new_results
          is_last = new_results.isLast
      return all_results

The above class could furthermore be extended to provide a function that returns a continuous iterator. This iterator would iterate over all results, and whenever the end of a page is reached, it would issue the next request with an adapted startAt. However, I haven't included it yet, because it would not be transparent to the user which call to __next__() triggers a request and therefore takes longer.
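A minimal sketch of such an iterator, written as a generator rather than as part of the Pager class. The names `iter_results`, `FakeResultList` and `fake_sprints` are hypothetical and only illustrate the idea; the fake list stands in for a ResultList and is assumed to expose the same .isLast property.

```python
from typing import Any, Callable, Iterator


class FakeResultList(list):
    """Hypothetical stand-in for a ResultList: a list with an isLast flag."""

    def __init__(self, items, is_last):
        super().__init__(items)
        self.isLast = is_last


def iter_results(jira_func: Callable[..., Any], *args: Any, **kwargs: Any) -> Iterator[Any]:
    """Yield results one by one; the next page is requested only once the
    current page has been fully consumed."""
    start_at = 0
    while True:
        page = jira_func(*args, **kwargs, startAt=start_at)
        yield from page
        start_at += len(page)
        # Stop on the last page, when isLast is None (no pagination),
        # or when an empty page comes back.
        if getattr(page, "isLast", None) in (True, None) or not page:
            return


def fake_sprints(board_id, startAt=0, maxResults=2):
    """Hypothetical paginated endpoint: five sprints, two per page."""
    data = [f"sprint-{i}" for i in range(5)]
    chunk = data[startAt:startAt + maxResults]
    return FakeResultList(chunk, is_last=startAt + maxResults >= len(data))


sprints = list(iter_results(fake_sprints, 42))
```

With a real client, the call would presumably look like `iter_results(jira.sprints, board_id, state='active')`. The request latency is then hidden behind the first item of every page, which is exactly the transparency concern mentioned above.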

Here is an example of how to use the Pager class to retrieve all active sprints:

active_sprints = Pager(jira.sprints, board_id, state='active').full_result()

adehad commented 3 years ago

There is a _fetch_pages() function that is called in most functions, including jira.sprints, which I think aims to do what you are describing. Have you noticed this not working as intended? Perhaps there is a bug in that function itself. https://github.com/pycontribs/jira/blob/eb80088cd0da1d27043a2d457f2f045725ef97f0/jira/client.py#L576-L601

Doesn't necessarily mean we shouldn't implement this Pager class, but it may allow it to be integrated better into existing functions.

burkestar commented 3 years ago

Although _fetch_pages does handle paginated routes, it doesn't return a generator. Instead, it retrieves all the results one page at a time and then returns all of them in memory to the caller. When working with a large number of issues, it would be more efficient to return a generator that yields one page of results at a time.
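As a rough sketch of that idea (not how _fetch_pages is implemented), a page-level generator could look like the following. `iter_pages` and `fake_search` are hypothetical names, and the fake endpoint assumes a server-side cap of 3 results per page.

```python
from typing import Any, Callable, Iterator, List


def iter_pages(fetch_page: Callable[..., List[Any]], page_size: int = 50) -> Iterator[List[Any]]:
    """Yield one page at a time, so the generator itself never holds more than one page."""
    start_at = 0
    while True:
        page = fetch_page(startAt=start_at, maxResults=page_size)
        if not page:  # an empty page means everything has been consumed
            return
        yield page
        # Advance by what was actually returned; the server may cap maxResults.
        start_at += len(page)


def fake_search(startAt: int = 0, maxResults: int = 50) -> List[str]:
    """Hypothetical search endpoint: 7 issues, at most 3 per page."""
    issues = [f"PROJ-{i}" for i in range(1, 8)]
    return issues[startAt:startAt + min(maxResults, 3)]


page_sizes = [len(page) for page in iter_pages(fake_search, page_size=100)]
```

The caller can process and discard each page before the next request is made. Stopping on an empty page costs one extra request, but it does not rely on the server honouring the requested page size.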

MSKDom commented 2 years ago

Is this really working though?

When running search_issues, the result comes from _fetch_pages, so the returned list should be complete. However, the number of issues returned clearly does not match the list's own total:

issues = jira.search_issues('project = myproject', maxResults=1000)
issue_data = []

for issue in issues:
    issue_data.append(issue)

print(issues.total)  #513
print(len(issue_data))  #100

Update:

Actually, there seems to be a bug with maxResults: on the returned object, maxResults is set to 100 even though I passed 1000.

If I set maxResults=False or None, then it works as expected. Quite unintuitive I must say.
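A small sketch of that workaround. `fetch_all_issues` and `FakeClient` are hypothetical; the fake client only mimics the behaviour reported here (513 issues, a cap of 100 per request) so that the snippet runs without a Jira server.

```python
from typing import Any, List


def fetch_all_issues(client: Any, jql: str) -> List[Any]:
    """Request every matching issue by passing a falsy maxResults."""
    return list(client.search_issues(jql, maxResults=False))


class FakeClient:
    """Hypothetical stand-in for the Jira client, mimicking the behaviour described above."""

    def search_issues(self, jql_str, startAt=0, maxResults=50):
        data = [f"PROJ-{i}" for i in range(1, 514)]
        if not maxResults:
            # Falsy maxResults: all results are returned.
            return data
        # Otherwise a single chunk, capped at 100 regardless of the request.
        return data[startAt:startAt + min(maxResults, 100)]


client = FakeClient()
capped = client.search_issues("project = myproject", maxResults=1000)
complete = fetch_all_issues(client, "project = myproject")
```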

adehad commented 2 years ago

@MSKDom would be great if you can raise this behaviour in a new issue to track it better. I am unable to pick it up at this time, but perhaps another contributor (or even yourself) can further investigate a solution to this behaviour.

bartromgens commented 9 months ago

I created a new issue for this in #1819