telia-oss / github-pr-resource

Github pull request resource for Concourse
MIT License
182 stars 5 forks source link

Extremely high V3 API usage when using paths filtering #142

Open YenTheFirst opened 5 years ago

YenTheFirst commented 5 years ago

Given (1) a github repo with two files:

- frequently_updated_file.txt
- rarely_updated_file.txt

and (2) a resource that is configured to emit new versions of PRs that modify rarely_updated_file.txt:

- name: pr-branch
  type: pull-request
  source:
    access_token: ((github-token))
    paths:
    - rarely_updated_file.txt
    repository: example-user/example-repo

and (3) a large number (30+) of open PRs affecting frequently_updated_file.txt, but not rarely_updated_file.txt,

then, every instance of a check of this resource will result in a very large number of V3 API calls. These calls seem to specifically be related to checking for modified files, https://github.com/telia-oss/github-pr-resource/blob/master/github.go#L181. Adding a debug line after this response indicates the response.Rate decrementing multiple times per check

We've run in to this issue when trying to use this resource-type on several pipelines for projects that live in a monorepo, which use paths: to limit to PRs that affect a certain pipeline's project. Rarely updated projects end up continually re-polling every unrelated new PR to check for modifications.

This seems thorny, and I'm not sure how to work around it, short of emitting occasional "fake" versions that the downstream pipeline would need to check for and ignore. It seems like a problem with Concourse's resource API design?

YenTheFirst commented 5 years ago

On further investigation, I believe this can be mitigated by caching github API responses in the persistent Check container.

itsdalmo commented 5 years ago

Hi @YenTheFirst and thanks for #143 , sorry I've taken so long to respond.

[...] every instance of a check of this resource will result in a very large number of V3 API calls. These calls seem to specifically be related to checking for modified files [...]

This is correct. When using paths and ignore_paths each discovered version (new/latest commit to an open PR) will require at least N V3 API calls where N >= 1 (its one API call per 100th changed file for pagination).

Unfortunately I don't think there are any good ways to remedy this either. Unless Github's GraphQL API allows us to filter the PRs by modified files directly, we'll always need to get a complete list of the changed files in order to do the filtering on our end. Even if we use the PullRequestChangedFile connection for the PullRequest object in the GraphQL API, the cost will be the same, it will just count against a different rate limit (so arguably it's better to keep it split).

That being said, I don't think caching is a good solution either, at least not in the resource itself - perhaps it's possible to write a separate task to do that particular filtering? 🤔

YenTheFirst commented 5 years ago

It looks like caching is how the jtarchie version of this resource handled this problem, and adding caching seems to have entirely resolved the issue for us. we're not using webhooks, though.

was there some poor interaction between caching and webhooks that motivated this resource to not use caching?

On Wed, Sep 4, 2019, 05:22 Kristian notifications@github.com wrote:

Hi @YenTheFirst https://github.com/YenTheFirst and thanks for #143 https://github.com/telia-oss/github-pr-resource/pull/143 , sorry I've taken so long to respond.

[...] every instance of a check of this resource will result in a very large number of V3 API calls. These calls seem to specifically be related to checking for modified files [...]

This is correct. When using paths and ignore_paths each discovered version (new/latest commit to an open PR) will require at least N V3 API calls where N >= 1 (its one API call per 100th changed file for pagination).

Unfortunately I don't think there are any good ways to remedy this either. Unless Github's GraphQL API allows us to filter the PRs by modified files directly, we'll always need to get a complete list of the changed files in order to do the filtering on our end. Even if we use the PullRequestChangedFile https://developer.github.com/v4/object/pullrequestchangedfile/ connection for the PullRequest https://developer.github.com/v4/object/pullrequest/ object in the GraphQL API, the cost will be the same, it will just count against a different rate limit (so arguably it's better to keep it split).

That being said, I don't think caching is a good solution either, at least not in the resource itself - perhaps it's possible to write a separate task to do that particular filtering? 🤔

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/telia-oss/github-pr-resource/issues/142?email_source=notifications&email_token=AADMPY26QJONBS7IIR2XEQ3QH6R7TA5CNFSM4IN55BT2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD53MCII#issuecomment-527876385, or mute the thread https://github.com/notifications/unsubscribe-auth/AADMPY57HV63OEMQQGYDBLDQH6R7TANCNFSM4IN55BTQ .