Permanent caching for GitHub issues and PRs

zharinov commented 7 months ago

Describe the proposed change(s).

It seems like we have a solution for reliable caching:

Use the /repos/{owner}/{repo}/issues endpoint
Use the since parameter, which works similarly to the If-Modified-Since header
Leverage the fact that an empty result set always has the same ETag value (seems like it's 89d4eb0a983ee9362287f95c869f0afd61584ac481494969bcaedc98e33feb50 across all the repos)
- Although it's an undocumented feature, we could probably rely on it (otherwise, it's problematic to combine the since query param with If-None-Match: <etag> in a way that ever yields a 304)
- As an alternative, we could sync to the latest date until the result is empty, store the empty result's ETag, and perform the next sync only on a 200 status. It's one of the more complicated solutions I can think of, but maybe it's safer
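
A minimal sketch of how the since parameter and the "empty result" ETag could be combined into one conditional request. The helper names (buildIssuesSyncRequest, nothingChanged) are hypothetical, and whether the stored ETag needs quotes or a weak-validator prefix is an assumption: the safest approach is to replay the ETag header value exactly as GitHub returned it.

```typescript
// Hash the thread reports as the ETag of an empty result set (undocumented).
const EMPTY_RESULT_HASH =
  '89d4eb0a983ee9362287f95c869f0afd61584ac481494969bcaedc98e33feb50';

interface ConditionalRequest {
  url: string;
  headers: Record<string, string>;
}

// Hypothetical helper: asks "has anything changed since `since`?".
// `etag` should be the ETag header value exactly as it was received.
function buildIssuesSyncRequest(
  owner: string,
  repo: string,
  since: string,
  etag: string,
): ConditionalRequest {
  const query = `state=all&per_page=100&since=${encodeURIComponent(since)}`;
  return {
    url: `https://api.github.com/repos/${owner}/${repo}/issues?${query}`,
    headers: { 'If-None-Match': etag },
  };
}

// A 304 means the result set still matches the "empty" ETag,
// i.e. nothing was modified after `since`. A 200 means a sync is needed.
function nothingChanged(status: number): boolean {
  return status === 304;
}
```

The design relies entirely on the undocumented observation above; if the empty-result ETag ever changes, every request would return 200 and the cache would simply sync more often than necessary.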

rarkins commented 7 months ago

This would mean that we use the issues endpoint to fetch both issues and pulls. Some things to consider:

The 10*100 = 1000 limit starts to become a constraint. On the other hand, some repos could potentially have so many historical PRs that we're at risk of exhausting rate limits. Ideally we'd find a way to fetch and permanently cache all
Issues can be deleted, although pulls usually can't. We should make sure that doesn't break us permanently. It is OK if we hit a temporary error, e.g. when the user deleted a dependency dashboard issue and we can't fetch it, as long as we can detect that failure and recover
I now have a faint memory that GitHub admins can possibly delete pulls if a user requests it due to it containing a data leak. But not certain of this - maybe they delete the commit/blob instead.
Good idea to always use the etag for "empty", assuming it means that you get back a 304 if nothing modified since the last fetch, and a 200 if anything's modified
We need to think how to handle the case where a user transfers bot accounts and wants Renovate to understand this and not recreate or duplicate PRs. In such a case we need to either fetch every single issue/PR, or allow a list of old accounts to fetch

To handle the scenario where there's a huge number of old pulls, possibly tripping the rate limit, we could do this when initially populating the cache:

Fetch the issues endpoint sorted by oldest first
If we get an error part way through fetching (e.g. rate limiting) then we still save what we have so far, including the date of modification of the most recent issue/pull we fetched
On the next run, we reuse the partial cache and keep going, setting since= last modified PR (kind of same as a run with a fully populated cache)
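
The three steps above can be sketched as a resumable sync loop. This is only an illustration under stated assumptions: fetchPage is a hypothetical wrapper around the issues endpoint (sorted by update time, oldest first), it is synchronous here purely to keep the sketch short, and IssueCache is an invented shape, not Renovate's actual cache format.

```typescript
interface IssueItem {
  number: number;
  updated_at: string; // ISO 8601 UTC timestamp, e.g. "2024-01-02T00:00:00Z"
}

interface IssueCache {
  items: Record<number, IssueItem>;
  since: string | undefined; // updated_at of the newest item stored so far
}

// Hypothetical page fetcher: would wrap GET /repos/{owner}/{repo}/issues
// with state=all, sort=updated, direction=asc and the given `since`.
type FetchPage = (since: string | undefined, page: number) => IssueItem[];

function syncIssues(
  cache: IssueCache,
  fetchPage: FetchPage,
  perPage = 100,
): { cache: IssueCache; complete: boolean } {
  const items: Record<number, IssueItem> = { ...cache.items };
  let newest = cache.since;
  let complete = false;
  try {
    for (let page = 1; ; page += 1) {
      const batch = fetchPage(cache.since, page);
      for (const item of batch) {
        items[item.number] = item; // upsert, so re-fetched items are harmless
        // Same-format ISO timestamps compare correctly as strings.
        if (newest === undefined || item.updated_at > newest) {
          newest = item.updated_at;
        }
      }
      if (batch.length < perPage) {
        complete = true; // a short page means we reached the end
        break;
      }
    }
  } catch {
    // e.g. rate limiting: keep what we have and resume on the next run
  }
  return { cache: { items, since: newest }, complete };
}
```

Because items are keyed by number, resuming with since set to the last stored timestamp can safely re-fetch the boundary item, which makes a partially populated cache behave much like a fully populated one on the next run.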

zharinov commented 7 months ago

Is it correct that the only issue Renovate is interested in is its own Dependency Dashboard? I.e. we fetch all the issues only to search for this one?

rarkins commented 7 months ago

We care about any issue we've created, which I think is limited to Dependency Dashboard and config warnings. And we should be filtering based on creator, which means created by Renovate.

zharinov commented 7 months ago

Looks like it's a small fraction of all items, compared to PRs. So it's probably not a big issue (pun intended) to additionally verify each findIssue() result, right before returning it, by performing a GET request (with an If-None-Match: <ETag> header).

UPD. And yes, ETags returned with POST/PATCH requests won't work with GET/HEAD

rarkins commented 7 months ago

We have gitIgnoredAuthors for when the git committer changes (including a bot rename), but that's not directly relevant here. The one which is relevant is ignorePrAuthor, which is a bit different: instead of an explicit list like in gitIgnoredAuthors, it matches anything. Today for GitHub that setting means we don't filter PRs by username and fetch all. I wonder if we should deprecate or remove that setting and instead require a list of other usernames to be more efficient (IF it's more efficient - maybe it needs a query per username).
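
To illustrate the difference between "match anything" and an explicit list of usernames, here is a small client-side sketch. The function name and the PrSummary shape are hypothetical, and it deliberately sidesteps the open question of whether server-side filtering would need one query per username.

```typescript
interface PrSummary {
  number: number;
  author: string;
}

// `authors === null` mimics the current "match anything" behaviour;
// an explicit list keeps only PRs created by those accounts.
function filterPrsByAuthors(
  prs: PrSummary[],
  authors: string[] | null,
): PrSummary[] {
  if (authors === null) {
    return prs;
  }
  const allowed = new Set(authors);
  return prs.filter((pr) => allowed.has(pr.author));
}
```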

zharinov commented 7 months ago

Actually, we can't use /issues endpoint as the single source of truth for the cached PRs. The reason is simple: it doesn't contain all the fields we need, so we still have to reach /pulls endpoint.

zharinov commented 7 months ago

The best we could probably do is to query the most recently updated issue during platform init, and infer the "dirtiness" of the cached issues from it.
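
One way to read this idea, as a sketch: compare the update timestamp of the newest cached issue with that of the newest issue reported by the platform. The function name is hypothetical, and the check is an assumption about what "dirtiness" would mean here; note that it cannot detect deletion of an issue that was not the most recently updated one.

```typescript
// Returns true when the cached issues can no longer be trusted.
// Both arguments are ISO 8601 timestamps, or undefined when there are no issues.
function issueCacheIsDirty(
  cachedNewest: string | undefined,
  remoteNewest: string | undefined,
): boolean {
  if (cachedNewest === undefined || remoteNewest === undefined) {
    // Dirty unless both sides agree that there are no issues at all.
    return cachedNewest !== remoteNewest;
  }
  // Any difference counts: newer means an update, older suggests a deletion.
  return Date.parse(cachedNewest) !== Date.parse(remoteNewest);
}
```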

zharinov commented 7 months ago

Another option:

We could cache the issues GraphQL response
We reuse the current PR cache
We use the /issues endpoint to trigger syncing of these caches
- If the response has fewer than 100 items (not paginated):
- Changed issue items could be inserted into the cache without any new GraphQL queries
- From PR data, we could determine whether changes are internal (i.e. present in the cache) or external (need to be synced)
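
A rough sketch of how this option could turn one /issues response into a sync plan. It assumes the response items carry a pull_request key when they are PRs, that the PR cache can be summarised as a map from PR number to its cached update timestamp, and that equal timestamps mean the change was "internal". All names here are hypothetical.

```typescript
interface ChangedItem {
  number: number;
  updated_at: string;
  pull_request?: unknown; // present only when the item is a pull request
}

interface SyncPlan {
  fullResync: boolean; // response was paginated, so fall back to a full sync
  issuesToUpsert: ChangedItem[]; // can go straight into the issue cache
  prsToRefetch: number[]; // external changes that need a /pulls sync
}

function planSync(
  changed: ChangedItem[],
  prCacheUpdatedAt: Map<number, string>,
  pageSize = 100,
): SyncPlan {
  if (changed.length >= pageSize) {
    return { fullResync: true, issuesToUpsert: [], prsToRefetch: [] };
  }
  const issuesToUpsert: ChangedItem[] = [];
  const prsToRefetch: number[] = [];
  for (const item of changed) {
    if (item.pull_request === undefined) {
      issuesToUpsert.push(item);
    } else if (prCacheUpdatedAt.get(item.number) !== item.updated_at) {
      // Unknown PR or a timestamp we did not record: an external change.
      prsToRefetch.push(item.number);
    }
  }
  return { fullResync: false, issuesToUpsert, prsToRefetch };
}
```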

rarkins commented 7 months ago

No since param supported here: https://docs.github.com/en/rest/pulls/pulls?apiVersion=2022-11-28#list-pull-requests

So I guess we keep the getPrList() function as-is using /pulls REST API, using If-Modified-Since header and hopefully getting plenty of 304s.
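
A minimal sketch of the If-Modified-Since flow for the PR list, assuming the cache stores the Last-Modified header value from the previous response alongside the items. The CachedPrList shape and both function names are hypothetical, not Renovate's actual code.

```typescript
interface CachedPrList<T> {
  lastModified: string | undefined; // Last-Modified value from the previous response
  items: T[];
}

// Send the conditional header only when we have a previous value to replay.
function conditionalHeaders(
  lastModified: string | undefined,
): Record<string, string> {
  return lastModified === undefined
    ? {}
    : { 'If-Modified-Since': lastModified };
}

// On 304 the cached list is reused as-is; otherwise it is replaced.
function applyPrListResponse<T>(
  cache: CachedPrList<T>,
  status: number,
  lastModified: string | undefined,
  body: T[],
): CachedPrList<T> {
  if (status === 304) {
    return cache;
  }
  return { lastModified, items: body };
}
```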

For Issues, we could consider fetching them as part of the initRepo graphql query, I think the most we'd need is maybe 2-4 sorted by recently modified. Normal case is we'd have one dependency dashboard open, and zero or one closed config warning issues.
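
For illustration, the issue part of such a query might look roughly like the fragment below. This is a sketch, not the actual initRepo query: the constant and function names are hypothetical, the selected fields are a guess at what would be needed, and the limit of 4 is taken from the estimate above.

```typescript
// Hypothetical standalone query; in practice this selection would be merged
// into the existing initRepo GraphQL query.
const recentIssuesQuery = `
query ($owner: String!, $name: String!, $creator: String!) {
  repository(owner: $owner, name: $name) {
    issues(
      first: 4
      orderBy: { field: UPDATED_AT, direction: DESC }
      filterBy: { createdBy: $creator }
    ) {
      nodes {
        number
        title
        state
        updatedAt
      }
    }
  }
}`;

// Variables for the query above; `creator` is the bot's own username.
function recentIssuesVariables(
  owner: string,
  name: string,
  creator: string,
): Record<string, string> {
  return { owner, name, creator };
}
```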

zharinov commented 7 months ago

I don't think there will be plenty of 304s. I've just checked, and it worked zero times out of 10–20 attempts

rarkins commented 7 months ago

I get a much higher success rate, for example got a 304 just now for a repo I last ran a few days ago

renovate-release commented 6 months ago

:tada: This issue has been resolved in version 37.289.0 :tada:


renovatebot / renovate

Permanent caching for GitHub issues and PRs #27641
