renovatebot / renovate

Home of the Renovate CLI: Cross-platform Dependency Automation by Mend.io
https://mend.io/renovate
GNU Affero General Public License v3.0

Use intelligent pagination caching for datasources #15349

Closed rarkins closed 1 year ago

rarkins commented 2 years ago

What would you like Renovate to be able to do?

Use intelligent pagination caching for popular datasources, especially github tags/releases and Docker tags. Then we can maybe remove pagination limits.

If you have any ideas on how this should be implemented, please tell us here.

It should be implemented in a similar manner to how we did caching for GitHub PRs. e.g. if it's possible to sort by a last modified field then reuse the existing cache (even if "expired") and only retrieve as much as necessary on subsequent requests.
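A minimal sketch of that idea (the `Item` shape and function names here are illustrative, not Renovate's actual cache code): if results are sorted newest-first by a last-modified field, we can stop paginating as soon as a page no longer overlaps the uncached window, then merge the fresh items into the stale cache.

```typescript
interface Item {
  id: string;
  lastModified: string; // ISO timestamp, so string comparison sorts correctly
}

// Merge freshly fetched (newest-first) items into the cached list,
// replacing any cached entries with the same id.
function mergeIntoCache(cached: Item[], fetched: Item[]): Item[] {
  const seen = new Set(fetched.map((i) => i.id));
  return [...fetched, ...cached.filter((i) => !seen.has(i.id))];
}

// Decide whether another page is needed: stop once the oldest item on the
// current page is not newer than the newest cached timestamp.
function needsNextPage(page: Item[], newestCached: string | null): boolean {
  if (page.length === 0) {
    return false; // API exhausted
  }
  if (newestCached === null) {
    return true; // cold cache: keep paginating until the API runs out
  }
  const oldestOnPage = page[page.length - 1].lastModified;
  return oldestOnPage > newestCached; // page still entirely newer than cache
}
```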

Notes:

Is this a feature you are interested in implementing yourself?

No

zharinov commented 2 years ago

It seems the releases/tags results cannot be ordered by the updated_at field, though the default sorting looks good enough to me, and smart pagination is still relevant.

zharinov commented 2 years ago

The approach I'm about to try:

Both caches are meant to reset after some prolonged period, maybe 24 hours or so.

rarkins commented 2 years ago

I'd like to see the github-tags approach with ls-remote initially. Let's test and deploy that (need to make sure it works with private repos and custom GHE endpoints too)
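As a rough illustration of the ls-remote idea (this is a hypothetical parser, not Renovate's implementation), the output of `git ls-remote --tags <url>` lists every tag in a single round trip, one `<sha>\t<ref>` line each, and can be reduced to tag → commit pairs:

```typescript
// Parse `git ls-remote --tags <url>` output into tag name → commit hash.
// Annotated tags appear twice: once as the tag object, once as a peeled
// "^{}" entry pointing at the underlying commit; the peeled entry wins.
function parseLsRemoteTags(output: string): Record<string, string> {
  const tags: Record<string, string> = {};
  for (const line of output.trim().split('\n')) {
    const [hash, ref] = line.split('\t');
    if (!ref?.startsWith('refs/tags/')) {
      continue; // skip blank lines and non-tag refs
    }
    const name = ref.slice('refs/tags/'.length).replace(/\^\{\}$/, '');
    tags[name] = hash; // peeled entry overwrites the tag-object hash
  }
  return tags;
}
```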

zharinov commented 2 years ago

Do we actually use the isStable flag for the github-tags datasource? It seems we have an option to fetch releaseTimestamp via GraphQL, though we still can't access the prerelease flag on which githubRelease.isStable is based.

My point is that if we only need releaseTimestamps for tags, then hopefully we can obtain them from the commit info via GraphQL, together with the tag name and commit hash.

zharinov commented 2 years ago

Another 3-level design:

This is based on the possibly wrong premise that we don't actually need timestamps for older items.

rarkins commented 2 years ago

We want timestamps for all. And I think isStable is important for GitHub Actions, where some actions have occasionally had pre-releases published with stable-looking semver versions.

zharinov commented 2 years ago

Every release has a corresponding tag, but not every tag has a corresponding release. This means our current implementation doesn't guarantee a releaseTimestamp field for every tag. We can achieve this using GraphQL:

refs(
  refPrefix: "refs/tags/"
  first: 10
  orderBy: {field: TAG_COMMIT_DATE, direction: DESC}
) {
  nodes {
    version:name
    target {
      ... on Commit {
        hash: oid
        releaseTimestamp: committedDate
      }
    }
  }
}

We still need to obtain isStable flag values from the github-releases datasource (hopefully we can optimize this too).

zharinov commented 2 years ago

Initial fetching via GraphQL would carry a penalty for repos with a long list of refs, but should be fine once the cache is populated.

rarkins commented 2 years ago

So we couldn't detect tag deletions this way, right? Would need a periodic full fetch with cold cache?

zharinov commented 2 years ago

Yes, it's a problem for both releases and tags, so I think the remedy will be similar.

zharinov commented 2 years ago

Though it could be implemented as a gradual process, i.e. checking and reconciling just one page of previously stored results per run. Not sure about this yet.

rarkins commented 2 years ago

Does GraphQL have any "maximum 10 pages" limitation or can we use it to fetch 100 per page until we have all?

zharinov commented 2 years ago

I don't think it would limit us. However, unlike REST, we have to fetch pages sequentially because of cursor-based pagination mechanics.
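Sketched in TypeScript (generic shapes, not Renovate's actual client), the constraint is that each request needs the endCursor from the previous page, so pages must be awaited one at a time:

```typescript
// Minimal cursor-pagination loop: the next request cannot be issued until
// the previous response's endCursor is known.
interface PageResult<T> {
  nodes: T[];
  pageInfo: { hasNextPage: boolean; endCursor: string | null };
}

type FetchPage<T> = (cursor: string | null) => Promise<PageResult<T>>;

async function fetchAll<T>(fetchPage: FetchPage<T>): Promise<T[]> {
  const all: T[] = [];
  let cursor: string | null = null;
  for (;;) {
    const page = await fetchPage(cursor); // must resolve before the next call
    all.push(...page.nodes);
    if (!page.pageInfo.hasNextPage) {
      break;
    }
    cursor = page.pageInfo.endCursor;
  }
  return all;
}
```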

rarkins commented 2 years ago

So using GraphQL the idea would be the following?

- If no cache: fetch 100 per page sequentially until done.
- If there is a cache and the short-term expiry time (e.g. 30 minutes) hasn't been reached: use the cache.
- If there is a cache and the short-term expiry has been hit: fetch 100 per page until some date limit is hit (e.g. one month). Merge any new data with the old (including missing tags) and overwrite the existing cache.
- If the long-term expiry (e.g. one week) has been hit: treat it like the "no cache" scenario?

The result being that we'd perform on average ~one page of fetching every 30 minutes, compared to today, when we fetch up to 10 pages every time the cache expires?
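The tiered decision above can be sketched as follows (a hypothetical helper with illustrative thresholds, not Renovate's actual values or code):

```typescript
// Pick a refresh strategy from the cache's age, per the scenarios above.
type Strategy = 'use-cache' | 'incremental-refresh' | 'full-refresh';

const SOFT_TTL_MINUTES = 30; // short-term expiry: serve straight from cache
const HARD_TTL_MINUTES = 7 * 24 * 60; // long-term expiry: full re-fetch

// cacheAgeMinutes is null when there is no cache at all.
function refreshStrategy(cacheAgeMinutes: number | null): Strategy {
  if (cacheAgeMinutes === null || cacheAgeMinutes >= HARD_TTL_MINUTES) {
    return 'full-refresh'; // no cache, or cache too old to trust
  }
  if (cacheAgeMinutes < SOFT_TTL_MINUTES) {
    return 'use-cache';
  }
  return 'incremental-refresh'; // fetch recent pages only, merge into cache
}
```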

zharinov commented 2 years ago

Sounds good to me

rarkins commented 2 years ago

Does the same work for releases?

zharinov commented 2 years ago

Yes, hopefully it'll share common logic

saibotk commented 2 years ago

Be aware that changing such fundamental logic in these datasources leads to non-semver-compliant Renovate releases. After some digging through the recent changes in this project, I found that the change in #15645 broke my existing config. This is due to the implicit switch from GitHub's REST API to the GraphQL API for releases/tags introduced in that PR. Since my config does not have a GitHub token associated with it, it can no longer fetch tags and releases.

Is there any interest in providing a way to still use tags/releases without a token? Otherwise, I'd just like to flag the breaking change here, so that future changes like these can be reviewed more carefully and trigger a major version bump.

As a workaround, I successfully used the git-tags datasource and provided the full git URL.
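For anyone hitting the same limitation, that workaround could look roughly like this in a regex manager (a sketch only: the file name, match pattern, and repo URL are placeholders, and field names reflect Renovate's config schema at the time):

```json
{
  "regexManagers": [
    {
      "fileMatch": ["^versions\\.yaml$"],
      "matchStrings": ["tool_version: (?<currentValue>.*)"],
      "depNameTemplate": "https://github.com/some-org/some-repo",
      "datasourceTemplate": "git-tags"
    }
  ]
}
```

The git-tags datasource shells out to git rather than calling the GitHub API, so no token is needed.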