taiki-e / install-action

GitHub Action for installing development tools (mainly from GitHub Releases).
Apache License 2.0
261 stars 32 forks source link

Speedup CI manifest #540

Closed NobodyXu closed 3 months ago

NobodyXu commented 3 months ago

CI manifest is quite slow, sometimes taking up to 20m.

I think using a dedicated github token might help, secrets.GITHUB_TOKEN only provides 1000 API hits per hour, where one created by user gets 5000.

Switching to GraphQL might also help, since we can fetch all the artifacts information when fetching the release information.

NobodyXu commented 3 months ago

A more efficient approach would be webhooks, though it can only be created by the repository owner.

We can ask them to help with this project, it would be the most efficient way without having to go through GitHub API, and most GitHub friendly.

The webhook information can be cached on a dedicated github branch, so that the manifest CI won't refetch them.

taiki-e commented 3 months ago

cc @jayvdb

AFAIK this is especially slow since the recent codegen changes (https://github.com/taiki-e/install-action/pull/473).

fetch all the artifacts information when fetching the release information.

Currently, we always call requests that have a limit of 100 releases per request in several times to get all the releases. However, if codegen is called with the "latest" argument, as in CI, it is usually sufficient to look at the latest 100 releases.

Eventually we may need to change the overall logic, I guess there are a number of areas that could be made better without such a big deal.

NobodyXu commented 3 months ago

Eventually we may need to change the overall logic

@taiki-e I think using web hook for certain projects and cache them in a dedicated branch would be much efficient.

cc @sunshowers

taiki-e commented 3 months ago

By the way, in my understanding, one of the actual causes of the "slowness" is that even though the token has reached the rate limit, that fact is not remembered on the codegen side, and it attempts to download with the token that has reached the rate limit each time the download is invoked, repeatedly.

NobodyXu commented 3 months ago

that fact is not remembered on the codegen side, and it attempts to download with the token that has reached the rate limit each time the download is invoked, repeatedly.

The binstalk-downloader crate checks for rate limit

https://github.com/cargo-bins/cargo-binstall/blob/0d7080e6a9e1ca43a37026b52683e367c416c87f/crates/binstalk-downloader/src/remote.rs#L182

and applies a backoff strategy

https://github.com/cargo-bins/cargo-binstall/blob/0d7080e6a9e1ca43a37026b52683e367c416c87f/crates/binstalk-downloader/src/remote/delay_request.rs#L70

NobodyXu commented 3 months ago

We also have crate binstalk-git-repo-api for some GraphQL APIs.

I could add a artifact listing API for specific version, if you need any.

jayvdb commented 3 months ago

AFAIK this is especially slow since the recent codegen changes (https://github.com/taiki-e/install-action/pull/473).

Yes, this is the main cause. It added one github API fetch of each tools https://api.github.com/repos/{repo}, and then also quite a few github_head() calls per tool to find the license files.

A very quick improvement for the license file fetches is to use the existing manifest data to first verify the existing URLs still exist, and fallback to the existing logic which attempt to find the appropriate license files.

We can likely also skip a lot of fetching by storing, in the manifests, a copy of the github headers related to caching, e.g etag , which exists on the API hits, including /releases list, (these provide a weak etag identifier) and on the binary release assets (a strong etag identifier).

Due to the way the code is structured, the /releases list API and release asset fetches are easy to skip if github.com says the content isnt modified.

https://api.github.com/repos/{repo} are not likely to be "not modified" in CI - most of these repos have something change every day, due to fields like created_at, updated_at & pushed_at, and forks, open_issues & watchers. So I wouldnt try (yet) to restructure the code to handle this hit being unmodified. However, switching this to graphql could allow the response that might provide an etag that is more stable, and thus worth restructuring the code to avoid the unnecessary refetches.

NobodyXu commented 3 months ago

A very quick improvement for the license file fetches

For license file, can't we just grab it from the tarball downloaded from crates.io?

The tarball on crates.io is immutable so it can be cached for each release, and I think we already download from crates.io?

jayvdb commented 3 months ago

For license file, can't we just grab it from the tarball downloaded from crates.io?

Yup; good idea.

The tarball on crates.io is immutable so it can be cached for each release, and I think we already download from crates.io?

yes we do.

taiki-e commented 3 months ago

Before https://github.com/taiki-e/install-action/commit/a5ddc5a290b17cc3e60dec98179fce77e41ac975: about 11m - 22m After a5ddc5a290b17cc3e60dec98179fce77e41ac975: 4m

GITHUB_TOKEN wasn't applied properly in the first place (I think it worked at one point, but at some point GitHub may have made the validation more strict).

Closing this issue since this is no longer an urgent issue. (Opened separate issue #546 for license file fetch improvements.) However, I always welcome concrete suggestions/patches for further performance improvements.