kubernetes / org

Meta configuration for Kubernetes Github Org
Apache License 2.0

REQUEST: New Shared Read Only GitHub Token For Jobs #4433

Closed · MadhavJivrajani closed this 5 months ago

MadhavJivrajani commented 1 year ago

Context

There has been an uptick in projects seeing flakiness in their tests running in either the GKE or EKS clusters due to the tests hitting GitHub rate limits.

Issues and discussions around this:

It seems like these jobs pull the artifacts/files they need using the GitHub APIs. It's worth noting that cloning repositories itself does not count against the rate limit: https://github.com/orgs/community/discussions/44515.
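For illustration, here is a minimal Go sketch contrasting the two access patterns: pulling a file through the REST API (which counts against the rate limit) versus shallow-cloning the repository (which does not). The repository and file path are placeholders, not taken from any particular job.

```go
package main

import (
	"log"
	"net/http"
	"os/exec"
)

func main() {
	// Option 1: fetch a single file through the GitHub REST API.
	// Every request like this counts against the API rate limit
	// (60/hour unauthenticated, shared per source IP).
	resp, err := http.Get("https://api.github.com/repos/kubernetes/kubernetes/contents/README.md")
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()

	// Option 2: shallow-clone the repository and read the files locally.
	// Git operations over HTTPS do not count against the API rate limit.
	cmd := exec.Command("git", "clone", "--depth=1", "https://github.com/kubernetes/kubernetes.git")
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}
```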

It has also been reported that the rate-limiting issue has been exacerbated by moving jobs to the EKS cluster: https://github.com/kubernetes/org/issues/4165#issuecomment-1676985033. This isn't surprising, since nodes on EKS clusters have private IPs and their traffic egresses through NAT and Internet gateways, so the number of distinct IPs hitting GitHub is very low. There is discussion in SIG K8s Infra around assigning public IPs to nodes, similar to how GKE does it: https://github.com/kubernetes/k8s.io/issues/5759. This would help not only with GitHub rate limits, but also with the rate limits observed while pulling from the Docker registry.

However, moving to public IPs might not be sufficient to address the GitHub rate limit issue for pull-heavy jobs, since some of them have been experiencing this issue even on the GKE cluster (https://github.com/kubernetes/org/issues/4165).

It has also been suggested to use ghproxy to get around this GitHub rate limit issue: https://github.com/kubernetes/org/issues/4165#issuecomment-1525504515. However, the issue with this is that non-Prow clients might have to be significantly changed to adapt to ghproxy (https://github.com/kubernetes/org/issues/4165#issuecomment-1525516684).
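As a rough sketch of the kind of change a non-Prow, go-github-based client would need, the example below points the client's BaseURL at a ghproxy endpoint instead of api.github.com. The in-cluster service address and the library version are assumptions for illustration, not the actual cluster setup.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net/url"

	"github.com/google/go-github/v53/github"
)

func main() {
	client := github.NewClient(nil)

	// Hypothetical in-cluster ghproxy address; the actual service name and
	// port depend on how ghproxy is deployed in the build cluster.
	proxyURL, err := url.Parse("http://ghproxy.default.svc.cluster.local/")
	if err != nil {
		log.Fatal(err)
	}
	// Route all API calls through ghproxy's shared cache instead of
	// hitting api.github.com directly.
	client.BaseURL = proxyURL

	release, _, err := client.Repositories.GetLatestRelease(context.Background(), "kubernetes", "kubernetes")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(release.GetTagName())
}
```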

Proposal

This issue is intended to track the discussion and decision around creating a new read-only GitHub token that can be shared (similar to how some jobs re-use bot tokens) by projects whose jobs make heavy read requests against the GitHub API.

Authenticated requests have a rate limit of 5000 requests/hour/account, which should be a sufficient aggregate limit.
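As a rough sketch of how a job could consume such a shared token, assuming it is exposed as a `GITHUB_TOKEN` environment variable (a hypothetical name; the actual mount or env var name would be decided along with the secret): an authenticated request is attributed to the token's account rather than the node's shared egress IP, and the `/rate_limit` endpoint reports the remaining quota.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	req, err := http.NewRequest("GET", "https://api.github.com/rate_limit", nil)
	if err != nil {
		log.Fatal(err)
	}
	// GITHUB_TOKEN is a hypothetical env var name for the shared read-only token.
	req.Header.Set("Authorization", "token "+os.Getenv("GITHUB_TOKEN"))
	req.Header.Set("Accept", "application/vnd.github+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	// Authenticated requests are counted against the token's account
	// (5000/hour) rather than the node's shared egress IP.
	fmt.Printf("remaining this hour: %s\n%s\n", resp.Header.Get("X-RateLimit-Remaining"), body)
}
```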

Prior art: https://github.com/kubernetes/k8s.io/pull/4259

/sig k8s-infra testing contributor-experience
/area github-management
/cc @ameukam @xmudrii @kubernetes/owners

sbueringer commented 1 year ago

Thank you very much for opening this issue.

Just a bit more context: in Cluster API we have had flaky jobs for the last 1-2 years, but only at a rate of roughly <5%, so we didn't push the GitHub token issue with the highest priority (so I assume more IPs alone wouldn't help, as you wrote).

sbueringer commented 1 year ago

If I'm connecting the dots correctly, we would provide the token to the ProwJobs via ExternalSecrets. As far as I'm aware, ExternalSecrets are not yet available on the EKS clusters (but there is, or will be, a discussion about that, cc @ameukam).

xmudrii commented 1 year ago

> If I'm connecting the dots correctly, we would provide the token to the ProwJobs via ExternalSecrets. As far as I'm aware, ExternalSecrets are not yet available on the EKS clusters (but there is, or will be, a discussion about that, cc @ameukam).

ExternalSecrets are available in the EKS Prow build cluster. We can source secrets from AWS Secrets Manager in the Prow AWS account and from GCP Secret Manager in the k8s-infra-prow-build account.

xmudrii commented 1 year ago

Update: we migrated all nodes in the EKS Prow build cluster to a public subnet, so all nodes have public IP addresses instead of routing all traffic via a NAT Gateway. That should significantly improve the situation, but if you still see increased failure rate due to rate limits, please let us know.

Priyankasaggu11929 commented 12 months ago

Just adding here for the record: discussion from the K8s Slack channel #sig-k8s-infra around "Nodes are randomly freezing and failing" - https://kubernetes.slack.com/archives/CCK68P2Q2/p1693476605123389

ameukam commented 11 months ago

We should probably create a new bot (k8s-contribex-ci-robot?) operated by the GitHub admin team to provide tokens requested by the community.

mrbobbytables commented 11 months ago

I don't think we need a separate account for this. With the changes made to the EKS cluster and the switch to authenticated requests, we realistically won't have other requests. We'll just have this one option for people to use for authenticated read-only requests.

MadhavJivrajani commented 11 months ago

I think we can create a new account if more requests like this come up. @ameukam can we generate a token from the k8s-infra-ci-robot for this? The governance of this token can be under the purview of github-admins if needed. Thoughts?

k8s-triage-robot commented 7 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 6 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 5 months ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/org/issues/4433#issuecomment-2025206923):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.