jenkins-infra / helpdesk


Bad gateway message fails some ci.jenkins.io builds #4204

Closed by MarkEWaite 3 weeks ago

MarkEWaite commented 1 month ago

Service(s)

ci.jenkins.io

Summary

https://ci.jenkins.io/job/Infra/job/pipeline-steps-doc-generator/job/PR-468/1/console failed to build with the following report:

13:20:22  Caused by: org.apache.maven.project.DependencyResolutionException: Could not resolve dependencies for project org.jenkins-ci:pipeline-steps-doc-generator:jar:1.0-SNAPSHOT
13:20:22  dependency: org.jenkins-ci.plugins.workflow:workflow-api:jar:1322.v857eeeea_9902 (compile)
13:20:22    Could not transfer artifact org.jenkins-ci.plugins.workflow:workflow-api:jar:1322.v857eeeea_9902 from/to azure-proxy (https://repo.azure.jenkins.io/): status code: 502, reason phrase: Bad Gateway (502)
13:20:22  dependency: org.jenkins-ci.plugins:scm-api:jar:690.vfc8b_54395023 (compile)
13:20:22    Could not transfer artifact org.jenkins-ci.plugins:scm-api:jar:690.vfc8b_54395023 from/to azure-proxy (https://repo.azure.jenkins.io/): status code: 502, reason phrase: Bad Gateway (502)

Reproduction steps

  1. Open the failing build and confirm that it failed because a dependency could not be resolved from https://repo.azure.jenkins.io/

basil commented 1 month ago

https://github.com/jenkinsci/acceptance-test-harness/pull/1644 is failing with similar errors even after a retry

dduportal commented 1 month ago

Thanks for raising this issue and for the details, folks! Datadog also indicates that the ACP had issues between 06:00 pm UTC and 08:00 pm UTC yesterday (30 July 2024).

Checking the logs in Datadog shows a lot of HTTP/502 errors in that time window. Each of the 651 HTTP/502 errors reported the following:

22#22: *265332 upstream timed out (110: Operation timed out) while connecting to upstream

The errors are spread across the 2 ACP services:

dduportal commented 1 month ago

A few metric collections for yesterday's time window:

Public ACP

dduportal commented 1 month ago

Private ACP

dduportal commented 1 month ago

What to do from here:

=> Scaling it up won't change anything (the resource is shared between ingress and outbound traffic) => we should find a solution to use only the private ACP and decommission the public ACP.

dduportal commented 1 month ago

Now that https://github.com/jenkins-infra/helpdesk/issues/4206 has been fixed, the ACP in the publick8s cluster should behave better.

Next steps:

dduportal commented 1 month ago

Update on the private-only ACP:

Next steps:

  1. Persist the role assignment and NSG rules in Terraform (a sketch follows this list).
  2. Check whether we can use a Private Link Service + endpoint defined in Terraform (with a DNS record), as per https://learn.microsoft.com/en-us/azure/aks/internal-lb?tabs=set-service-annotations#create-a-private-endpoint-to-the-private-link-service. At first sight the Kubernetes Service takes care of creating and managing the PLS; I need to try creating one in Terraform and specifying it to see whether it reconciles or not.
  3. Once we have a PLS or a static IP, set up the DNS record (and update the NSG rules if the IP changed from step 1).
  4. Set up the ci.jenkins.io Azure VM agents to use this new DNS name with HTTP on port 8080 (instead of the public ACP over HTTPS).
  5. Verify it works. If it does, deprovision the public ACP from publick8s.
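
For illustration only, here is a minimal Terraform sketch of what step 1 could look like. Every resource name, resource group, CIDR and IP below is a placeholder invented for the example, not the actual jenkins-infra configuration:

```hcl
# Sketch only: all names, CIDRs and IPs are placeholders.

# Step 1a: let the AKS cluster identity manage an internal LB in the agents
# subnet ("Network Contributor" is what AKS needs on a subnet it does not own).
resource "azurerm_role_assignment" "publick8s_on_ci_agents_subnet" {
  scope                = azurerm_subnet.ci_agents.id
  role_definition_name = "Network Contributor"
  principal_id         = azurerm_kubernetes_cluster.publick8s.identity[0].principal_id
}

# Step 1b: allow the agents subnet to reach the internal ACP on port 8080.
resource "azurerm_network_security_rule" "allow_internal_acp_http" {
  name                        = "allow-internal-acp-8080"
  priority                    = 4000
  direction                   = "Inbound"
  access                      = "Allow"
  protocol                    = "Tcp"
  source_port_range           = "*"
  destination_port_range      = "8080"
  source_address_prefix       = "10.0.0.0/24" # placeholder: agents subnet CIDR
  destination_address_prefix  = "10.0.0.253"  # placeholder: internal LB IP
  resource_group_name         = azurerm_resource_group.ci_network.name
  network_security_group_name = azurerm_network_security_group.ci_agents.name
}
```

The role assignment is what allows the AKS cloud controller to create an internal load balancer in a subnet outside its own node resource group.
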
dduportal commented 1 month ago

Update:

dduportal commented 1 month ago

Update:

Proposal: let's specify the IP for the internal LB on the Terraform side and feed it to both Terraform and kubernetes-management. => The constraints for selecting the proper IP are:

dduportal commented 1 month ago

Update:

* I failed to solve the chicken-and-egg problem between Terraform and kubernetes-management:

  * With a PLS (and LB) managed by the Kubernetes Service, Terraform needs to specify the Azure RM permissions (before creating the LB) and the Private Endpoint with DNS configuration and NSG rules (but _after_ LB creation). If the LB changes its name or setup on the Kubernetes side, Terraform will start failing because the data source won't be updated.

    * I was successful, though, with a PLS in the Kubernetes node resource group (the `MC...` RG) along with its NIC, and defining in Terraform a PLS data source with an associated endpoint in the other subnet, a private DNS `A` record in the private DNS zone of the VNet, and NSG rules (see the sketch after this list). This setup could be useful if we ever need to access this ACP from other VNets in Azure.
  * With a simple "internal LB", we still need Terraform to create the DNS record and NSG rules. And installing `external-dns` only to manage a private record looks like overkill.
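
For reference, a rough sketch (with made-up names) of the PLS variant described above, where Terraform only reads back the Kubernetes-managed PLS as a data source and then builds the endpoint and DNS record on top of it:

```hcl
# Placeholder names throughout; the node resource group, DNS zone and
# subnets are not the real jenkins-infra ones.

# The PLS itself is created by the AKS cloud controller in the node
# resource group, so Terraform can only reference it as a data source.
data "azurerm_private_link_service" "acp" {
  name                = "pls-acp"                      # placeholder: name generated on the Kubernetes side
  resource_group_name = "MC_example_publick8s_eastus2" # placeholder node RG
}

# Private endpoint in the agents subnet, connected to the PLS.
resource "azurerm_private_endpoint" "acp" {
  name                = "acp-internal"
  location            = azurerm_resource_group.ci_network.location
  resource_group_name = azurerm_resource_group.ci_network.name
  subnet_id           = azurerm_subnet.ci_agents.id

  private_service_connection {
    name                           = "acp-internal"
    private_connection_resource_id = data.azurerm_private_link_service.acp.id
    is_manual_connection           = false
  }
}

# Private `A` record in the VNet's private DNS zone so agents can resolve
# the endpoint by name.
resource "azurerm_private_dns_a_record" "acp" {
  name                = "acp"
  zone_name           = azurerm_private_dns_zone.ci_vnet.name
  resource_group_name = azurerm_resource_group.ci_network.name
  ttl                 = 300
  records             = [azurerm_private_endpoint.acp.private_service_connection[0].private_ip_address]
}
```

If the Kubernetes side ever recreates or renames the PLS, the data source lookup breaks, which is exactly the coupling described in the first sub-point.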

Proposal: let's specify the IP for the internal LB on the Terraform side and feed it to both Terraform and kubernetes-management (a Terraform sketch follows this list). => The constraints for selecting the proper IP are:

* Make sure it is in the same subnet as the ci.jenkins.io VM agents (easy peasy: we have the CIDR!)

* Make sure it is available: I chose the antepenultimate IP of the CIDR, as the Azure VM Jenkins plugin tends to select available IPs from the lower part of the CIDR range. It's not guaranteed, but the probability of getting an IP in the upper range is low.

  * Why not the last one? Long-standing networking habit of allocating the last IP of a range to an appliance...
    => This pattern keeps things simple (no PLS, no NIC, no private endpoint, etc.).
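
A rough sketch of the retained pattern, with placeholder names and CIDR: the IP is derived once in Terraform and then shared with kubernetes-management (which pins the internal LoadBalancer Service to it) and with the Azure-side DNS record and NSG rules:

```hcl
locals {
  ci_agents_cidr     = "10.0.0.0/24"                      # placeholder: agents subnet CIDR
  acp_internal_lb_ip = cidrhost(local.ci_agents_cidr, -3) # antepenultimate IP of the range
}

# Private DNS record the agents use instead of the public ACP hostname.
resource "azurerm_private_dns_a_record" "acp_internal" {
  name                = "acp" # placeholder record name
  zone_name           = azurerm_private_dns_zone.ci_vnet.name
  resource_group_name = azurerm_resource_group.ci_network.name
  ttl                 = 300
  records             = [local.acp_internal_lb_ip]
}

# Exposed so the same value can be fed to kubernetes-management, which sets
# it as the loadBalancerIP of the internal Service.
output "acp_internal_lb_ip" {
  value = local.acp_internal_lb_ip
}
```

Since both sides consume the same value, there is no data-source dependency on anything created by Kubernetes.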

Update: started implementation after a successful manual test.

dduportal commented 1 month ago

Update:

=> Tests are in progress; let's wait 2 days to see the results before deprovisioning the public ACP.

dduportal commented 1 month ago

Update: more changes

=> Windows VM agents are now properly using the internal ACP, as verified in https://ci.jenkins.io/job/Plugins/job/jenkins-infra-test-plugin/job/master/246/pipeline-console/?selected-node=151

Next steps:

dduportal commented 4 weeks ago

Update:

=> The (private) ACP is now used, but I was able to reproduce the dreaded `(110: Operation timed out)` error quite quickly with the new workload (example: https://ci.jenkins.io/job/Plugins/job/jenkins-infra-test-plugin/job/master/257/).

It should be improved by https://github.com/jenkins-infra/kubernetes-management/pull/5525 (I did a lot of tests), which not only uses the local kube DNS by default (to let CoreDNS do its work and benefit from the local DNS cache) but also keeps 9.9.9.9 as a fallback.
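
To illustrate the idea (the actual change lives in the Helm values of the kubernetes-management PR above; all names and the image below are placeholders), the resulting pod DNS setup is roughly equivalent to this, expressed through the Terraform kubernetes provider:

```hcl
resource "kubernetes_deployment" "acp" {
  metadata {
    name      = "artifact-caching-proxy" # placeholder
    namespace = "artifact-caching-proxy" # placeholder
  }

  spec {
    selector {
      match_labels = { app = "artifact-caching-proxy" }
    }

    template {
      metadata {
        labels = { app = "artifact-caching-proxy" }
      }

      spec {
        # ClusterFirst sends lookups to CoreDNS first, so the cluster-local
        # DNS cache is used by default...
        dns_policy = "ClusterFirst"

        dns_config {
          # ...while 9.9.9.9 is appended to the pod's resolv.conf as an
          # extra nameserver, acting as a fallback if the first one times out.
          nameservers = ["9.9.9.9"]
        }

        container {
          name  = "acp"
          image = "nginx:stable" # placeholder image
        }
      }
    }
  }
}
```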

dduportal commented 4 weeks ago

Let's see the results after a few days. @MarkEWaite @basil @timja don't hesitate to run big builds in the upcoming days so we can see how the new DNS setup behaves.

I saw impressive results (Linux build time down from 50s to 30s) on the jenkins-infra-team plugin, but it is not really a real-life use case.

We'll check the errors in the logs (Datadog), and I'll look into adding an alert for when such errors occur.

dduportal commented 3 weeks ago

Update:

While this is an improvement, I feel there is still room for more:

dduportal commented 3 weeks ago

For info: https://github.com/jenkins-infra/helpdesk/issues/4241

dduportal commented 3 weeks ago

Update: