Closed by MarkEWaite 3 weeks ago
https://github.com/jenkinsci/acceptance-test-harness/pull/1644 is failing with similar errors even after a retry
Thanks for raising this issue and for the details, folks! Datadog also indicates that ACP had issues between 06:00 pm UTC and 08:00 pm UTC yesterday (30 July 2024).
Checking the logs in Datadog shows there were a lot of HTTP/502 errors in that time window. Each HTTP/502 error (651 precisely) reported the following:
```
22#22: *265332 upstream timed out (110: Operation timed out) while connecting to upstream
```
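As a rough illustration (not the actual Datadog query used here), counting such upstream-timeout occurrences from raw nginx error-log lines could be sketched like this; the sample lines are made up:

```python
import re

# Pattern for the nginx upstream-timeout error reported above; the
# worker/connection IDs ("22#22: *265332") vary per occurrence.
TIMEOUT_RE = re.compile(
    r"upstream timed out \(110: Operation timed out\) while connecting to upstream"
)

def count_upstream_timeouts(log_lines):
    """Count nginx error-log lines reporting an upstream connect timeout."""
    return sum(1 for line in log_lines if TIMEOUT_RE.search(line))

# Hypothetical sample: one timeout line and one unrelated access line.
sample = [
    "22#22: *265332 upstream timed out (110: Operation timed out) while connecting to upstream",
    "22#23: *265333 GET /maven2/org/example/lib.jar HTTP/1.1 200",
]
print(count_upstream_timeouts(sample))  # -> 1
```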
The errors are spread across the 2 ACP services:
A few metrics collected for yesterday's time window:
Public ACP
Nodes:
Disk/network metrics clearly show a period of network activity due to ACP usage in the time window (both in and out), correlated with a peak of disk reads: this confirms it is ACP-related activity.
CPU/memory metrics show almost nothing (i.e. ACP performs well on these 2 metrics). Note: the "tiny" peak in requests is due to a service other than ACP, which was updated (rolling upgrade).
Pods metrics:
ACP alone clearly shows the same activity during the time window, with nominal CPU/memory usage:
Ingress metrics show 2 things:
Almost all of their outbound network rate is due to ACP transmitting data, while only half of their inbound rate is passed on to ACP. Not sure what kind of traffic is not transmitted (hard to tell, other than it being half of the network rate, which might be a low value).
The peak of network activity follows the same pattern as ACP.
Private ACP
Pod metrics show the same time-window activity with a lower average rate (which makes sense, as only ci.jenkins.io container agents are using this one today: there are fewer container builds, but still a few).
Node metrics (excluding the ci.jenkins.io agent nodes) show the same behavior as for the Public ACP: the CPU/memory impact is close to zero, but we clearly see a network rate correlated with this time window (which is expected).
What to do from here:
=> scaling it up won't change anything (shared resource for ingress and outbound) => we should find a solution to only use the private ACP and decommission the public ACP
Now that https://github.com/jenkins-infra/helpdesk/issues/4206 has been fixed, the ACP in the `publick8s` cluster should behave better.
Next steps:
Update on the ACP private only:
I've successfully set up a temporary internal LB with an IP in the ci.jio VM agents subnet to reach the private ACP.
`Network Contributor` on both ephemeral agents and kubernetes subnets:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: acp
  namespace: artifact-caching-proxy
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/azure-load-balancer-resource-group: "public-jenkins-sponsorship"
    service.beta.kubernetes.io/azure-load-balancer-internal-subnet: "public-jenkins-sponsorship-vnet-ci_jenkins_io_agents"
spec:
  type: LoadBalancer
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: http
  selector:
    app.kubernetes.io/instance: artifact-caching-proxy
    app.kubernetes.io/name: artifact-caching-proxy
```
Next steps:
publick8s
Update:
Update:
* I failed to solve the chicken-and-egg problem between Terraform and Kubernetes Management:
  * With a PLS (and LB) managed by the Kubernetes Service, Terraform needs to specify the Azure RM permissions (before creating the LB) and the Private Endpoint with DNS configuration and NSG rules (but _after_ LB creation). If the LB changes its name/setup on the Kubernetes side, then Terraform will start failing because the data source won't be updated.
  * I was successful though, with a PLS in the Kubernetes node resource group (`MC.....` RG) along with its NIC, and defining in Terraform a PLS data source, with an associated endpoint in the other subnet and a private DNS `A` record (and NSGs) in the private DNS zone of the VNet. This setup would potentially be useful if we need to access this ACP from other VNets in Azure.
  * With a simple "internal LB", we still need Terraform to create the DNS record and NSG rules. And installing `external-dns` only to manage a private record looks overkill.
Proposal: let's specify the IP for the internal LB on the Terraform side and feed it to both Terraform and Kubernetes Management. => The constraints to select the proper IP are:
* Make sure it is in the same subnet as the ci.jio VM agents (easy peasy: we have the CIDR!)
* Make sure it is available: I chose the antepenultimate IP of the CIDR, as the Azure VM Jenkins plugin tends to select available IPs from the lower part of the CIDR range. It's not strict, but the probability of it picking an IP in the upper range is low.
* Why not the last one: long habit of network admins allocating the last IP of a range to an appliance...

=> This pattern makes it easy (no PLS, no NIC, no private endpoint, etc.).
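The "antepenultimate IP of the CIDR" choice can be sketched with Python's `ipaddress` module; the CIDR below is illustrative, not the real ci.jio agents subnet:

```python
import ipaddress

# Illustrative subnet; the actual ci.jio VM agents CIDR differs.
subnet = ipaddress.ip_network("10.0.2.0/24")
hosts = list(subnet.hosts())  # usable addresses: 10.0.2.1 .. 10.0.2.254

# Antepenultimate usable IP: the Azure VM agents plugin tends to pick
# from the lower end of the range, and the very last IP is habitually
# reserved for a network appliance.
internal_lb_ip = hosts[-3]
print(internal_lb_ip)  # -> 10.0.2.252
```

The resulting static IP would then be fed to both Terraform (DNS record, NSG rules) and Kubernetes Management (the Service's `loadBalancerIP`).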
Update: started implementation after a successful manual test.
Update:
=> Tests in progress, let's wait 2 days to see the results before deprovisioning public ACP
Update: more changes
=> Windows VM agents are now properly using the internal ACP, as verified in https://ci.jenkins.io/job/Plugins/job/jenkins-infra-test-plugin/job/master/246/pipeline-console/?selected-node=151
Next steps:
Update:
The `id` of a Maven mirror must not have special characters (a `WARNING` message was found in the build logs of Maven builds).
=> Now the (private) ACP is used, but I was able to reproduce the dreaded `(110: Operation timed out)` error quite quickly with the new workload (example: https://ci.jenkins.io/job/Plugins/job/jenkins-infra-test-plugin/job/master/257/).
It should be improved by https://github.com/jenkins-infra/kubernetes-management/pull/5525 (I did a lot of tests), which not only uses the local kube DNS by default (to let CoreDNS do its work and benefit from the local DNS cache) but also keeps using 9.9.9.9 as a fallback.
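The intended fallback behaviour can be sketched generically (this is a hypothetical helper, not the actual CoreDNS/nginx configuration): try the cluster-local resolver first, and only fall back to 9.9.9.9 on failure:

```python
def resolve_with_fallback(hostname, resolvers):
    """Try each resolver callable in order; return the first answer.

    `resolvers` maps hostname -> IP string; here they stand in for the
    local CoreDNS cache and the 9.9.9.9 (Quad9) fallback.
    """
    last_error = None
    for resolve in resolvers:
        try:
            return resolve(hostname)
        except OSError as err:
            last_error = err  # remember the failure, try the next resolver
    raise last_error

# Toy resolvers for illustration: the "local" one fails, the fallback answers.
def local_coredns(name):
    raise OSError("local cache miss / timeout")

def quad9_fallback(name):
    return "203.0.113.10"  # documentation-range IP, not a real answer

print(resolve_with_fallback("example.internal", [local_coredns, quad9_fallback]))
```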
Let's see the result after a few days. @MarkEWaite @basil @timja don't hesitate to run big builds in the upcoming days so we'll see how the new DNS setup behaves.
I saw impressive results (Linux build down from 50s to 30s) on the jenkins-infra-team plugin, but it is not really a real-life use case.
We'll check the errors in the logs (Datadog), and I'll look into adding an alerting system for when such errors occur.
Update:
In the past 48h, the (private) ACP logs show 5 individual errors for ~4M successful requests. The ratio is way better than it used to be.
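For scale, the error ratio works out as follows (using the rounded figures above):

```python
errors = 5
requests = 4_000_000  # ~4M successful requests over 48h (rounded figure)

ratio = errors / requests
print(f"{ratio:.6%}")  # -> 0.000125%  (1.25 errors per million requests)
```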
These are still the `upstream timed out (110: Operation timed out) while connecting to upstream` error though. (There are also `an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/<...> while reading upstream` messages.) While expected for huge files bigger than the memory buffer window, it could be interesting to avoid writing these to disk. A nice-to-have improvement?

Given the good rate, let's clean up the public ACP resources (as there is no need to go back):
While it is an improvement, I still feel there might be further improvements:
Update:
Service(s)
ci.jenkins.io
Summary
https://ci.jenkins.io/job/Infra/job/pipeline-steps-doc-generator/job/PR-468/1/console failed to build with a report
Reproduction steps