Closed by MarkEWaite 3 weeks ago
https://github.com/jenkinsci/acceptance-test-harness/pull/1644 is failing with similar errors even after a retry
Thanks for raising this issue and for the details, folks! Datadog also indicates that ACP had issues between 06:00 pm UTC and 08:00 pm UTC yesterday (30 July 2024).
Checking the logs in Datadog shows there were a lot of HTTP/502 errors in that time window. Each HTTP/502 error (651 precisely) reported the following:
```
22#22: *265332 upstream timed out (110: Operation timed out) while connecting to upstream
```
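As a rough illustration (not the actual Datadog query used here), counting such upstream-timeout occurrences from raw nginx error-log lines could be sketched like this; the sample lines are made up:

```python
import re

# Pattern for the nginx upstream-timeout error reported above; the
# worker/connection IDs ("22#22: *265332") vary per occurrence.
TIMEOUT_RE = re.compile(
    r"upstream timed out \(110: Operation timed out\) while connecting to upstream"
)

def count_upstream_timeouts(log_lines):
    """Count nginx error-log lines reporting an upstream connect timeout."""
    return sum(1 for line in log_lines if TIMEOUT_RE.search(line))

# Hypothetical sample: one timeout line and one unrelated access line.
sample = [
    "22#22: *265332 upstream timed out (110: Operation timed out) while connecting to upstream",
    "22#23: *265333 GET /maven2/org/example/lib.jar HTTP/1.1 200",
]
print(count_upstream_timeouts(sample))  # -> 1
```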
The errors are spread across the 2 ACP services:
A few metrics collected for yesterday's time window:
Public ACP
Nodes:
Disk/network metrics clearly show a period of network activity due to ACP usage in the time window (both in and out), correlated with a peak of disk reads: this confirms it is ACP-related activity.
CPU/memory metrics show almost nothing (i.e. ACP performs well on these 2 metrics). Note: the "tiny" peak in requests is due to a service other than ACP, which was updated (rolling upgrade).
Pods metrics:
ACP alone clearly shows the same activity during the time window, with nominal CPU/memory usage:
Ingress metrics show 2 things:
Almost all of their outbound network rate is due to ACP transmitting data, while only half of their inbound rate is passed on to ACP. Not sure what kind of traffic is not transmitted (hard to tell, other than it being half of the network rate, which might be a low value).
The peak of network activity follows the same pattern as ACP.
Private ACP
Pod metrics show the same time-window activity with a lower average rate (which makes sense, as only ci.jenkins.io container agents are using this one today: there are fewer container builds, but still a few).
Node metrics (excluding the ci.jenkins.io agent nodes) show the same behavior as for the Public ACP: the CPU/memory impact is close to zero, but we clearly see a network rate correlated with this time window (which is expected).
What to do from here:
=> scaling it up won't change anything (shared resource for ingress and outbound) => we should find a solution to only use the private ACP and decommission the public ACP
Now that https://github.com/jenkins-infra/helpdesk/issues/4206 has been fixed, the ACP in the `publick8s` cluster should behave better.
Next steps:
Update on the ACP private only:
I've successfully set up a temporary internal LB with an IP in the ci.jio VM agents subnet to reach the private ACP.
`Network Contributor` on both ephemeral agents and kubernetes subnets:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: acp
  namespace: artifact-caching-proxy
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
    service.beta.kubernetes.io/azure-load-balancer-resource-group: "public-jenkins-sponsorship"
    service.beta.kubernetes.io/azure-load-balancer-internal-subnet: "public-jenkins-sponsorship-vnet-ci_jenkins_io_agents"
spec:
  type: LoadBalancer
  ports:
    - name: http
      port: 8080
      protocol: TCP
      targetPort: http
  selector:
    app.kubernetes.io/instance: artifact-caching-proxy
    app.kubernetes.io/name: artifact-caching-proxy
```
Next steps:
publick8s
Update:
Update:
* I failed to solve the chicken-and-egg problem between Terraform and Kubernetes Management:
  * With a PLS (and LB) managed by the Kubernetes Service, Terraform needs to specify the Azure RM permissions (before creating the LB) and the Private Endpoint with DNS configuration and NSG rules (but _after_ LB creation). If the LB changes its name/setup on the Kubernetes side, then Terraform will start failing because the data source won't be updated.
  * I was successful though, with a PLS in the Kubernetes node resource group (`MC.....` RG) along with its NIC, and defining in Terraform a PLS data source, with an associated endpoint in the other subnet and a private DNS `A` record (and NSGs) in the private DNS zone of the VNet. This setup would potentially be useful if we need to access this ACP from other VNets in Azure.
  * With a simple "internal LB", we still need Terraform to create the DNS record and NSG rules. And installing `external-dns` only to manage a private record looks overkill.
Proposal: let's specify the IP for the internal LB on the Terraform side and feed it to both Terraform and Kubernetes Management. => The constraints to select the proper IP are:
* Make sure it is in the same subnet as the ci.jio VM agents (easy peasy: we have the CIDR!)
* Make sure it is available: I chose the antepenultimate IP of the CIDR, as the Azure VM Jenkins plugin tends to select available IPs from the lower part of the CIDR range. It's not strict, but the probability of it picking an IP in the upper range is low.
* Why not the last one: long habit of network admins allocating the last IP of a range to an appliance...

=> This pattern makes it easy (no PLS, no NIC, no private endpoint, etc.).
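The "antepenultimate IP of the CIDR" choice can be sketched with Python's `ipaddress` module; the CIDR below is illustrative, not the real ci.jio agents subnet:

```python
import ipaddress

# Illustrative subnet; the actual ci.jio VM agents CIDR differs.
subnet = ipaddress.ip_network("10.0.2.0/24")
hosts = list(subnet.hosts())  # usable addresses: 10.0.2.1 .. 10.0.2.254

# Antepenultimate usable IP: the Azure VM agents plugin tends to pick
# from the lower end of the range, and the very last IP is habitually
# reserved for a network appliance.
internal_lb_ip = hosts[-3]
print(internal_lb_ip)  # -> 10.0.2.252
```

The resulting static IP would then be fed to both Terraform (DNS record, NSG rules) and Kubernetes Management (the Service's `loadBalancerIP`).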
Update: started implementation after a successful manual test.
Update:
=> Tests in progress, let's wait 2 days to see the results before deprovisioning public ACP
Update: more changes
=> Windows VM agents are now properly using the internal ACP, as verified in https://ci.jenkins.io/job/Plugins/job/jenkins-infra-test-plugin/job/master/246/pipeline-console/?selected-node=151
Next steps:
Update:
The `id` of a Maven mirror must not have special characters (a `WARNING` message was found in the build logs of Maven builds).
=> Now the (private) ACP is used, but I was able to reproduce the dreaded `(110: Operation timed out)` error quite quickly with the new workload (example: https://ci.jenkins.io/job/Plugins/job/jenkins-infra-test-plugin/job/master/257/).
It should be improved by https://github.com/jenkins-infra/kubernetes-management/pull/5525 (I did a lot of tests), which not only uses the local kube DNS by default (to let CoreDNS do its work and benefit from the local DNS cache) but also keeps using 9.9.9.9 as a fallback.
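The intended fallback behaviour can be sketched generically (this is a hypothetical helper, not the actual CoreDNS/nginx configuration): try the cluster-local resolver first, and only fall back to 9.9.9.9 on failure:

```python
def resolve_with_fallback(hostname, resolvers):
    """Try each resolver callable in order; return the first answer.

    `resolvers` maps hostname -> IP string; here they stand in for the
    local CoreDNS cache and the 9.9.9.9 (Quad9) fallback.
    """
    last_error = None
    for resolve in resolvers:
        try:
            return resolve(hostname)
        except OSError as err:
            last_error = err  # remember the failure, try the next resolver
    raise last_error

# Toy resolvers for illustration: the "local" one fails, the fallback answers.
def local_coredns(name):
    raise OSError("local cache miss / timeout")

def quad9_fallback(name):
    return "203.0.113.10"  # documentation-range IP, not a real answer

print(resolve_with_fallback("example.internal", [local_coredns, quad9_fallback]))
```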
Let's see the result after a few days. @MarkEWaite @basil @timja don't hesitate to run big builds in the upcoming days so we'll see how the new DNS setup behaves.
I saw impressive results (Linux build down from 50s to 30s) on the jenkins-infra-team plugin, but it is not really a real-life use case.
We'll check the errors in the logs (Datadog), and I'll look into adding an alerting system for when such errors occur.
Update:
In the past 48h, the (private) ACP logs show 5 individual errors for ~4M successful requests. The ratio is way better than it used to be.
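For scale, the error ratio works out as follows (using the rounded figures above):

```python
errors = 5
requests = 4_000_000  # ~4M successful requests over 48h (rounded figure)

ratio = errors / requests
print(f"{ratio:.6%}")  # -> 0.000125%  (1.25 errors per million requests)
```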
These are still the `upstream timed out (110: Operation timed out) while connecting to upstream` error though. (There are also `an upstream response is buffered to a temporary file /var/cache/nginx/proxy_temp/<...> while reading upstream` messages.) While expected for huge files bigger than the memory buffer window, it could be interesting to avoid writing these to disk. A nice-to-have improvement?

Given the good rate, let's clean up the public ACP resources (as there is no need to go back):
While it is an improvement, I still feel there might be further improvements:
Update:
Service(s)
ci.jenkins.io
Summary
https://ci.jenkins.io/job/Infra/job/pipeline-steps-doc-generator/job/PR-468/1/console failed to build with a report
Reproduction steps