I think the Azure function may need an update. It is getting the following error for the specific build you mention above:
2020-08-26T22:21:30Z [Information] Downloaded file size 500859
2020-08-26T22:21:30Z [Information] Parsed org/jenkins-ci/plugins/workflow/workflow-step-api/2.23-rc565.086a5d679110/workflow-step-api-2.23-rc565.086a5d679110.pom with url=https://github.com/jglick/workflow-step-api-plugin tag=086a5d679110c2999a47e2161cf4f483434dedad GAV=org.jenkins-ci.plugins.workflow:workflow-step-api:2.23-rc565.086a5d679110
2020-08-26T22:21:31Z [Error] Invalid archive
2020-08-26T22:21:31Z [Error] Error: ZIP error: Error: Wrong URL in /project/scm/url
This happens here: https://github.com/jenkins-infra/community-functions/blob/master/incrementals-publisher/lib/permissions.js#L62
I don't know the full extent of what would need to be updated, but the "url=https://github.com/jglick/workflow-step-api-plugin" is not matching what the Azure function is expecting.
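To make the failure mode concrete, here is a minimal Python sketch of a check that reads /project/scm/url from a POM and compares it with an expected repository URL. It is not the code in permissions.js. The function names, the exact-match comparison, and the expected URL used in the usage note are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Standard Maven POM namespace.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Sample POM using the url= value from the log above.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <scm>
    <url>https://github.com/jglick/workflow-step-api-plugin</url>
  </scm>
</project>"""


def scm_url(pom_xml: str) -> str:
    """Return the text of /project/scm/url, or raise ValueError if absent."""
    root = ET.fromstring(pom_xml)
    node = root.find("m:scm/m:url", POM_NS)
    if node is None or not node.text:
        raise ValueError("Missing /project/scm/url")
    return node.text.strip()


def check_scm_url(pom_xml: str, expected_url: str) -> None:
    """Hypothetical check: raise if the POM's SCM URL is not the expected one."""
    actual = scm_url(pom_xml)
    if actual != expected_url:
        raise ValueError(f"Wrong URL in /project/scm/url: {actual}")
```

Under these assumptions, check_scm_url(SAMPLE_POM, "https://github.com/jenkinsci/workflow-step-api-plugin") raises, because the POM points at the jglick fork. What the real function actually expects is not stated in this thread.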
The originally reported outage has apparently gone away. I filed https://github.com/jenkins-infra/community-functions/pull/23 for the above-mentioned error.
Now it is broken again with a 503: https://ci.jenkins.io/job/Plugins/job/matrix-auth-plugin/job/PR-85/1/execution/node/115/log/
When I look at the Azure portal, there are several builds that have succeeded. I wonder if we need to implement a retry option in the curl command that does the publishing.
I submitted this PR to add retries to the curl commands. I am not sure if this will completely solve the issue, but it may at least help if the 503 is intermittent.
Heh. Not exactly:
+ curl --retry 10 --retry-delay 10 -i -H Content-Type: application/json -d {"build_url":"https://ci.jenkins.io/job/Plugins/job/declarative-pipeline-migration-assistant-plugin/job/PR-37/2/"} https://jenkins-community-functions.azurewebsites.net/api/incrementals-publisher?clientId=default&code=****
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 143 100 29 100 114 94 372 --:--:-- --:--:-- --:--:-- 465
Warning: Transient problem: HTTP error Will retry in 10 seconds. 10 retries
Warning: left.
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 143 100 29 100 114 420 1652 --:--:-- --:--:-- --:--:-- 2072
Warning: Transient problem: HTTP error Will retry in 10 seconds. 9 retries
Warning: left.
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 143 100 29 100 114 391 1540 --:--:-- --:--:-- --:--:-- 1932
Warning: Transient problem: HTTP error Will retry in 10 seconds. 8 retries
Warning: left.
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 143 100 29 100 114 402 1583 --:--:-- --:--:-- --:--:-- 1986
Warning: Transient problem: HTTP error Will retry in 10 seconds. 7 retries
Warning: left.
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 143 100 29 100 114 391 1540 --:--:-- --:--:-- --:--:-- 1958
Warning: Transient problem: HTTP error Will retry in 10 seconds. 6 retries
Warning: left.
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 143 100 29 100 114 420 1652 --:--:-- --:--:-- --:--:-- 2072
Warning: Transient problem: HTTP error Will retry in 10 seconds. 5 retries
Warning: left.
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 143 100 29 100 114 408 1605 --:--:-- --:--:-- --:--:-- 2014
Warning: Transient problem: HTTP error Will retry in 10 seconds. 4 retries
Warning: left.
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 143 100 29 100 114 273 1075 --:--:-- --:--:-- --:--:-- 1349
Warning: Transient problem: HTTP error Will retry in 10 seconds. 3 retries
Warning: left.
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 143 100 29 100 114 311 1225 --:--:-- --:--:-- --:--:-- 1537
Warning: Transient problem: HTTP error Will retry in 10 seconds. 2 retries
Warning: left.
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 143 100 29 100 114 391 1540 --:--:-- --:--:-- --:--:-- 1906
100 143 100 29 100 114 391 1540 --:--:-- --:--:-- --:--:-- 1906
Warning: Transient problem: HTTP error Will retry in 10 seconds. 1 retries
Warning: left.
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 143 100 29 100 114 414 1628 --:--:-- --:--:-- --:--:-- 2042
HTTP/1.1 503 Service Unavailable
Content-Length: 29
Date: Tue, 01 Sep 2020 21:21:01 GMT
Function host is not running.
HTTP/1.1 503 Service Unavailable
Content-Length: 29
Date: Tue, 01 Sep 2020 21:21:11 GMT
Function host is not running.
HTTP/1.1 503 Service Unavailable
Content-Length: 29
Date: Tue, 01 Sep 2020 21:21:21 GMT
Function host is not running.
HTTP/1.1 503 Service Unavailable
Content-Length: 29
Date: Tue, 01 Sep 2020 21:21:31 GMT
Function host is not running.
HTTP/1.1 503 Service Unavailable
Content-Length: 29
Date: Tue, 01 Sep 2020 21:21:41 GMT
Function host is not running.
HTTP/1.1 503 Service Unavailable
Content-Length: 29
Date: Tue, 01 Sep 2020 21:21:51 GMT
Function host is not running.
HTTP/1.1 503 Service Unavailable
Content-Length: 29
Date: Tue, 01 Sep 2020 21:22:01 GMT
Function host is not running.
HTTP/1.1 503 Service Unavailable
Content-Length: 29
Date: Tue, 01 Sep 2020 21:22:11 GMT
Function host is not running.
HTTP/1.1 503 Service Unavailable
Content-Length: 29
Date: Tue, 01 Sep 2020 21:22:22 GMT
Function host is not running.
HTTP/1.1 503 Service Unavailable
Content-Length: 29
Date: Tue, 01 Sep 2020 21:22:32 GMT
Function host is not running.
HTTP/1.1 503 Service Unavailable
Content-Length: 29
Date: Tue, 01 Sep 2020 21:22:42 GMT
Function host is not running.
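For reference, the retry behavior visible in the log above (one initial request plus 10 retries, 10 seconds apart, all answered with 503) can be sketched in Python roughly as follows. This is only an illustration, not the code used by the build. The name post_with_retries, the injected send callable, and the set of status codes treated as transient are assumptions made for this sketch.

```python
import time
from typing import Callable

# Status codes treated as transient in this sketch (an assumption).
TRANSIENT_STATUSES = {502, 503, 504}


def post_with_retries(
    send: Callable[[], int],
    retries: int = 10,
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-transient status or retries run out.

    send is a hypothetical callable that performs the POST and returns the
    HTTP status code. The last status code seen is returned.
    """
    status = send()
    attempts_left = retries
    while status in TRANSIENT_STATUSES and attempts_left > 0:
        sleep(delay)  # fixed delay between attempts, like --retry-delay
        attempts_left -= 1
        status = send()
    return status
```

If the function host stays down longer than retries times delay (about 100 seconds with these values), the loop simply returns the last 503, which is what the build log shows.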
Well, that's a bummer. I am not sure what else to look at. We may need to wait until someone who knows more about the process is back from vacation.
Which could be weeks away. That is what I was afraid of.
It was working earlier today, then stopped. Seems like this service on Azure just flakes out.
I have a login to portal.azure.com and can see jenkins-community-functions, but I think I do not have very good permissions, and I have never managed to get any useful information out of the dozens of screens available there, such as a log showing why my function is down. I can get a web cmd.exe shell in the function, for whatever that is worth. Wish we could switch to GCP Cloud Run or something a bit simpler.
Right now the service is giving me a 401 as expected with a bogus code (I do not know the real code), rather than a 503, so maybe it is back up.
Agreed, I looked at some metrics and saw that the connection rate went south at around 12:27PM MST. At the same time it looks like 503 errors went up significantly. In looking around at logs I could find, I couldn't really see any sort of info as to WHY it was happening. I will continue looking.
This build successfully published to incrementals about 10 minutes ago: https://ci.jenkins.io/job/Plugins/job/nodejs-plugin/view/change-requests/job/PR-34/3/console
I see the success in the logs.
Yes, I started that one. Seems to be back up for now.
Did you notice any failures last night? From looking at the metrics, it looks like it was working all night, but if you noticed any failures, it would be good to know so I can try and correlate with the logs.
Do not think I used it after my last comment. Will post here if I notice something else, with Jenkins build step logs for correlation.
I'll continue monitoring the metrics and logs on the Azure side.
I do notice several cases where the zip file of artifacts seems to be empty and there is this in the logs:
2020-09-02T13:40:01Z [Information] Downloaded https://ci.jenkins.io/job/Plugins/job/credentials-plugin/job/PR-137/26/artifact/**/*-rc*.90eb299e16d3/*-rc*.90eb299e16d3*/*zip*/archive.zip D:\local\Temp\incrementals-HBkjAE\archive.zip
2020-09-02T13:40:01Z [Information] Downloaded file size 22
2020-09-02T13:40:01Z [Error] Empty archive
When I go to that build, the artifacts have a different hash than what is being requested by the incrementals publisher function on Azure, e.g., credentials-2.3.14-rc867.87f7bb89676b-javadoc.jar
I am not super familiar with incrementals, so I don't know if that is expected or not. I see many of those today. I don't think it is related to this issue, but I am not sure.
That just happens when the PR is not up to date with its base branch IIRC.
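A side note on the "Downloaded file size 22" line: 22 bytes is the size of a ZIP archive with no entries (only the end-of-central-directory record and no comment), which fits the "Empty archive" error. A small Python check using only the standard zipfile module follows. The helper name is_empty_zip is hypothetical and is not taken from the function's code.

```python
import io
import zipfile


def is_empty_zip(data: bytes) -> bool:
    """Return True if data is a valid ZIP archive that contains no entries."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return len(archive.namelist()) == 0


# Build an empty archive in memory to look at its size.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w"):
    pass  # write no entries
empty = buffer.getvalue()

print(len(empty), is_empty_zip(empty))
```

So a 22-byte download suggests Jenkins found no artifacts matching the requested hash pattern, rather than a truncated or corrupted transfer.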
https://ci.jenkins.io/job/Core/job/jenkins/job/PR-4848/85/execution/node/181/log/ broke deployment but not due to the function AFAICT; rather looks like an EC2 agent failure.
Gavin Mogan has kindly looked at migrating the incrementals-publisher code off of Azure Functions to a simple web app. This should make it more reliable and controllable. He's working on creating a Docker image of his implementation that we can publish to Docker Hub and then use.
Sounds good. We could run it in the same cluster as ci.jenkins.io I suppose.
https://ci.jenkins.io/job/Plugins/job/blueocean-plugin/job/master/149/execution/node/109/log/ failed with
ENOSPC: no space left on device, mkdtemp 'D:\local\Temp\incrementals-XXXXXX'
It occurs to me that this could also be rewritten to be a bespoke Jenkins plugin with a Pipeline step, rather than a curl call on a node. All of the code which currently queries the Jenkins REST API for build metadata would be replaced by simple Java code, and the deployable artifacts would be available locally. The Artifactory token would need to be kept in SYSTEM-scope credentials so it is not accessible by projects directly. Would be willing to take a stab at writing that if I can get some assurance that an admin is ready to install & configure it and provide me debugging logs.
I think Gavin Mogan is almost ready with his port to run on the Jenkins infra. Can we test that before we go down the plugin route?
Oh sure, that would be fine too.
So this is fully deployed - https://github.com/jenkins-infra/pipeline-library/pull/163#pullrequestreview-483168019
Tim Jacomb tested it with theme manager and it seems to work.
We have logs in Grafana/Loki so we can see when there are issues. When more infra people are back I might even try to set up Sentry to get early alerts of issues.
Let me know if you see any more issues.
If you want access to logs, request VPN access and say you want access to logs in Grafana, and we can get that set up.
https://github.com/jenkins-infra/openvpn#howto-get-client-access
Very good news. I doubt I need to get VPN access myself, so long as there are others around who have access to logs and can merge PRs as needed.
Where is the port?
Port for what?
[Originally blocks: JENKINS-58716]
https://ci.jenkins.io/job/Plugins/job/workflow-step-api-plugin/job/PR-58/4/execution/node/109/log/ broke with
Originally reported by jglick, imported from: Incrementals deployment broken with 503