jenkins-infra / helpdesk

Open your Infrastructure related issues here for the Jenkins project
https://github.com/jenkins-infra/helpdesk/issues/new/choose

[INFRA-2720] Incrementals deployment broken with 503 #2293

Closed jenkins-infra-bot closed 4 years ago

jenkins-infra-bot commented 4 years ago

https://ci.jenkins.io/job/Plugins/job/workflow-step-api-plugin/job/PR-58/4/execution/node/109/log/ broke with

 + curl -i -H Content-Type: application/json -d {"build_url":"https://ci.jenkins.io/job/Plugins/job/workflow-step-api-plugin/job/PR-58/4/"} https://jenkins-community-functions.azurewebsites.net/api/incrementals-publisher?clientId=default&code=****
   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
  Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:03 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:04 --:--:--     0
  0     0    0     0    0     0      0      0 --:--:--  0:00:05 --:--:--     0
100   120  100    29  100    91      4     15  0:00:07  0:00:05  0:00:02    27
 HTTP/1.1 503 Service Unavailable
 Content-Length: 29
 Date: Mon, 24 Aug 2020 18:28:15 GMT

 Function host is not running.

Originally reported by jglick, imported from: Incrementals deployment broken with 503
  • assignee: timja
  • status: Resolved
  • priority: Critical
  • resolution: Fixed
  • resolved: 2020-09-07T22:16:38+02:00
  • imported: 2022/01/10
jenkins-infra-bot commented 4 years ago

slide_o_mix:

I think the Azure function may need an update; it is hitting the following error for the specific build you mention above.

2020-08-26T22:21:30Z   [Information]   Downloaded file size 500859
2020-08-26T22:21:30Z   [Information]   Parsed org/jenkins-ci/plugins/workflow/workflow-step-api/2.23-rc565.086a5d679110/workflow-step-api-2.23-rc565.086a5d679110.pom with url=https://github.com/jglick/workflow-step-api-plugin tag=086a5d679110c2999a47e2161cf4f483434dedad GAV=org.jenkins-ci.plugins.workflow:workflow-step-api:2.23-rc565.086a5d679110
2020-08-26T22:21:31Z   [Error]   Invalid archive
2020-08-26T22:21:31Z   [Error]   Error: ZIP error: Error: Wrong URL in /project/scm/url

This happens here: https://github.com/jenkins-infra/community-functions/blob/master/incrementals-publisher/lib/permissions.js#L62

I don't know the full extent of what would need to be updated, but the "url=https://github.com/jglick/workflow-step-api-plugin" is not matching what the Azure function is expecting.
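
To make the failing check concrete, here is a minimal sketch in Python (the real check lives in the JavaScript file linked above and may differ). It reads `/project/scm/url` from a POM and rejects it when it does not start with an expected prefix. The prefix `https://github.com/jenkinsci/` and the names `scm_url` and `check_scm_url` are assumptions for illustration only; the thread does not state what the function actually expects.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Maven POM files declare this default XML namespace.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Hypothetical expected prefix; the real rule is not given in this thread.
ALLOWED_PREFIX = "https://github.com/jenkinsci/"


def scm_url(pom_xml: str) -> Optional[str]:
    """Return the text of /project/scm/url, or None if it is absent."""
    root = ET.fromstring(pom_xml)
    node = root.find("m:scm/m:url", POM_NS)
    if node is None or not node.text:
        return None
    return node.text.strip()


def check_scm_url(pom_xml: str, allowed_prefix: str = ALLOWED_PREFIX) -> None:
    """Raise ValueError when the SCM URL is missing or has another prefix."""
    url = scm_url(pom_xml)
    if url is None or not url.startswith(allowed_prefix):
        raise ValueError("Wrong URL in /project/scm/url")
```

Under that assumed rule, a POM built from a fork such as `https://github.com/jglick/workflow-step-api-plugin` would be rejected, which matches the symptom in the log.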

jenkins-infra-bot commented 4 years ago

jglick:

The originally reported outage has apparently gone away. I filed https://github.com/jenkins-infra/community-functions/pull/23 for the above-mentioned error.

jenkins-infra-bot commented 4 years ago

jglick:

Now it is broken again with a 503: https://ci.jenkins.io/job/Plugins/job/matrix-auth-plugin/job/PR-85/1/execution/node/115/log/

jenkins-infra-bot commented 4 years ago

slide_o_mix:

When I look at the Azure portal, there are several builds that have succeeded. I wonder if we need to implement a retry option in the curl command that does the publishing.

jenkins-infra-bot commented 4 years ago

slide_o_mix:

I submitted this PR to add retries to the curl commands. I am not sure if this will completely solve the issue, but it may at least help if the 503 is intermittent.

https://github.com/jenkins-infra/pipeline-library/pull/162

jenkins-infra-bot commented 4 years ago

jglick:

Heh. Not exactly:

 + curl --retry 10 --retry-delay 10 -i -H Content-Type: application/json -d {"build_url":"https://ci.jenkins.io/job/Plugins/job/declarative-pipeline-migration-assistant-plugin/job/PR-37/2/"} https://jenkins-community-functions.azurewebsites.net/api/incrementals-publisher?clientId=default&code=****
   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
  Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   143  100    29  100   114     94    372 --:--:-- --:--:-- --:--:--   465
 Warning: Transient problem: HTTP error Will retry in 10 seconds. 10 retries 
 Warning: left.

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   143  100    29  100   114    420   1652 --:--:-- --:--:-- --:--:--  2072
 Warning: Transient problem: HTTP error Will retry in 10 seconds. 9 retries 
 Warning: left.

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   143  100    29  100   114    391   1540 --:--:-- --:--:-- --:--:--  1932
 Warning: Transient problem: HTTP error Will retry in 10 seconds. 8 retries 
 Warning: left.

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   143  100    29  100   114    402   1583 --:--:-- --:--:-- --:--:--  1986
 Warning: Transient problem: HTTP error Will retry in 10 seconds. 7 retries 
 Warning: left.

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   143  100    29  100   114    391   1540 --:--:-- --:--:-- --:--:--  1958
 Warning: Transient problem: HTTP error Will retry in 10 seconds. 6 retries 
 Warning: left.

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   143  100    29  100   114    420   1652 --:--:-- --:--:-- --:--:--  2072
 Warning: Transient problem: HTTP error Will retry in 10 seconds. 5 retries 
 Warning: left.

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   143  100    29  100   114    408   1605 --:--:-- --:--:-- --:--:--  2014
 Warning: Transient problem: HTTP error Will retry in 10 seconds. 4 retries 
 Warning: left.

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   143  100    29  100   114    273   1075 --:--:-- --:--:-- --:--:--  1349
 Warning: Transient problem: HTTP error Will retry in 10 seconds. 3 retries 
 Warning: left.

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   143  100    29  100   114    311   1225 --:--:-- --:--:-- --:--:--  1537
 Warning: Transient problem: HTTP error Will retry in 10 seconds. 2 retries 
 Warning: left.

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   143  100    29  100   114    391   1540 --:--:-- --:--:-- --:--:--  1906
100   143  100    29  100   114    391   1540 --:--:-- --:--:-- --:--:--  1906
 Warning: Transient problem: HTTP error Will retry in 10 seconds. 1 retries 
 Warning: left.

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   143  100    29  100   114    414   1628 --:--:-- --:--:-- --:--:--  2042
 HTTP/1.1 503 Service Unavailable
 Content-Length: 29
 Date: Tue, 01 Sep 2020 21:21:01 GMT

 Function host is not running.HTTP/1.1 503 Service Unavailable
 Content-Length: 29
 Date: Tue, 01 Sep 2020 21:21:11 GMT

 Function host is not running.HTTP/1.1 503 Service Unavailable
 Content-Length: 29
 Date: Tue, 01 Sep 2020 21:21:21 GMT

 Function host is not running.HTTP/1.1 503 Service Unavailable
 Content-Length: 29
 Date: Tue, 01 Sep 2020 21:21:31 GMT

 Function host is not running.HTTP/1.1 503 Service Unavailable
 Content-Length: 29
 Date: Tue, 01 Sep 2020 21:21:41 GMT

 Function host is not running.HTTP/1.1 503 Service Unavailable
 Content-Length: 29
 Date: Tue, 01 Sep 2020 21:21:51 GMT

 Function host is not running.HTTP/1.1 503 Service Unavailable
 Content-Length: 29
 Date: Tue, 01 Sep 2020 21:22:01 GMT

 Function host is not running.HTTP/1.1 503 Service Unavailable
 Content-Length: 29
 Date: Tue, 01 Sep 2020 21:22:11 GMT

 Function host is not running.HTTP/1.1 503 Service Unavailable
 Content-Length: 29
 Date: Tue, 01 Sep 2020 21:22:22 GMT

 Function host is not running.HTTP/1.1 503 Service Unavailable
 Content-Length: 29
 Date: Tue, 01 Sep 2020 21:22:32 GMT

 Function host is not running.HTTP/1.1 503 Service Unavailable
 Content-Length: 29
 Date: Tue, 01 Sep 2020 21:22:42 GMT

 Function host is not running.
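
The log above shows 11 attempts (one initial request plus 10 retries) spread over roughly 100 seconds, every one answered with a 503. A fixed-delay retry only helps when the outage is shorter than that window. The sketch below models the same behavior in Python so the arithmetic is easy to check. The names `post_with_retries` and `send` are hypothetical, and the set of retryable status codes is an assumption loosely based on the codes curl documents as transient.

```python
import time
from typing import Callable, Tuple

# Assumed set of status codes worth retrying (not taken from the thread).
RETRYABLE = {408, 429, 500, 502, 503, 504}


def post_with_retries(
    send: Callable[[], Tuple[int, str]],
    retries: int = 10,
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str, int]:
    """Call send() until it returns a non-retryable status or retries run out.

    Returns (status, body, attempts). With retries=10 the function makes at
    most 11 attempts and sleeps at most 10 * delay seconds in total.
    """
    attempts = 0
    while True:
        status, body = send()
        attempts += 1
        if status not in RETRYABLE or attempts > retries:
            return status, body, attempts
        sleep(delay)
```

Passing a fake `send` and a fake `sleep` makes it possible to confirm the retry window without touching the network. Covering an outage that lasts hours would need a much longer window or a retry of the whole publishing step.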
jenkins-infra-bot commented 4 years ago

slide_o_mix:

Well, that's a bummer. I am not sure what else to look at. We may need to wait until someone who knows more about the process is back from vacation.

jenkins-infra-bot commented 4 years ago

jglick:

Which could be weeks away. That is what I was afraid of.

It was working earlier today, then stopped. Seems like this service on Azure just flakes out.

jenkins-infra-bot commented 4 years ago

jglick:

I have a login to portal.azure.com and can see jenkins-community-functions, but I think I do not have very good permissions, and I have never managed to get any useful information out of the dozens of screens available there, such as a log showing why my function is down. I can get a web cmd.exe shell in the function, for whatever that is worth. Wish we could switch to GCP Cloud Run or something a bit simpler.

Right now the service is giving me a 401 as expected with a bogus code (I do not know the real code), rather than a 503, so maybe it is back up.

jenkins-infra-bot commented 4 years ago

slide_o_mix:

Agreed. I looked at some metrics and saw that the connection rate went south at around 12:27 PM MST. At the same time, it looks like 503 errors went up significantly. Looking through the logs I could find, I couldn't see any information as to WHY it was happening. I will continue looking.

jenkins-infra-bot commented 4 years ago

slide_o_mix:

This build successfully published to incrementals about 10 minutes ago: https://ci.jenkins.io/job/Plugins/job/nodejs-plugin/view/change-requests/job/PR-34/3/console

I see the success in the logs.

jenkins-infra-bot commented 4 years ago

jglick:

Yes, I started that one. Seems to be back up for now.

jenkins-infra-bot commented 4 years ago

slide_o_mix:

Did you notice any failures last night? From looking at the metrics, it looks like it was working all night, but if you noticed any failures, it would be good to know so I can try and correlate with the logs.

jenkins-infra-bot commented 4 years ago

jglick:

Do not think I used it after my last comment. Will post here if I notice something else, with Jenkins build step logs for correlation.

jenkins-infra-bot commented 4 years ago

slide_o_mix:

I'll continue monitoring the metrics and logs on the Azure side.

jenkins-infra-bot commented 4 years ago

slide_o_mix:

I do notice several cases where the zip file of artifacts seems to be empty and there is this in the logs:

2020-09-02T13:40:01Z   [Information]   Downloaded https://ci.jenkins.io/job/Plugins/job/credentials-plugin/job/PR-137/26/artifact/**/*-rc*.90eb299e16d3/*-rc*.90eb299e16d3*/*zip*/archive.zip D:\local\Temp\incrementals-HBkjAE\archive.zip
2020-09-02T13:40:01Z   [Information]   Downloaded file size 22
2020-09-02T13:40:01Z   [Error]   Empty archive

When I go to that build, the artifacts have a different hash than what is being requested by the incrementals publisher function on Azure, e.g., credentials-2.3.14-rc867.87f7bb89676b-javadoc.jar

I am not super familiar with incrementals, so I don't know if that is expected or not. I see many of those today. I don't think it is related to this issue, but I am not sure.
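
For context on the "Downloaded file size 22" line: a ZIP archive with no entries and no comment consists only of its 22-byte end-of-central-directory record, so a 22-byte download is consistent with an archive that matched no artifacts. The Python sketch below demonstrates this. How the publisher itself detects the empty archive is not shown in this thread, and `is_empty_zip` and `make_empty_zip` are hypothetical helper names.

```python
import io
import zipfile


def make_empty_zip() -> bytes:
    """Build a ZIP archive with no entries, entirely in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w"):
        pass  # write nothing; closing emits only the end record
    return buf.getvalue()


def is_empty_zip(data: bytes) -> bool:
    """Return True when the archive is valid but contains no entries."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return len(zf.namelist()) == 0


if __name__ == "__main__":
    empty = make_empty_zip()
    print(len(empty), is_empty_zip(empty))
```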

jenkins-infra-bot commented 4 years ago

jglick:

That just happens when the PR is not up to date with its base branch IIRC.

jenkins-infra-bot commented 4 years ago

jglick:

https://ci.jenkins.io/job/Core/job/jenkins/job/PR-4848/85/execution/node/181/log/ broke deployment but not due to the function AFAICT; rather looks like an EC2 agent failure.

jenkins-infra-bot commented 4 years ago

slide_o_mix:

Gavin Mogan has kindly looked at migrating the incrementals-publisher code off Azure Functions to a simple web app. This should make it more reliable and controllable. He's working on creating a Docker image of his implementation that we can publish to Docker Hub and then use.

jenkins-infra-bot commented 4 years ago

jglick:

Sounds good. We could run it in the same cluster as ci.jenkins.io I suppose.

jenkins-infra-bot commented 4 years ago

jglick:

https://ci.jenkins.io/job/Plugins/job/blueocean-plugin/job/master/149/execution/node/109/log/ failed with

ENOSPC: no space left on device, mkdtemp 'D:\local\Temp\incrementals-XXXXXX'
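
One possible contributor to an ENOSPC failure at `mkdtemp` is temporary directories left behind by earlier runs. The thread does not confirm that this is the cause here, so treat the following as a sketch of a general precaution, written in Python rather than the function's JavaScript. It scopes a temporary working directory so that it is removed even when processing fails. The name `with_temp_dir` is hypothetical.

```python
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def with_temp_dir(work: Callable[[str], T], prefix: str = "incrementals-") -> T:
    """Run work(path) inside a fresh temporary directory.

    The directory and everything in it are deleted when work returns
    or raises, so failed runs do not accumulate files on disk.
    """
    with tempfile.TemporaryDirectory(prefix=prefix) as tmp:
        return work(tmp)
```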

It occurs to me that this could also be rewritten to be a bespoke Jenkins plugin with a Pipeline step, rather than a curl call on a node. All of the code which currently queries the Jenkins REST API for build metadata would be replaced by simple Java code, and the deployable artifacts would be available locally. The Artifactory token would need to be kept in SYSTEM-scope credentials so it is not accessible by projects directly. Would be willing to take a stab at writing that if I can get some assurance that an admin is ready to install & configure it and provide me debugging logs.

jenkins-infra-bot commented 4 years ago

slide_o_mix:

I think Gavin Mogan is almost ready with his port to run on the Jenkins infra. Can we test that before we go down the plugin route?

jenkins-infra-bot commented 4 years ago

jglick:

Oh sure, that would be fine too.

jenkins-infra-bot commented 4 years ago

halkeye:

So this is fully deployed - https://github.com/jenkins-infra/pipeline-library/pull/163#pullrequestreview-483168019

Tim Jacomb tested it with theme manager and it seems to work.

We have logs in Grafana/Loki so we can see when there are issues. When more infra people are back I might even try to set up Sentry to get early alerts of issues.

jenkins-infra-bot commented 4 years ago

timja:

Let me know if you see any more issues.

If you want access to logs, request VPN access and say you want access to logs in Grafana, and we can get that set up.

https://github.com/jenkins-infra/openvpn#howto-get-client-access

jenkins-infra-bot commented 4 years ago

jglick:

Very good news. I doubt I need to get VPN access myself, so long as there are others around who have access to logs and can merge PRs as needed.

jenkins-infra-bot commented 4 years ago

jglick:

Where is the port?

jenkins-infra-bot commented 4 years ago

timja:

Port for what?

jenkins-infra-bot commented 4 years ago

timja:

https://github.com/jenkins-infra/incrementals-publisher

jenkins-infra-bot commented 2 years ago

[Originally blocks: JENKINS-58716]