openshiftio / openshift.io

Red Hat OpenShift.io is an end-to-end development environment for planning, building and deploying modern applications.
https://openshift.io

If a user has multiple builds running at the same time, one or both will fail ("Unable to build the image using the OpenShift build service") #2729

Closed ldimaggi closed 6 years ago

ldimaggi commented 6 years ago

Steps to recreate: run two or more pipeline builds at the same time.

One or both of the pipelines will fail with this error:

EXITCODE 0
[ERROR] F8: Failed to execute the build [Unable to build the image using the OpenShift build service]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 06:38 min
[INFO] Finished at: 2018-03-22T15:17:00+00:00
[INFO] Final Memory: 36M/53M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal io.fabric8:fabric8-maven-plugin:3.5.38:build (fmp) on project testmar221521731018217: Failed to execute the build: Unable to build the image using the OpenShift build service: An error has occurred. timeout: Socket closed -> [Help 1]
ldimaggi commented 6 years ago

org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal io.fabric8:fabric8-maven-plugin:3.5.38:build (fmp) on project testmar221521731018217: Failed to execute the build
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
    at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
    at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
    at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
    at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
    at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.MojoExecutionException: Failed to execute the build
    at io.fabric8.maven.plugin.mojo.build.BuildMojo.buildAndTag(BuildMojo.java:270)
    at io.fabric8.maven.docker.BuildMojo.executeInternal(BuildMojo.java:44)
    at io.fabric8.maven.plugin.mojo.build.BuildMojo.executeInternal(BuildMojo.java:228)
    at io.fabric8.maven.docker.AbstractDockerMojo.execute(AbstractDockerMojo.java:223)
    at io.fabric8.maven.plugin.mojo.build.BuildMojo.execute(BuildMojo.java:199)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
    ... 20 more
Caused by: io.fabric8.maven.core.service.Fabric8ServiceException: Unable to build the image using the OpenShift build service
    at io.fabric8.maven.core.service.openshift.OpenshiftBuildService.build(OpenshiftBuildService.java:121)
    at io.fabric8.maven.plugin.mojo.build.BuildMojo.buildAndTag(BuildMojo.java:267)
    ... 26 more
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:62)
    at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53)
    at io.fabric8.openshift.client.dsl.internal.BuildConfigOperationsImpl.fromInputStream(BuildConfigOperationsImpl.java:276)
    at io.fabric8.openshift.client.dsl.internal.BuildConfigOperationsImpl.fromFile(BuildConfigOperationsImpl.java:231)
    at io.fabric8.openshift.client.dsl.internal.BuildConfigOperationsImpl.fromFile(BuildConfigOperationsImpl.java:68)
    at io.fabric8.maven.core.service.openshift.OpenshiftBuildService.startBuild(OpenshiftBuildService.java:361)
    at io.fabric8.maven.core.service.openshift.OpenshiftBuildService.build(OpenshiftBuildService.java:111)
    ... 27 more
Caused by: java.net.SocketTimeoutException: timeout
    at okio.Okio$4.newTimeoutException(Okio.java:230)
    at okio.AsyncTimeout.exit(AsyncTimeout.java:285)
    at okio.AsyncTimeout$2.read(AsyncTimeout.java:241)
    at okio.RealBufferedSource.indexOf(RealBufferedSource.java:345)
    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:217)
    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:211)
    at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:187)
    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:61)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:125)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at io.fabric8.openshift.client.internal.OpenShiftOAuthInterceptor.intercept(OpenShiftOAuthInterceptor.java:66)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
    at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:200)
    at okhttp3.RealCall.execute(RealCall.java:77)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:377)
    at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:359)
    at io.fabric8.openshift.client.dsl.internal.BuildConfigOperationsImpl.fromInputStream(BuildConfigOperationsImpl.java:274)
    ... 31 more
Caused by: java.net.SocketException: Socket closed
    at java.net.SocketInputStream.read(SocketInputStream.java:204)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
    at sun.security.ssl.InputRecord.read(InputRecord.java:503)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
    at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:940)
    at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
    at okio.Okio$2.read(Okio.java:139)
    at okio.AsyncTimeout$2.read(AsyncTimeout.java:237)
    ... 56 more
krishnapaparaju commented 6 years ago

@chmouel Does OSIO currently support parallel builds for a tenant? @kishansagathiya Is there a way to figure out, for a given tenant, whether an OSIO build is in progress?

chmouel commented 6 years ago

@krishnapaparaju AFAIK no; @kbsingh pointed this out to me some time ago.

ldimaggi commented 6 years ago

What makes this problem difficult for users is that the competing builds can be in different spaces, so that it is easy to overlook one or more of them.

ppitonak commented 6 years ago

I don't think that this issue is occurring only when multiple builds are running at the same time. I reset my account, created a new project and build failed.

ldimaggi commented 6 years ago

I have seen that too - are you sure that all the build configs were "cleaned out" of OSO by the reset? That's just a guess, but I think I have seen a build config that was not cleaned up by the reset.
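For anyone checking this after a reset, a minimal sketch of how to list any leftover build configs with the fabric8 openshift-client (the namespace name is a placeholder for the tenant namespace):

    import io.fabric8.openshift.api.model.BuildConfig;
    import io.fabric8.openshift.client.DefaultOpenShiftClient;
    import io.fabric8.openshift.client.OpenShiftClient;

    public class LeftoverBuildConfigs {
        public static void main(String[] args) {
            // Assumes the current kubeconfig context points at the user's OSO cluster.
            try (OpenShiftClient client = new DefaultOpenShiftClient()) {
                // "myuser" is a placeholder for the tenant namespace.
                for (BuildConfig bc : client.buildConfigs().inNamespace("myuser").list().getItems()) {
                    // Anything printed here survived the account reset.
                    System.out.println(bc.getMetadata().getName());
                }
            }
        }
    }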


hrishin commented 6 years ago

What makes this problem difficult for users is that the competing builds can be in different spaces, so that it is easy to overlook one or more of them.

Even if the user runs the two builds in the same space, it will hit the same issue.

Issue

The primary issue is the resource quota limitation on OSO when more than one build starts running. Two concurrent builds try to spin up at least 4 pods, i.e. 2 Jenkins slave pods and 2 build pods.

Eventually, the build reports the following status event on OSO:

:58:33 PM   Warning Failed Create   Error creating: pods "multi1build-s2i-1-build" is forbidden: exceeded quota: compute-resources-timebound, requested: limits.cpu=1,limits.memory=512Mi, used: limits.cpu=3500m,limits.memory=1792Mi, limited: limits.cpu=4,limits.memory=2Gi

After some time the build status changes [screenshot] and results in:

Caused by: java.net.SocketTimeoutException: timeout

Edit: Is it worth switching from Deployment to DeploymentConfig (https://github.com/fabric8io/fabric8-maven-plugin/issues/1042) to reduce pod consumption during the build process?
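For reference, a minimal sketch (using the fabric8 kubernetes-client; the namespace name is a placeholder) of how the quota usage shown in the event above can be inspected programmatically:

    import io.fabric8.kubernetes.api.model.ResourceQuota;
    import io.fabric8.kubernetes.client.DefaultKubernetesClient;
    import io.fabric8.kubernetes.client.KubernetesClient;

    public class QuotaCheck {
        public static void main(String[] args) {
            // Assumes the current kubeconfig context points at the user's OSO cluster.
            try (KubernetesClient client = new DefaultKubernetesClient()) {
                // "myuser-jenkins" is a placeholder; substitute the tenant namespace.
                for (ResourceQuota quota : client.resourceQuotas()
                        .inNamespace("myuser-jenkins").list().getItems()) {
                    System.out.println(quota.getMetadata().getName());
                    // 'used' vs 'hard' mirrors the values in the quota-exceeded event above.
                    System.out.println("  used: " + quota.getStatus().getUsed());
                    System.out.println("  hard: " + quota.getStatus().getHard());
                }
            }
        }
    }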

maxandersen commented 6 years ago

Let's make sure we are fixing this at the right level. Parallel builds should be supported on OSIO, no question about that - but users might not have enough resources to run them before upgrading.

In build.next we can hopefully manage this for the user; but until then, would this be a matter of enabling something like https://wiki.jenkins.io/display/JENKINS/Throttle+Concurrent+Builds+Plugin or similar, so that by default the limit of concurrent builds is 1 and everything else gets queued?

hrishin commented 6 years ago

In build.next we can hopefully manage this for the user; but until then would this be a matter of enabling something like https://wiki.jenkins.io/display/JENKINS/Throttle+Concurrent+Builds+Plugin or similar [...]

Queuing and enforcing build execution policies could be useful for build.next as well. Thanks for this plugin, will evaluate it for our use case.

hrishin commented 6 years ago

Update

We've evaluated the Jenkins throttling plugin. It's possible to throttle parallel builds, but we need to consider the following:

If two concurrent builds start running, this is how the pending job looks: [screenshot]

maxandersen commented 6 years ago

It's unfortunate if the throttling has to be defined by the user in their jobs - that kind of defeats the purpose.

hrishin commented 6 years ago

Update:

We have evaluated multiple approaches to limit concurrent builds for the time being:

  1. Jenkins throttling plugin
  2. Changing Jenkins configuration to limit the concurrent build execution
  3. Using the Jenkins proxy

Approach 2 is the more feasible option, wherein we change <instanceCap> to 1 (https://github.com/fabric8-services/fabric8-tenant-jenkins/blob/master/apps/jenkins/src/main/fabric8/openshift-cm.yml#L322). With this, only one job is triggered at a time and all other jobs are queued, which avoids build failures due to timeouts.

hrishin commented 6 years ago

Update:

Setting the <containerCap> parameter to 1 limits Jenkins to one slave pod at a time, while <instanceCap> limits the number of slaves to spin up. Setting <containerCap> to 1 will also fix #2384 for the time being.

Will submit the PR to fix this issue.

rohanKanojia commented 6 years ago

So I've been testing the <containerCap> change in Jenkins; this is what I did:

  1. Log in to the OpenShift console of OSIO.
  2. Change config.xml in the Jenkins pod running in the $USER-jenkins namespace and set the <containerCap> flag to 1.
  3. Scale its replication controller down and up again: oc scale --replicas=1 replicationcontroller jenkins-1
  4. Try to run two quickstarts concurrently.

Here are my test results:

Build 1: First build :heavy_check_mark:, Second build :heavy_check_mark:
Build 2: First build :x: (Maximum threads reached error), Second build :heavy_check_mark:
Build 3: First build :x: (Service account revoked error), Second build :x: (Socket timeout exception)
Build 4: First build :x: (Socket timeout exception), Second build :x: (Socket timeout exception)
--- After resetting environment ---
Build 5: First build :x: (Socket timeout exception), Second build :x: (Socket timeout exception)
Build 6: First build :x: (Socket timeout exception), Second build :x: (Socket timeout exception)

I feel that setting the <containerCap> flag does not fix the build failures.

ldimaggi commented 6 years ago

So - our conclusion is - what? That we should recommend running one pipeline/build at a time?

rohanKanojia commented 6 years ago

I'm going to try <instanceCap> as advised by @rupalibehera and see how it goes:

                     <privileged>false</privileged>
                      <alwaysPullImage>false</alwaysPullImage>
                      <remoteFs>/home/jenkins/workspace</remoteFs>
                      <instanceCap>3</instanceCap>
                      <slaveConnectTimeout>0</slaveConnectTimeout>
rohanKanojia commented 6 years ago

Sadly, setting <instanceCap> doesn't seem to solve this issue either. I also tried setting <containerCap> to 1 via the Jenkins console, but that doesn't seem to help :confused: . Will provide an update if I find some other solution. [screenshot: 2018-04-27 20-42-33]

hrishin commented 6 years ago

Update:

After evaluating the overall options to limit concurrent builds, it's better to handle this at the Jenkins level. To make it work, we need to fix the behavior of the containerCap parameter in the Kubernetes Jenkins plugin, which is broken at the moment.

cc: @rohanKanojia @maxandersen

hrishin commented 6 years ago

An issue for the Kubernetes plugin has been reported and a PR is in the queue.

hrishin commented 6 years ago

Update:

Filed one more issue for the Kubernetes Jenkins plugin: https://issues.jenkins-ci.org/browse/JENKINS-51286. Sent a PR for it: https://github.com/jenkinsci/kubernetes-plugin/pull/322.

With this PR, we are able to restrict builds to one at a time.

hrishin commented 6 years ago

Wondering if there is an issue on the OSO side as well: https://github.com/openshift/origin/issues/15143

Update: this suspicion looks invalid: https://github.com/openshift/origin/issues/15143#issuecomment-389520199

maxandersen commented 6 years ago

And would using https://github.com/openshiftio/openshift.io/issues/3450 not be a viable solution?

hrishin commented 6 years ago

And would using #3450 not be a viable solution?

Not sure.

@jaseemabid has tried to grok this, AFAIK. It turns out that simply reducing the compute/memory resources does not work because of the current state of Jenkins. We can probably give it one more shot to see if Jenkins can run on optimal resources.

From https://github.com/openshiftio/openshift.io/issues/3450#issuecomment-389506105:

As discussed in this issue and on the PR, would we not be better off keeping memory as-is and reducing the pod count? This would allow one build to start with enough memory, and then other builds would be queued.

WIP https://github.com/openshiftio/openshift.io/issues/2729#issuecomment-388767697

hrishin commented 6 years ago

Update:

Tested the Kubernetes plugin PR on http://openshift.io. It restricts builds to one at a time; other build requests get queued.

[screenshot]

hrishin commented 6 years ago

Update:

The upstream PR https://github.com/jenkinsci/kubernetes-plugin/pull/322 is still under review (no response from the upstream maintainers yet). For the time being, shall we fork the upstream project into the fabric8 GitHub org and ship the plugin version from the fork with Jenkins? (Though I'm against maintaining an upstream fork.)

cc: @krishnapaparaju @maxandersen @pradeepto @rupalibehera

krishnapaparaju commented 6 years ago

@hrishin please go ahead with the fork, and let's target promoting the changes to OSIO production by the end of this week.

hrishin commented 6 years ago

Update:

  1. Sent a PR to the forked repo: https://github.com/fabric8io/kubernetes-plugin/pull/6
  2. Sent PR(s) to update the Jenkins images

cc : @rupalibehera @krishnapaparaju

hrishin commented 6 years ago

Update:

There is still one more hole in the Kubernetes plugin behavior: it sometimes does not respect the container cap. The sequence of events could be like this:

  1. Jenkins wants to create a slave pod for the first build request. Container capacity is not exceeded.
  2. The pod is being deployed to Kubernetes (it is starting).
  3. Jenkins wants to create another pod for the second build request, so it asks the fabric8 client [1] whether any slave pods are running.
  4. The fabric8 client responds that there are no pods (because the pod from step 2 is still being deployed and is not running yet).
  5. Jenkins creates another pod, effectively exceeding the container cap setting.

We need to fix this hole (a simplified sketch of the race follows below).

[screenshot]

[1] https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesCloud.java#L438
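To make the race concrete, a simplified, illustrative sketch of the check-then-act pattern described above (the class, method, and interface names are made up; this is not the plugin's actual code):

    import java.util.List;

    // Illustrative only: models the race described in steps 1-5 above.
    class CapCheckSketch {
        private final int containerCap = 1;
        private final KubeApi api; // hypothetical thin wrapper over the cluster API

        CapCheckSketch(KubeApi api) {
            this.api = api;
        }

        void provisionSlaveIfAllowed(String buildId) {
            // Steps 3-4: only *running* pods are counted, so a pod that is still
            // starting ("Pending") is invisible to this check.
            List<String> runningSlaves = api.listRunningSlavePods();
            if (runningSlaves.size() < containerCap) {
                // Step 5: two concurrent callers can both pass the check above
                // before either pod reaches the Running phase, exceeding the cap.
                api.createSlavePod("jenkins-slave-" + buildId);
            }
        }

        interface KubeApi {
            List<String> listRunningSlavePods();
            void createSlavePod(String name);
        }
    }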

@krishnapaparaju @rupalibehera

hrishin commented 6 years ago

Update:

After evaluating all the plugins [1][2][3] for throttling concurrent builds, none of them works reliably. Most of the time more jobs than the configured limit get scheduled because of a race condition in the Jenkins scheduler.

Neither the existing throttling plugins nor the Jenkins configuration ('executors limit') works at all.

To achieve this functionality, we have to implement the feature ourselves, either by writing our own plugin or by forking an existing one.

[1] https://wiki.jenkins.io/display/JENKINS/Throttle+Concurrent+Builds+Plugin [2] https://wiki.jenkins.io/display/JENKINS/Build+Blocker+Plugin [3] https://github.com/jenkinsci/kubernetes-plugin

cc: @krishnapaparaju @jaseemabid @maxandersen

jaseemabid commented 6 years ago

We have evaluated the obvious options and none worked; now we are looking at adding a feature to build.now, which we shouldn't. This is not a SEV1, shouldn't be a P0, and is hopefully something we can ignore.

krishnapaparaju commented 6 years ago

@hrishin Please share the job names and the regular expressions that were tried out (Build Blocker plugin).

jaseemabid commented 6 years ago

@krishnapaparaju The job names were hello and world, and the regular expression was .*.

hrishin commented 6 years ago

FYI @sthaha @jaseemabid @krishnapaparaju - the entry point of the build blocker plugin is where it decides to dispatch or block a job for execution (a simplified sketch follows below).

@lordofthejars could you please help us understand the plugin and Jenkins behaviour here?
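For anyone digging into that entry point: a queue plugin typically hooks into Jenkins' QueueTaskDispatcher extension point. A minimal, illustrative sketch of a "one build at a time" policy (not the build blocker plugin's actual implementation; the policy is hard-coded here):

    import hudson.Extension;
    import hudson.model.Computer;
    import hudson.model.Queue;
    import hudson.model.queue.CauseOfBlockage;
    import hudson.model.queue.QueueTaskDispatcher;
    import jenkins.model.Jenkins;

    // Sketch of a dispatcher that keeps a queued item blocked while any build runs.
    @Extension
    public class SingleBuildDispatcher extends QueueTaskDispatcher {

        @Override
        public CauseOfBlockage canRun(Queue.Item item) {
            // Count executors that are currently busy on the master and all agents.
            int busy = 0;
            for (Computer computer : Jenkins.getInstance().getComputers()) {
                busy += computer.countBusy();
            }
            if (busy > 0) {
                // Returning a non-null cause keeps the item in the queue.
                return new CauseOfBlockage() {
                    @Override
                    public String getShortDescription() {
                        return "Another build is already running; this build stays queued.";
                    }
                };
            }
            return null; // null means the item is free to run
        }
    }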

krishnapaparaju commented 6 years ago

Try this regular expression; it works:

' .*.* '

lordofthejars commented 6 years ago

@hrishin I have no idea what else to add. What I see is that if a build is running and a new agent is needed, that agent is started; but it takes some time to come up, so if another build enters the system before the agent is started, another agent is started as well. As you said, it seems to be a race condition. I was never involved in any master/agent communication work, so the only thing that comes to my mind is to modify the plugin (probably the k8s plugin, if we have the knowledge) to do the following:

When a request to create an agent comes in, lock, create a file, and unlock. Any other request then checks for this file; if it is present, it waits until it disappears and only then creates a new agent. The problem is that if for any reason the file is not deleted after the first build finishes (maybe because of a failure), you end up in a deadlock. So I don't like this solution much because of the deadlock risk, but nothing else comes to mind right now.

Another option might be to lock the process until the agent is up and running, so that any other request is blocked until then. It is a different way of doing it, but again we really need to take care about deadlocks.
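To make the deadlock concern in the first idea concrete, a rough sketch of a file-based gate around agent creation (all names and paths are made up; this only illustrates the failure mode, not a proposed implementation):

    import java.io.IOException;
    import java.nio.file.FileAlreadyExistsException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Illustrative only: a file on disk gates agent provisioning.
    class AgentProvisionGate {
        private static final Path GATE = Paths.get("/tmp/agent-provision.lock");

        // Called when a new agent is requested: blocks until no other request
        // holds the gate, then claims it.
        void acquire() throws IOException, InterruptedException {
            while (true) {
                try {
                    Files.createFile(GATE); // atomic: fails if the file already exists
                    return;
                } catch (FileAlreadyExistsException inFlight) {
                    Thread.sleep(1000);     // another request holds the gate; wait
                }
            }
        }

        // Meant to be called once the build that used the agent has finished.
        // If this call is ever missed (a crash, an error path), the file stays
        // behind and every later acquire() waits forever -- the deadlock noted above.
        void release() throws IOException {
            Files.deleteIfExists(GATE);
        }
    }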

jaseemabid commented 6 years ago

@krishnapaparaju Could you explain why .* won't work but .*.* does? We are shipping this to prod and we should know why things work or don't. Could you send a PR to build-blocker-plugin if it's really an issue with the plugin?

hrishin commented 6 years ago

Update:

Using the build blocker plugin, Jenkins is able to throttle builds.

[screenshot]

Final PRs for this issue:
  1. Changed build blocker plugin code for OSIO: https://github.com/fabric8-jenkins/build-blocker-plugin/pull/1
  2. Fabric8 Jenkins image

hrishin commented 6 years ago

accidentally closed by PR merge

rupalibehera commented 6 years ago

Raised a PR (https://github.com/openshiftio/saas-openshiftio/pull/904) to get this into production, and raised a follow-up issue: https://github.com/openshiftio/openshift.io/issues/3752

hrishin commented 6 years ago

Auto closed by PR merge

rupalibehera commented 6 years ago

Closing this; it should be in prod, tracked by https://github.com/openshiftio/openshift.io/issues/3752.

hrishin commented 6 years ago

Since this fix is in production, only one pipeline can run at a time. This should make sure random SocketTimeoutExceptions are no longer thrown.

@krishnapaparaju @pradeepto

krishnapaparaju commented 6 years ago

great. thanks @hrishin