harness / gitness

Gitness is an Open Source developer platform with Source Control management, Continuous Integration and Continuous Delivery.
https://gitness.com
Apache License 2.0

Builds occasionally stuck in the "Building" state with Docker 1.12 #1817

Closed: AlekseyMartynov closed this issue 7 years ago

AlekseyMartynov commented 7 years ago

Continuing the discussion from https://gitter.im/drone/drone?at=5803905e891a53016314d30d

uname -a

Linux drone 4.4.0-31-generic #50-Ubuntu SMP Wed Jul 13 00:07:12 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

docker version

Client:
 Version:      1.12.2
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   bb80604
 Built:        Tue Oct 11 18:29:41 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.12.2
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   bb80604
 Built:        Tue Oct 11 18:29:41 2016
 OS/Arch:      linux/amd64

Commands used to start the drone server and agent:

docker network create drone-server-net

docker run -d \
    --name=drone-server \
    -e DRONE_SECRET=$DRONE_SECRET \
    -e DRONE_GITHUB=true \
    -e DRONE_GITHUB_CLIENT=$GITHUB_CLIENT \
    -e DRONE_GITHUB_SECRET=$GITHUB_SECRET \
    -e DRONE_ORGS=$GITHUB_ORG \
    -e DRONE_OPEN=true \
    -e DRONE_ADMIN=$GITHUB_ADMIN_USERS \
    -v drone-data:/var/lib/drone \
    --net=drone-server-net \
    -p 80:8000 \
    drone/drone:0.5 server

docker run -d \
    -e DRONE_SECRET=$DRONE_SECRET \
    -e DRONE_SERVER=ws://drone-server:8000/ws/broker \
    -e DRONE_TIMEOUT=15m \
    -e DOCKER_MAX_PROCS=4 \
    -v /var/run/docker.sock:/var/run/docker.sock \
    --net=drone-server-net \
    drone/drone:0.5 agent

YAML:

pipeline:

  build:
    image: node:6.8
    commands:
      - npm i gulp webpack karma --color=always -q

matrix:
  include:
    - axis: 1
    - axis: 2
    - axis: 3
    - axis: 4

When all worker containers have exited, some builds occasionally get stuck in the "orange" state:


(screenshots 1 and 2: build logs showing the commands exited with code 0 while the builds remain stuck in the orange "Building" state)


The containers are not removed, and the ambassador exit code is 137:

docker ps -a

CONTAINER ID        IMAGE                COMMAND                  CREATED             STATUS                            PORTS                  NAMES
740c7d3519af        node:6.8             "/bin/sh -c 'echo $DR"   3 minutes ago       Exited (0) About a minute ago                            drone_dUpqOVZnNW4_1
d5a19a6bed0f        plugins/git:latest   "/bin/drone-git"         3 minutes ago       Exited (0) 3 minutes ago                                 drone_dUpqOVZnNW4_0
931454065d90        busybox:latest       "/bin/sleep 86400"       3 minutes ago       Exited (137) About a minute ago                          drone_ambassador_aa49ZRmHlfA
f8e6526d9611        drone/drone:0.5      "/drone agent"           17 minutes ago      Up 17 minutes                     8000/tcp               drunk_ritchie
2d36a428e71a        drone/drone:0.5      "/drone server"          17 minutes ago      Up 17 minutes                     0.0.0.0:80->8000/tcp   drone-server

The issue does not occur without --color=always (i.e. with no console colors).

bradrydzewski commented 7 years ago

Unfortunately, I'm unable to reproduce this issue using the yaml provided in this issue. What do you see in your drone agent logs? Your server logs? Add DRONE_DEBUG=true to get a bit more detail. What about the docker logs?

Also a 137 exit code perhaps indicates you are running out of memory on your host machine?
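Something like this, based on the agent command from the original report, with only DRONE_DEBUG=true added (a sketch of enabling the debug flag mentioned above, nothing else changed):

docker run -d \
    -e DRONE_SECRET=$DRONE_SECRET \
    -e DRONE_SERVER=ws://drone-server:8000/ws/broker \
    -e DRONE_TIMEOUT=15m \
    -e DOCKER_MAX_PROCS=4 \
    -e DRONE_DEBUG=true \
    -v /var/run/docker.sock:/var/run/docker.sock \
    --net=drone-server-net \
    drone/drone:0.5 agent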

AlekseyMartynov commented 7 years ago

OK, I'll capture the logs and report back.

Also a 137 exit code perhaps indicates you are running out of memory on your host machine?

I checked; this container had OOMKilled=false. As I understand it, ambassador containers are essentially empty and have the lowest chance of being OOM-killed, don't they? The host machine was a 1 GB / 1 CPU VM (similar to the second DigitalOcean droplet size), with swap enabled.
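For reference, the OOM status can be confirmed with docker inspect (using the ambassador container ID from the docker ps output above):

docker inspect --format 'OOMKilled={{.State.OOMKilled}} ExitCode={{.State.ExitCode}}' 931454065d90
# prints: OOMKilled=false ExitCode=137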

bradrydzewski commented 7 years ago

The 137 for the ambassador container is normal. When your build finishes, drone has to tear down the ambassador container by running docker stop. Docker sends a SIGTERM to the ambassador entrypoint (/bin/sleep) and, when it doesn't exit, follows up with a SIGKILL, which results in a 137 exit code. So nothing abnormal there.
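This is easy to see outside of drone, for example (illustrative only, using the same busybox sleep entrypoint as the ambassador):

docker run -d --name sleep-demo busybox /bin/sleep 86400
docker stop sleep-demo
# sleep, running as PID 1, ignores the SIGTERM, so docker follows up with SIGKILL
docker inspect --format '{{.State.ExitCode}}' sleep-demo
# prints: 137 (128 + 9, i.e. SIGKILL)
docker rm sleep-demo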

The fact that the ambassador is being terminated, but the build is not being reported as finished, indicates to me that something else is hanging. Perhaps a call to stop or kill one of the containers? Or the call to fetch logs? Or maybe it is failing to post the logs to the drone server. Tough to say, but maybe the agent or server logs will help.

AlekseyMartynov commented 7 years ago

Here are the logs captured on a clean VM. (DELETED)

In this case, 1 matrix branch completed and 3 hung, eventually failing with the message "ERROR: terminal inactive for 15m0s, build cancelled".

These lines from agent.log seem interesting:

time="2016-10-17T09:52:09Z" level=info msg="Finished build [DELETED]#1.3" 
time="2016-10-17T10:07:10Z" level=error msg="recover stopping the pipeline send on closed channel" 
time="2016-10-17T10:07:10Z" level=error msg="recover stopping the pipeline send on closed channel" 
time="2016-10-17T10:07:11Z" level=info msg="Finished build [DELETED]#1.1" 
time="2016-10-17T10:07:11Z" level=error msg="recover stopping the pipeline send on closed channel" 
time="2016-10-17T10:07:12Z" level=info msg="Finished build [DELETED]#1.4" 
time="2016-10-17T10:07:12Z" level=info msg="Finished build [DELETED]#1.2" 

bradrydzewski commented 7 years ago

terminal inactive for 15m0s, build cancelled indicates that your build just sat there for 15 minutes with nothing being output to the terminal. Can you think of any reason nothing would be output to the console for 15 minutes?

You should also be able to inspect what is happening for the 3 hanging builds. What is happening inside those containers? The sample yaml you provided would get turned into a shell script like this and executed inside the container:

#!/bin/sh

set -e
set -x
npm i gulp webpack karma --color=always -q

So once drone starts the container, drone simply waits for docker to signal that the script has completed. What do the docker logs reveal? Do we see the command running and outputting logs? Do we see the Docker container just hanging with no logs? There are many layers here (the drone level, the docker level, and the OS level), and understanding where the problem is being caused is important to better understand what is happening.
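For example, for one of the stopped step containers from the docker ps output earlier in the thread (the container names below are just the ones shown there):

docker logs drone_dUpqOVZnNW4_1
docker inspect --format 'status={{.State.Status}} exit={{.State.ExitCode}} oom={{.State.OOMKilled}}' drone_dUpqOVZnNW4_1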

AlekseyMartynov commented 7 years ago

indicates that your build just sat there for 15 minutes with nothing being output to the terminal.

No. Above I posted a screenshot (the one with the red frame) demonstrating that the build script exited with code 0, but drone continues to wait. At this stage, the 3 worker containers (git, ambassador and the task itself) are stopped but not removed. When the timeout interval elapses, the containers are removed and the task gets marked as timed out.

Intuitively it feels like some sort of a race, when several matrix sub-tasks complete almost simultaneously. But I don't have any better insight.

bradrydzewski commented 7 years ago

Intuitively it feels like some sort of a race, when several matrix sub-tasks complete almost simultaneously. But I don't have any better insight.

Let me know if there is any information I can offer to help. Since I cannot reproduce the problem, there is little I can do other than provide tips for digging into the code and debugging.

AlekseyMartynov commented 7 years ago

The best thing that comes to my mind is that I could spin up a small testing VM with a testing repo reproducing the problem and give you shell access. Would that be helpful?

bradrydzewski commented 7 years ago

The best thing that comes to my mind is that I could spin up a small testing VM with a testing repo reproducing the problem and give you shell access. Would that be helpful?

I certainly appreciate the offer. I would love to address this issue; however, it is not currently a top priority for me and I'm in the middle of some other tasks. This will get added to my backlog and I will address it once I have completed some higher-priority items for the upcoming 0.5 release candidate.

In the meantime, I've labeled this issue as "help wanted" so that other members of the community can perhaps take a look and debug. If this is critical to your workflow, and something you'd like to see resolved sooner, I highly encourage digging into the code to see if you can identify the issue and submit a patch.

AlekseyMartynov commented 7 years ago

@bradrydzewski

I believe I found the cause. Hanging occurs when there is a significant delay in the middle of printing an ANSI escape sequence.

Here is the minimal example:

#!/bin/bash

printf "\e"
sleep 2

printf "["
sleep 2

printf "0"
sleep 2

printf "m"

Would you please try to reproduce it on your side? Here is the repo I'm using: https://github.com/AlekseyMartynov/drone-issue-1817, with 4 concurrent workers.

bradrydzewski commented 7 years ago

I tried with the following yaml and was unable to reproduce:

pipeline:
  example:
    image: golang:1.6
    commands:
      - printf "\e"
      - sleep 2
      - printf "["
      - sleep 2
      - printf "0"
      - sleep 2
      - printf "m"

matrix:
  include:
    - axis: 1
    - axis: 2
    - axis: 3
    - axis: 4

Can you confirm you are using the latest drone build? I have been running 0.5.0+761 for the past two weeks which was only merged into master yesterday. Perhaps this explains the difference in behavior we are seeing?

AlekseyMartynov commented 7 years ago

Updated the server and agents to 761; still reproduced easily.

It also appears that ANSI codes are not to blame. The following commands are sufficient:

    commands:
      - printf "."
      - sleep 5
      - printf "."

but if there are line feeds, then all builds finish:

    commands:
      - printf ".\n"
      - sleep 5
      - printf ".\n"

Next I will try a different OS on the host (currently on Ubuntu 16.04).

bradrydzewski commented 7 years ago

@AlekseyMartynov there is an internal buffer that uses newlines (\n) to flush the buffer and write the logs. It is true that not printing newlines can delay flushing and therefore delay display. That being said, in order for drone to kill the build due to lack of terminal activity, you would need 15 minutes of console logs with zero newlines printed (i.e. one giant, long line). That would seem like odd behavior, though. Is this what your build is doing?
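The buffering behaviour itself is easy to demonstrate with a plain shell pipeline (a rough sketch, independent of drone): a line-oriented reader only sees output once a newline arrives:

( printf "."; sleep 5; printf ".\n" ) | while IFS= read -r line; do echo "received: $line"; done
# nothing is printed for ~5 seconds, then: received: ..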

AlekseyMartynov commented 7 years ago

What you are saying is that logs are buffered until there's a newline character. That in itself is not a problem. The closest description of what I see is: "starting a command when the previous command hasn't printed a line feed sometimes causes hanging".

I debugged further and determined that the issue is specific to Docker 1.12.x. Builds exit successfully with Docker 1.10.3 and 1.11.2.
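For reference, the daemon version in use can be checked quickly with:

docker version --format '{{.Server.Version}}'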

Are you using the latest Docker too?

appleboy commented 7 years ago

I can confirm this issue with Docker version 1.12.2.

steve-gray commented 7 years ago

@AlekseyMartynov : I'm seeing this on CentOS/RHEL. Docker is 1.12.1, what I'm doing is:

My understanding is that if the small 'exit code' message appears at the bottom of the log, Drone saw the container stop - so the error is unlikely to be in the docker daemon - it's as if something happens inside the agent that causes it to skip a beat handling the completion.

steve-gray commented 7 years ago

Digging into the daemon logs gives me this inside docker:

Maybe there's an exception thrown during logger closure from Docker occasionally and if that's unhandled, you get phantom job timeouts?

Update: Digging into the agent, I see this in the debug output:

It appears when the timeout kicks in. Reading the code in pipeline.go, I suspect the log channel is gone (build/docker/docker.go) but that we're still waiting for a decrement/done() on the waiter.

AlekseyMartynov commented 7 years ago

This may be due to https://github.com/docker/docker/issues/27779. I'm going to try a master build to see if that's the case.

steve-gray commented 7 years ago

I've just tried something else. What I did was:

docker run --rm -v /var/run/docker.sock:/var/run/docker.sock -it golang:1.6 bash
mkdir -p src/github.com/drone
cd src/github.com/drone
git clone https://github.com/drone/drone
cd drone
make deps gen
make build
export DRONE_SECRET=your-secret
export DRONE_SERVER=ws://your-server-http-ep/ws/broker
export DRONE_DEBUG=true
apt-get update
apt-get install vim -y
./release/linux/amd64/drone agent 

And it worked. I'm switching over my build servers to the custom image and it just works. Every time. This smells to me like a golang runtime issue - and all that's 'different' is I'm using a more recent docker image.

My golang image layers are:

Layers": [
                "sha256:f96222d75c5563900bc4dd852179b720a0885de8f7a0619ba0ac76e92542bbc8",
                "sha256:149636c850120e59e6bb79f2fc23ed423030afc73841c221906a147d61da11a9",
                "sha256:33c3a104206aed2ae947e03c48cc011af0a3e5b87e7ba8e7cbc298273a638921",
                "sha256:ad847bbb88ef19bea18ab712c8c7dcc281a0839df136b2b73c5d569ee2660190",
                "sha256:3e92d03948af05fc2988ef21a7804a6cf0f9c7de5819031fcbf413a3b928d4ee",
                "sha256:80e1f6b2c106af6cf8698db937214a85afef2eba9b1c9303147aadeadcd62d8c",
                "sha256:f849e00a336ef8f05885ac1b878ec280f7b74ffe540ae67dae57973e964d3bc5"
            ]

Update 2: After a stretch of working consistently with the new image and refusing to work with the official drone image, it appears to have reverted to form and started hanging again. It may be something more transient, but it's odd that I had it working consistently one way and not the other, and was able to switch between the two states.

steve-gray commented 7 years ago

I concede: it's almost certainly that JSON logging issue you've found. I suspect the success rate might be due to the amount of log scraping going on for various servers, and the other image not having as many eyes watching it.

Stuck waiting for an engine update now :-/

AlekseyMartynov commented 7 years ago

Yep, just checked it. Tasks don't hang with docker-1.13.0-dev.tgz

steve-gray commented 7 years ago

Yeah, seeing the same. Want to close this one off?


AlekseyMartynov commented 7 years ago

Do you guys think it's worth mentioning this problem in docs? The current recommendation is simply:

latest stable version of Docker and officially support Docker 1.N-1 (latest minor release minus one).

AlekSi commented 7 years ago

I can reliably reproduce this issue with Docker 1.12.4. It is fixed by switching to journald logging.
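For anyone hitting this, switching the daemon's default log driver is a small config change; a sketch, assuming a systemd-based host (the default json-file driver is what gets replaced):

cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "log-driver": "journald"
}
EOF
sudo systemctl restart docker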

bradrydzewski commented 7 years ago

Closing this since it seems to be Docker-specific.