microsoft / mssql-docker

Official Microsoft repository for SQL Server in Docker resources
MIT License

Graceful container stopping #171

Open jest opened 7 years ago

jest commented 7 years ago

On Linux, using Docker CLI it is not possible to gracefully stop the container. Running docker stop causes the daemon to send TERM signal to the container process, which is ignored and only KILL signal causes the server to stop. However, this is abrupt and the next time the container is started it rolls forward logs.

However, I noticed that the main container process forks additional sqlservr processes and if I send TERM signal to one of those processes, the whole container shuts down gracefully immediately and no log replaying is performed on the next startup.

It looks like a problem with process and signal management.
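One way to tell the two outcomes apart is the container's exit code after docker stop. The sketch below is only an illustration: it relies on the common convention that an exit code above 128 means "terminated by signal (code - 128)", and the container name db in the usage comment is a placeholder, not something from this thread.

```shell
#!/bin/sh
# Translate a container exit code into a short description.
# Convention assumed here: a code above 128 means
# "terminated by signal (code - 128)".
describe_exit_code() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "exited cleanly"
    elif [ "$code" -eq 137 ]; then
        # 128 + 9: docker stop gave up waiting and sent KILL
        echo "killed by SIGKILL"
    elif [ "$code" -eq 143 ]; then
        # 128 + 15: the process died from TERM without handling it
        echo "terminated by SIGTERM"
    elif [ "$code" -gt 128 ]; then
        echo "terminated by signal $((code - 128))"
    else
        echo "exited with status $code"
    fi
}

# Hypothetical usage; "db" is a placeholder container name:
#   docker stop -t 30 db
#   describe_exit_code "$(docker inspect -f '{{ .State.ExitCode }}' db)"
```

If the result is "killed by SIGKILL" every time, the TERM signal most likely never reached a process that acts on it.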

sokomishalov commented 7 years ago

+1

Glideh commented 7 years ago

I got stuck too. Here is what I did:

version: '3'
services:
    db:
        image: microsoft/mssql-server-linux
        environment:
            ACCEPT_EULA: Y
            SA_PASSWORD: "xyz"

This runs successfully. I shared the port on my host

#...
        ports:
            - "1433:1433"

And updated the container:

$ docker-compose up -d db
ERROR: for my_db_1  UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=70)
ERROR: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).

Now it can be neither stopped nor killed:

$ docker-compose stop db
# Same timeout error

$ docker-compose kill db
# Same timeout error

$ docker-compose kill -s TERM db
# Gets stuck

I can't even stop the docker service anymore (I had to restart the computer). Current version:

Client:
 Version:      17.05.0-ce
 API version:  1.29
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:04:27 2017
 OS/Arch:      linux/amd64

Glideh commented 7 years ago

It worked after a computer restart but happened again after a docker-compose up refresh. A running mssql container doesn't seem to like being updated with up.

Symbianx commented 7 years ago

I've been having this issue too, the container freezes and a computer restart is the only way to stop it.

edsonmedina commented 6 years ago

+1 Same here

helenwilliamson commented 6 years ago

+1 Same here.

akas84bg commented 6 years ago

+1 same here

twright-msft commented 6 years ago

We think what's going on here is that SQL Server is gracefully shutting down so that when it starts back up there is no recovery time needed. We could change it to a fast shut down on CTRL+C but that could mean a longer period of time for recovery on start up depending on what was going on in the database(s) prior to the CTRL+C.

I used to run into this issue too, but then I stopped running docker-compose up and docker run interactively. In other words use -d on docker run and docker-compose up so that the containers are always started in the background and you can get your terminal prompt back. Then you can stop your containers with docker stop or docker-compose down.

Also, when you press CTRL+C in an interactive docker-compose up, you can hit CTRL+C again to force an immediate stop. I don't recommend that in anything except a dev/test environment where you just don't care.

A few questions to better help us understand how to improve here:

jest commented 6 years ago

Sorry, but this is not a graceful shutdown. No matter how much time is given with docker stop -t time, SQL Server never stops within this time period.

It has nothing to do with CTRL+C, I never used it. My containers are started with docker-compose up -d and stopped with docker-compose stop -t time.

Please also note what I wrote about sending the TERM signal to one of the forked processes: it leads to a gracefully stopped container within 1 second! If I were to guess, I'd start by checking signal handling in the parent process.

Glideh commented 6 years ago

I didn't use Ctrl+C either (the first one is supposed to shut down gracefully anyway). I'm also using docker-compose stop (after up -d) with really nothing big going on in the database (tested with only one database and 4 empty tables, actually).

@twright-msft I used to start/stop many different services (like nginx, apache, php, python, mysql, postgresql, redis, memcached, etc.), and they always stop gracefully within 4 seconds at most. I might still prefer fast startup with slow shutdown, but the slow shutdown should finish within 4 seconds. Anyway, as @jest says, sometimes it never stops; I already tried leaving the graceful stop running for at least 30 minutes.

edsonmedina commented 6 years ago

I've seen both cases, both intermittently. Sometimes they work, sometimes they don't.

Using CTRL+C just hangs forever (I've waited more than 10 minutes) and the container becomes unresponsive (can't exec into it).

Using docker-compose stop/down instead returns me a timeout. The container never dies.

This makes it useless until it's fixed.

jest commented 6 years ago

OK, I dug in a bit and here's a solution.

The problem is this line in Dockerfile

CMD /opt/mssql/bin/sqlservr

According to the Docker docs, this "shell form" causes the Docker daemon to run the container with the command:

/bin/sh -c /opt/mssql/bin/sqlservr

This makes the shell (/bin/sh), not sqlservr, the "PID 1" process and causes a lot of problems, including with signal handling and child reaping. The issue on tini describes it pretty well.

The solution is to modify the Dockerfile and either make sqlservr "PID 1" itself using the other CMD syntax:

CMD ["/opt/mssql/bin/sqlservr"]

or better yet, to use some other "process manager", like the mentioned tini:

# with tini next to Dockerfile...
COPY tini /
RUN chmod +x /tini
ENTRYPOINT ["/tini", "--"]
CMD ["/opt/mssql/bin/sqlservr"]

As a workaround until new images are available, use command: [ "/opt/mssql/bin/sqlservr" ] in your docker-compose.yml to override the image's CMD.
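To audit a Dockerfile for the shell-form problem described above, a rough check could look like the sketch below. The function name instruction_form is made up for illustration, and the heuristic (anything whose arguments start with "[" is treated as exec form) is a simplification: Docker actually requires a valid JSON array, and the sketch only recognizes upper-case CMD and ENTRYPOINT.

```shell
#!/bin/sh
# Classify a Dockerfile CMD/ENTRYPOINT line as "exec" form (JSON array)
# or "shell" form (run through /bin/sh -c).
# Heuristic only: arguments starting with "[" are assumed to be exec form.
instruction_form() {
    # Strip the instruction keyword and the whitespace around it.
    args=$(printf '%s\n' "$1" |
        sed -E 's/^[[:space:]]*(CMD|ENTRYPOINT)[[:space:]]+//')
    case "$args" in
        \[*) echo "exec" ;;
        *)   echo "shell" ;;
    esac
}

# Hypothetical usage against a Dockerfile in the current directory:
#   grep -E '^(CMD|ENTRYPOINT)' Dockerfile | while IFS= read -r line; do
#       printf '%s -> %s\n' "$line" "$(instruction_form "$line")"
#   done
```

Any line reported as "shell" is a candidate for rewriting in the array syntax shown above.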

jest commented 6 years ago

@twright-msft Any idea how this will be solved? Do you need a PR?

woylie commented 6 years ago

Any news on this? We're using the container for testing in a CI pipeline and have to restart our server practically every day because of this. Neither overriding the command with CMD ["/opt/mssql/bin/sqlservr"] nor adding tini as suggested helps with the problem.

twright-msft commented 6 years ago

We're likely going to switch to CMD ["/opt/mssql/bin/sqlservr"] in a near-future release. We'll see if that helps fix it for at least some people.

woylie commented 6 years ago

Well, for us it didn't. Any more ideas?

jest commented 6 years ago

Probably a different issue?

simdevmon commented 6 years ago

The workaround command: [ "/opt/mssql/bin/sqlservr" ] did not work for me either.

simdevmon commented 6 years ago

I use the following workaround in our CI environment:

jest commented 6 years ago

Did you destroy the old containers and create new ones with the command: workaround? Once created, a container can't change the command it executes. What does docker inspect -f '{{ .Config.Cmd }}' <container-name> say?
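The output of that inspect command can also be checked in a script. The sketch below assumes that a shell-form CMD shows up as a value beginning with "[/bin/sh -c ", as the /bin/sh -c command line quoted earlier suggests; the function name is_shell_wrapped is invented for this example.

```shell
#!/bin/sh
# Succeeds (exit status 0) if the inspected Cmd value starts with
# "/bin/sh -c", i.e. the main process is launched through a shell.
# Assumption: the argument is the text printed by
#   docker inspect -f '{{ .Config.Cmd }}' <container-name>
is_shell_wrapped() {
    case "$1" in
        "[/bin/sh -c "*) return 0 ;;
        *)               return 1 ;;
    esac
}

# Hypothetical usage; "db" is a placeholder container name:
#   if is_shell_wrapped "$(docker inspect -f '{{ .Config.Cmd }}' db)"; then
#       echo "still using the shell form; recreate the container"
#   fi
```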

simdevmon commented 6 years ago

@jest The output is [/opt/mssql/bin/sqlservr]

And yes, since it is only a CI environment, I destroy everything completely on each build:

docker exec <mssql-container-name> kill 1 || :
docker-compose stop
docker-compose rm -f
docker-compose build
docker-compose up -d

kevin-brown commented 6 years ago

We started running into issues with the MS SQL Server containers hanging around on our Jenkins instance after builds completed (or didn't). It eventually got bad enough that the servers would lock up and de-provisioning them would take up to 30 minutes.

The solution for killing process 1 seems to solve the issue for us: https://github.com/Microsoft/mssql-docker/issues/171#issuecomment-362193062

kevin-brown commented 6 years ago

Update: overriding the command within a Dockerfile, or through specifying it when running, did not solve the problem of zombie processes and MS SQL Server.

We are seeing a problem very similar to #181, which has the same behaviour as the issue described in this ticket, after using a SQL Server instance (CU2, CU4, GA tested) for a short period of time and then trying to shut it down. I'd put the odds of it hanging at 50/50 every time we spin up a new container. Sending the TERM or KILL signals to the container or sqlservr processes does not solve the issue for us; the processes refuse to die unless the system is de-provisioned.

Note that we are not using Docker Compose on our build servers, and we are seeing this issue when running the containers through the Docker engine directly.

jest commented 6 years ago

@kevin-brown So this issue is not the one you are experiencing. This issue is about wrong image's CMD construction, where signals are not propagated to child processes.

Sending signals directly to child processes is the same as correcting CMD in Dockerfile.

hdimitriou commented 6 years ago

@kevin-brown we are facing probably the same issue and we use tini but no luck. Do you believe that "-g" option on tini to kill the whole process group could make a difference? We are going to try it

kevin-brown commented 6 years ago

> So this issue is not the one you are experiencing. This issue is about wrong image's CMD construction, where signals are not propagated to child processes.

We're seeing signs of the signals not propagating when we send them to the Docker images, and attempt to send them directly to the process. The behaviour we're seeing in #181 is making it really difficult to verify the signals are making it to sqlservr because if it hangs for too long it completely locks up Docker and the host system.

I'm willing to accept that there are two different issues at play in #171 and #181, but the fact that both of them deal with zombie processes forming within the container gives me hope that there may be a common solution to both issues.

> we are facing probably the same issue and we use tini but no luck. Do you believe that "-g" option on tini to kill the whole process group could make a difference? We are going to try it

We have not yet tried using tini to work around this issue, but if you're not currently killing the right process (but instead are killing a parent process) that might work.

jest commented 6 years ago

Anyone having problems with CTRL+C that are not solved by correcting ENTRYPOINT (as described in comment https://github.com/Microsoft/mssql-docker/issues/171#issuecomment-346133376), please test 2017-CU5. According to https://support.microsoft.com/en-us/help/4093805/fix-can-t-stop-sql-server-linux-docker-container-via-docker-stop it's solved there.

jschaefer-pott commented 6 years ago

@jest CU5 seems to fix this for me. But with CU6 the same problem occurs again.

kichalla commented 5 years ago

I am using CU12 and I am seeing the same issue

badeball commented 5 years ago

I've faced some issues with this as well. I have been using a version which does not spawn mssql inside a shell (i.e. a sufficiently recent version that contains addd8374e7ff488a916e4ed1ec634b364b649209), but I still can't shut down the container. docker kill hangs and I can't even restart the daemon; I can only restart the machine.

The logs indicate that a signal was received, but it apparently entered some weird state afterwards.

[...]
2019-06-18 09:05:39.68 spid6s      Always On: The availability replica manager is going offline because SQL Server is shutting down. This is an informational message only. No user action is required.
2019-06-18 09:05:39.68 spid6s      SQL Server is terminating in response to a 'stop' request from Service Control Manager. This is an informational message only. No user action is required.
2019-06-18 09:05:39.78 spid22s     Service Broker manager has shut down.
2019-06-18 09:05:43.43 Logon       Error: 18451, Severity: 14, State: 1.
2019-06-18 09:05:43.43 Logon       Login failed for user 'NT AUTHORITY\SYSTEM'. Only administrators may connect at this time. [CLIENT: 127.0.0.1]
2019-06-18 09:05:48.61 Logon       Error: 18451, Severity: 14, State: 1.
2019-06-18 09:05:48.61 Logon       Login failed for user 'NT AUTHORITY\SYSTEM'. Only administrators may connect at this time. [CLIENT: 127.0.0.1]
2019-06-18 09:10:53.49 Logon       Error: 18451, Severity: 14, State: 1.
2019-06-18 09:10:53.49 Logon       Login failed for user 'NT AUTHORITY\SYSTEM'. Only administrators may connect at this time. [CLIENT: 127.0.0.1]

A normal shutdown, by contrast, looks like the following:

[...]
2019-06-18 10:33:05.68 spid6s      Always On: The availability replica manager is going offline because SQL Server is shutting down. This is an informational message only. No user action is required.
2019-06-18 10:33:05.68 spid6s      SQL Server is terminating in response to a 'stop' request from Service Control Manager. This is an informational message only. No user action is required.
2019-06-18 10:33:06.11 spid23s     Service Broker manager has shut down.
2019-06-18 10:33:11.29 spid6s      SQL Trace was stopped due to server shutdown. Trace ID = '1'. This is an informational message only; no user action is required.

Danielku15 commented 1 year ago

Using mcr.microsoft.com/mssql/server:2022-latest and I seem to face the same issue. SQL Server gets stuck on stopping the container. When I wait long enough (sometimes minutes) I get some timeout errors on the stop and then all of a sudden the container is also stopped.

ifaniqbal commented 1 year ago

So, after diving into the Tini and PID 1 issues, I gotta say, I'm blown away by how easy the latest solution is. You've got two options to choose from:

  1. If you are using docker run, add --init.
     --init : Run an init inside the container that forwards signals and reaps processes
  2. If you are using docker compose, simply add init: true to the service definition under the services section of your compose file. For example:

    services:
      db:
        image: mcr.microsoft.com/azure-sql-edge:latest
        init: true

Just a heads up, when you use --init or init: true, an extra process is launched within the container that acts as the PID 1 process. This process takes care of managing the child processes inside the container and making sure that signals are forwarded correctly. This helps to ensure that the container shuts down gracefully and all the child processes are cleaned up properly.
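To confirm that the init process is really in place, one option is to look at the name of PID 1 inside the running container. The sketch below is an assumption-laden illustration: it expects the init added by --init to appear as docker-init (worth verifying on your Docker version), also accepts tini for images that bundle it themselves, and uses a made-up function name and a placeholder container name db.

```shell
#!/bin/sh
# Succeeds (exit status 0) if the given process name looks like an init shim.
# Assumptions: Docker's bundled init shows up as "docker-init";
# "tini" covers images that ship tini as their own entrypoint.
is_init_process() {
    case "$1" in
        docker-init|tini) return 0 ;;
        *)                return 1 ;;
    esac
}

# Hypothetical usage; "db" is a placeholder container name:
#   if is_init_process "$(docker exec db cat /proc/1/comm)"; then
#       echo "PID 1 is an init; signals should be forwarded"
#   else
#       echo "PID 1 is not an init; check --init / init: true"
#   fi
```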