microsoft / mssql-docker

Official Microsoft repository for SQL Server in Docker resources
MIT License
1.74k stars 759 forks source link

Recovery mode every time a new container starts #303

Open robertmiles3 opened 6 years ago

robertmiles3 commented 6 years ago

I'm trying to build a pre-seeded image from a .bak file. I have a script to startup a mssql-server-linux container, add the .bak file, restore it, and then docker commit that to a new image. However, whenever I startup a container from that new image, it is in recovery mode for both tempdb as well as my DB I restored. The recovery takes a good 6-8min which is a pain. As a test, I verified that all was well with the temp container before committing as a new image. I restarted it several times, and it never does a recovery mode. However, as soon as I save as a new image and then start a container off that new image it has to go through recovery.

Is something happening when I create a new image that forces them into recovery the next time it starts up?

robertmiles3 commented 6 years ago

For reference, here is the section of the startup log where it takes almost 2min to get tempdb up and running...

2018-05-08 13:50:11.34 spid11s     Starting up database 'tempdb'.
sqltemp_1  | 
DBSTARTUP (tempdb, 2): FCBHeaderReadTime took 6538 ms
sqltemp_1  | 
DBSTARTUP (tempdb, 2): FileMgrPreRecoveryTime took 114135 ms
sqltemp_1  | 
DBSTARTUP (tempdb, 2): SysFiles1ScanTime took 107593 ms
sqltemp_1  | 
2018-05-08 13:52:05.79 spid11s     The tempdb database has 1 data file(s).
robertmiles3 commented 6 years ago

Does anyone have any thoughts on this? It's been 5 weeks without a response.

twright-msft commented 6 years ago

Just wondering if you could do this a different way. If the point is to create an image that has a .bak file in it and that's the only difference between the base mssql-server-linux image and your image then I dont think you need to start up a container to put the .bak file in it. I'd suggest just creating a new Dockerfile which is as simple as this:

FROM microsoft/mssql-server-linux:latest (or whatever tag you want) COPY <path on your local file system to the .bak file> /var/opt/mssql/backups (or whatever location you want) CMD [ "/opt/mssql/bin/sqlservr" ]

Then just docker build that Dockerfile and see how the start up is then when you create a new container. It will have to go through the first time start routine but it shouldnt have to go through recovery. It shouldnt be any different than starting a container using mssql-server-linux.

robertmiles3 commented 6 years ago

Thanks for the reply. But, the point is to create a container with that .bak file already restored inside the image (not just housing the .bak file). That way upon container create/startup the DB is already restored and ready to go.

twright-msft commented 6 years ago

OK, here's another idea. No idea if this would work since I havent tried it, but it's worth a shot.
1) Create a new Dockerfile based on mssql-server-linux, copy in the .bak, etc. as described above but change the CMD to use an entrypoint script instead of directly starting sqlservr. This will make sqlservr not be PID 1 2) Start the container, restore the backup, etc. Then gracefully stop the sqlservr process via docker exec -it. Since sqlservr isnt PID 1 the container will keep running. Give sqlservr the time it needs to gracefully shut down. Once you can no longer see sqlservr in top then you can stop the container and docker commit it.

robertmiles3 commented 6 years ago

I'll try it and see. Thanks for the tip!

robertmiles3 commented 6 years ago

Well, I'm stuck on this for now because of (I think) a Docker bug. I'm using Docker edge (to get raw file perf), but every time I try to docker commit I get a "unexpected EOF". (I have plenty of docker space). I've filed an issue, so we'll see where that gets me.

robrich commented 6 years ago

I'm unable to reproduce the issue. I've built a Dockerfile that builds a "dev safe" copy of prod's backup as an image devs can pull each morning. In the corresponding init SQL I restore a backup, and change a few things. I very specifically set recovery mode to simple to ensure when I commit the image it's transactionally sound. (My assumption is it won't finish the SQL command until the data is written to the mdf and flushed to disk.) I grant my database is significantly smaller than yours. But both when building the image (starting the DB for the first time) and running the container (restarting the db) I don't see any mention of FileMgrPreRecoveryTime or SysFiles1ScanTime in the console output. I do see a bunch of ## transactions rolled forward in database [yy] and ## transactions rolled backward in database [yy] for each database (user and system) as the container starts up.

Jejuni commented 4 years ago

We're having the same problem right now. Out database is around 30GB, .bak file 3GB. Out temp DB, however, starts immediately. It's only the main db that's causing the problem.

We're putting a .bak file into the container, performing the restore and then committing the container to an image. Once we create a new container from the image and start it, the database is stuck in "In Recovery" for a couple of minutes.

Executing the same steps but using a volume for /var/opt/mssql and then starting a fresh sqlserver container with that volume attached shows no sign of being In Recovery at all. Sadly this is not an option for us as we need the data in the resulting image available immediately and stored with the container.

We're using mcr.microsoft.com/mssql/server:2019-latest.

Output from after In Recovery completes:

DBSTARTUP (AISTO_XT_DEV, 5): FCBOpenTime took 114128 ms
DBSTARTUP (AISTO_XT_DEV, 5): FileMgrPreRecoveryTime took 244 ms
robrich commented 4 years ago

@Jejuni Do you have a GitHub repo that reproduces this problem you can share?

Jejuni commented 4 years ago

@Jejuni Do you have a GitHub repo that reproduces this problem you can share?

I've created a repo here: https://github.com/Jejuni/docker-db-recovery

It contains a Powershell script that does the following:

The script needs:

This script is a heavily modified version of the script we're currently using for our backup/restore process. The resulting image is called onpremise_data.

To repro the issue, after running the script:

Please keep in mind:

robrich commented 4 years ago

This is a great repro setup. My initial guess before I run it is the in-memory table is biting us or the change to recovery mode simple isn't a blocking operation and hasn't finished when we commit. Now to fire it up...

robrich commented 4 years ago

Wow, that's weird. I tried it in my rig too, and it also said In Recovery for a good while.

xied75 commented 4 years ago

Not trying to upset anyone, do any of you hold a SQL DBA hat? If you shut a SQL Server in a Non-Proper manner, it will do the recovery next time, linear time according to the data file size, since the db engine needs to make sure what to roll forward and what to roll back, by scanning the log. If you want to avoid this, you need to tell the dear SQL engine, e.g. with a CHECKPOINT? I don't see any of you mention anything in your description of the problem or the re-pro steps. That ps1 file doesn't even bother to tell SQL to shutdown?!

SQL Server is a huge product and millions of professionals eat on it. Making it work within a container doesn't remove all the necessary knowledge needed to make it work properly. If you ignore the basics then later on you might be facing even more trouble (imagine encounter something weird well-into production for 6 months?)

DevOps are meant to be the glue, not to replace other roles, at least get a real DBA into the discussion?

4am of my time-zone, so my wording lack the general manner one should have, sorry...

Jejuni commented 4 years ago

I don't consider myself an authority to on sql at all, but here's my 2 cents

xied75 commented 4 years ago

(p.s. I totally understand you are doing this for integration test, I did something similar 5 years ago without docker).

Jejuni commented 4 years ago

After some deliberation I figured out what the problem is. In hindsight it's pretty obvious, but hindsight's 20-20.

The culprit is the union filesystem, in particular overlay's copy-on-write implementation. Data gets put in the image in the build process, in our case that's the database. Then, on first modification to a file, that file is copied from the lower directory to the upper directory of the overlay filesystem and persisted there with its modification. So, on sqlservr startup, the database files get modified by sqlservr in the merged directory and thus have to be completely copied over. In our case that's ~40gb, explaining the 120s delay. To sqlservr it just looks like the files are taking forever to access, thus the In Recovery.

Also explains why volumes work, why this doesn't happen on container stop&start and why it scales with db size.

BlairMcc commented 3 years ago

Hey @Jejuni, how do you manage when you want to create a new container with a fresh data volume? I've changed my docker run from docker run -p 1402:1433 --name myhugedbimage01 -d myhugedbimage to docker run -p 1403:1433 -v sql02:/var/opt/mssql --name myhugedbimage02 -d myhugedbimage

But the only change noticed is that instead of a container running relatively quickly then taking 10mins~ in recovery, it's taking 10mins~ to fill out the data volume and create the container with a working database. So in the end they're both much the same. Is there a better way than this..?

stupied4ever commented 2 years ago

I am with the exact same problem, you could find out how to fix it?

robrich commented 2 years ago

@stupied4ever: check out the comment a few lines above: https://github.com/microsoft/mssql-docker/issues/303#issuecomment-702229652. It's the time it takes to copy the db file(s) to the read/write layer. The simplest "fix" is to shrink your database before embedding in the container; granted this isn't always possible. The other "fix" is to get comfortable with the wait by perhaps running this container and grabbing a cup of tea or running this container a few minutes before you need it; granted this is usually less desirable.

stupied4ever commented 2 years ago

@robrich Thx for the attention. I Already did the shrink and it did the things faster (as the database is much smaller), but i still have the "Waiting until recovery". If someone find some alternative, please post here.

jimmywall commented 1 year ago

@twright-msft I made sure sql server wasn't PID 1, and manually started sql server, and I still get the same behavior. Looks like @Jejuni is correct. I am going to to move my data into a separate container than sql server and use that data container as a volume mapped to my sql server container. When i need a new container, i'll just have to create both at the same time.