Open robertmiles3 opened 6 years ago
For reference, here is the section of the startup log where it takes almost 2min to get tempdb
up and running...
2018-05-08 13:50:11.34 spid11s Starting up database 'tempdb'.
sqltemp_1 |
DBSTARTUP (tempdb, 2): FCBHeaderReadTime took 6538 ms
sqltemp_1 |
DBSTARTUP (tempdb, 2): FileMgrPreRecoveryTime took 114135 ms
sqltemp_1 |
DBSTARTUP (tempdb, 2): SysFiles1ScanTime took 107593 ms
sqltemp_1 |
2018-05-08 13:52:05.79 spid11s The tempdb database has 1 data file(s).
Does anyone have any thoughts on this? It's been 5 weeks without a response.
Just wondering if you could do this a different way. If the point is to create an image that has a .bak file in it and that's the only difference between the base mssql-server-linux image and your image then I dont think you need to start up a container to put the .bak file in it. I'd suggest just creating a new Dockerfile which is as simple as this:
FROM microsoft/mssql-server-linux:latest (or whatever tag you want) COPY <path on your local file system to the .bak file> /var/opt/mssql/backups (or whatever location you want) CMD [ "/opt/mssql/bin/sqlservr" ]
Then just docker build that Dockerfile and see how the start up is then when you create a new container. It will have to go through the first time start routine but it shouldnt have to go through recovery. It shouldnt be any different than starting a container using mssql-server-linux.
Thanks for the reply. But, the point is to create a container with that .bak file already restored inside the image (not just housing the .bak file). That way upon container create/startup the DB is already restored and ready to go.
OK, here's another idea. No idea if this would work since I havent tried it, but it's worth a shot.
1) Create a new Dockerfile based on mssql-server-linux, copy in the .bak, etc. as described above but change the CMD to use an entrypoint script instead of directly starting sqlservr. This will make sqlservr not be PID 1
2) Start the container, restore the backup, etc. Then gracefully stop the sqlservr process via docker exec -it. Since sqlservr isnt PID 1 the container will keep running. Give sqlservr the time it needs to gracefully shut down. Once you can no longer see sqlservr in top then you can stop the container and docker commit it.
I'll try it and see. Thanks for the tip!
Well, I'm stuck on this for now because of (I think) a Docker bug. I'm using Docker edge (to get raw file perf), but every time I try to docker commit
I get a "unexpected EOF". (I have plenty of docker space). I've filed an issue, so we'll see where that gets me.
I'm unable to reproduce the issue. I've built a Dockerfile that builds a "dev safe" copy of prod's backup as an image devs can pull each morning. In the corresponding init SQL I restore a backup, and change a few things. I very specifically set recovery mode to simple to ensure when I commit the image it's transactionally sound. (My assumption is it won't finish the SQL command until the data is written to the mdf and flushed to disk.) I grant my database is significantly smaller than yours. But both when building the image (starting the DB for the first time) and running the container (restarting the db) I don't see any mention of FileMgrPreRecoveryTime
or SysFiles1ScanTime
in the console output. I do see a bunch of ## transactions rolled forward in database [yy]
and ## transactions rolled backward in database [yy]
for each database (user and system) as the container starts up.
We're having the same problem right now. Out database is around 30GB, .bak file 3GB. Out temp DB, however, starts immediately. It's only the main db that's causing the problem.
We're putting a .bak file into the container, performing the restore and then committing the container to an image. Once we create a new container from the image and start it, the database is stuck in "In Recovery" for a couple of minutes.
Executing the same steps but using a volume for /var/opt/mssql
and then starting a fresh sqlserver container with that volume attached shows no sign of being In Recovery
at all. Sadly this is not an option for us as we need the data in the resulting image available immediately and stored with the container.
We're using mcr.microsoft.com/mssql/server:2019-latest
.
Output from after In Recovery
completes:
DBSTARTUP (AISTO_XT_DEV, 5): FCBOpenTime took 114128 ms
DBSTARTUP (AISTO_XT_DEV, 5): FileMgrPreRecoveryTime took 244 ms
@Jejuni Do you have a GitHub repo that reproduces this problem you can share?
@Jejuni Do you have a GitHub repo that reproduces this problem you can share?
I've created a repo here: https://github.com/Jejuni/docker-db-recovery
It contains a Powershell script that does the following:
The script needs:
This script is a heavily modified version of the script we're currently using for our backup/restore process.
The resulting image is called onpremise_data
.
To repro the issue, after running the script:
localhost,7433
with id: sa, password: Pass@worddocker run --name sql_test_with_data -p 7433:1433 onpremise_data
In Recovery
Please keep in mind:
In Recovery
phase will be very short (<3s on i9 9900k, wsl2 and 64GB Ram available)In Recovery
phase is thus around 120 seconds.restore.bak
, changing the RESTORE DATABASE
part of the script to reflect the name of your files and running the script. The In Recovery
time increases with database size.This is a great repro setup. My initial guess before I run it is the in-memory table is biting us or the change to recovery mode simple isn't a blocking operation and hasn't finished when we commit. Now to fire it up...
Wow, that's weird. I tried it in my rig too, and it also said In Recovery
for a good while.
Not trying to upset anyone, do any of you hold a SQL DBA hat? If you shut a SQL Server in a Non-Proper manner, it will do the recovery next time, linear time according to the data file size, since the db engine needs to make sure what to roll forward and what to roll back, by scanning the log. If you want to avoid this, you need to tell the dear SQL engine, e.g. with a CHECKPOINT? I don't see any of you mention anything in your description of the problem or the re-pro steps. That ps1 file doesn't even bother to tell SQL to shutdown?!
SQL Server is a huge product and millions of professionals eat on it. Making it work within a container doesn't remove all the necessary knowledge needed to make it work properly. If you ignore the basics then later on you might be facing even more trouble (imagine encounter something weird well-into production for 6 months?)
DevOps are meant to be the glue, not to replace other roles, at least get a real DBA into the discussion?
4am of my time-zone, so my wording lack the general manner one should have, sorry...
I don't consider myself an authority to on sql at all, but here's my 2 cents
sqlserver process taking a signal to "gracefully shut down" is new to me, I need to dig it out of the BOL. But I doubt it could be the case today (I only passed my exam on SQL 2005). SQL Server is meant to be running forever, tell it to shutdown and it is "it" to decide when, since it might need to commit millions of transactions to the log files, I wonder if the Linux PID 1 is willing to wait that long.
volume is outside the container, dockerd talk to it with direct I/O (out of my memory), when not using a volume, dockerd needs to go pass layered file system and do writing on modification stuff. Maybe (just guessing) SQL engine was treating volume/non-volume with different handling. Back to the classic world a net mapped drive is different compared to local storage.
my point is, you certainly are trying to troubleshooting with your tool box, why not get one more guy to the game so that he can share his tool with you? I'm not trying to make it sounds big, but SQL Server is simply not mysql or Postgre, compare it to Oracle is a better positioning.
one extra fun point: I wonder how many companies in the world is actually rich enough to afford those SQL licenses, if one can spin hundreds up with docker. :)
(p.s. I totally understand you are doing this for integration test, I did something similar 5 years ago without docker).
After some deliberation I figured out what the problem is. In hindsight it's pretty obvious, but hindsight's 20-20.
The culprit is the union filesystem, in particular overlay's copy-on-write implementation.
Data gets put in the image in the build process, in our case that's the database.
Then, on first modification to a file, that file is copied from the lower directory to the upper directory of the overlay filesystem and persisted there with its modification.
So, on sqlservr startup, the database files get modified by sqlservr in the merged directory and thus have to be completely copied over. In our case that's ~40gb, explaining the 120s delay. To sqlservr it just looks like the files are taking forever to access, thus the In Recovery
.
Also explains why volumes work, why this doesn't happen on container stop&start and why it scales with db size.
Hey @Jejuni, how do you manage when you want to create a new container with a fresh data volume?
I've changed my docker run from
docker run -p 1402:1433 --name myhugedbimage01 -d myhugedbimage
to
docker run -p 1403:1433 -v sql02:/var/opt/mssql --name myhugedbimage02 -d myhugedbimage
But the only change noticed is that instead of a container running relatively quickly then taking 10mins~ in recovery, it's taking 10mins~ to fill out the data volume and create the container with a working database. So in the end they're both much the same. Is there a better way than this..?
I am with the exact same problem, you could find out how to fix it?
@stupied4ever: check out the comment a few lines above: https://github.com/microsoft/mssql-docker/issues/303#issuecomment-702229652. It's the time it takes to copy the db file(s) to the read/write layer. The simplest "fix" is to shrink your database before embedding in the container; granted this isn't always possible. The other "fix" is to get comfortable with the wait by perhaps running this container and grabbing a cup of tea or running this container a few minutes before you need it; granted this is usually less desirable.
@robrich Thx for the attention. I Already did the shrink and it did the things faster (as the database is much smaller), but i still have the "Waiting until recovery". If someone find some alternative, please post here.
@twright-msft I made sure sql server wasn't PID 1, and manually started sql server, and I still get the same behavior. Looks like @Jejuni is correct. I am going to to move my data into a separate container than sql server and use that data container as a volume mapped to my sql server container. When i need a new container, i'll just have to create both at the same time.
I'm trying to build a pre-seeded image from a .bak file. I have a script to startup a
mssql-server-linux
container, add the .bak file, restore it, and thendocker commit
that to a new image. However, whenever I startup a container from that new image, it is in recovery mode for both tempdb as well as my DB I restored. The recovery takes a good 6-8min which is a pain. As a test, I verified that all was well with the temp container before committing as a new image. I restarted it several times, and it never does a recovery mode. However, as soon as I save as a new image and then start a container off that new image it has to go through recovery.Is something happening when I create a new image that forces them into recovery the next time it starts up?