microsoft / navcontainerhelper

Official Microsoft repository for BcContainerHelper, a PowerShell module, which makes it easier to work with Business Central Containers on Docker.
MIT License

Failed Images getting created #1605

Closed jwikman closed 3 years ago

jwikman commented 3 years ago

Describe the issue: We're running a scheduled pipeline every night that recreates the images for all localizations if needed (i.e., when a new version has been released). Now and then there are some "dancing errors" when an image is being created, but the image is still created. If I rerun the same pipeline it has always worked, hence the "dancing errors" term... ;-)

I would like the image build to fail if there are any issues, whatever they are, when the image is built.

As it is right now, the image is created (even if, for example, no BC service is installed) and then all our pipelines that use this image fail with strange errors. If there instead were no image for the localization a pipeline is running against, the image would be created on-the-fly (and would only fail if the second image creation fails as well, but that is rare).

I got an example of this last night.

Scripts used to create container and cause the issue: Not that meaningful, since running the same script again works...

New-BcImage -artifactUrl "https://bcartifacts.azureedge.net/sandbox/17.2.19367.20652/be" -imageName "current:be" -LicenseFile $LicenseFile -multitenant:$false -includeTestToolkit -includeTestLibrariesOnly -memory "8G"

Full output of scripts: I don't think the full output is meaningful, since the purpose here is not to solve this particular error; it's the general error handling that is missing. Let me know if you want the full output anyway... An example of the creation of an image that had an error is below.

Step 1/6 : FROM mcr.microsoft.com/businesscentral:10.0.17763.1637
 ---> 0f9096121e62
Step 2/6 : ENV DatabaseServer=localhost DatabaseInstance=SQLEXPRESS DatabaseName=CRONUS IsBcSandbox=Y artifactUrl=https://bcartifacts.azureedge.net/sandbox/17.2.19367.20652/be
 ---> Running in 50778f11db79
Removing intermediate container 50778f11db79
 ---> c3b7f708de20
Step 3/6 : COPY my /run/
 ---> 848baf11de35
Step 4/6 : COPY NAVDVD /NAVDVD/
 ---> ec60c561e5d2
Step 5/6 : RUN \Run\start.ps1 -installOnly -includeTestToolkit -includeTestLibrariesOnly
 ---> Running in b64468ff3d8e
Using installer from C:\Run\150-new
Installing Business Central
Installing from DVD
Starting Local SQL Server
Starting Internet Information Server
Copying Service Tier Files
Copying PowerShell Scripts
Copying dependencies
Copying ReportBuilder
Importing PowerShell Modules
AuthorizationManager check failed.
Removing intermediate container b64468ff3d8e
 ---> 8d3b97e7701b
Step 6/6 : LABEL legal="http://go.microsoft.com/fwlink/?LinkId=837447"       created="202101080010"       nav=""       cu=""       country="BE"       version="17.2.19367.20652"       platform="17.0.19353.20441"
 ---> Running in 4fa996362907
Removing intermediate container 4fa996362907
 ---> 49962c28dbc5
Successfully built 49962c28dbc5
Successfully tagged current:be
Building image took 248 seconds

As you can see above, something failed in step 5 (AuthorizationManager check failed), but the build continues and the image is created. Now we've got an image that has no BC installed...
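(Until the image build itself fails on such errors, a possible stop-gap is to sanity-check the freshly built image before other pipelines use it. The snippet below is only a sketch of that idea; the service tier folder path and the current:be tag are assumptions taken from the example above, not something New-BcImage provides.)

# Hypothetical post-build check: fail the pipeline if the image lacks BC service tier files.
# The folder checked below is an assumption about where the service tier normally ends up.
$imageName = "current:be"
$result = docker run --rm $imageName powershell -Command `
    "Test-Path 'C:\Program Files\Microsoft Dynamics NAV'"
if ("$result" -notmatch 'True') {
    throw "Image $imageName was built, but no BC service tier files were found."
}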


tfenster commented 3 years ago

Same here. Maybe five times instead of ten, but it really can be tough to identify the source of the problem when this happens.

freddydk commented 3 years ago

Can you go back and see some of the other failed instances? Is it always failing in this location? If not, can you provide other examples? Thanks

freddydk commented 3 years ago

No need - I know what is wrong here. Will fix this in the next generic image.

freddydk commented 3 years ago

The problem here is in start.ps1 in the generic image. Originally this was designed to ignore failures when installing/running BC in order to keep the container running and allow you to get access to the event log, reconfigure and restart the service tier, etc. This means that any error happening during install is ignored. When building an image this of course should not happen. I will deploy new generic images with this fix within the next few days.
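(A rough sketch of the kind of change described, assuming a hypothetical $installOnly switch and install step; this is not the actual start.ps1 code from the generic image.)

# Sketch: swallow install errors only when the container should keep running,
# but let them fail the docker build when we are just installing (building an image).
try {
    Install-BusinessCentral   # hypothetical install step
}
catch {
    if ($installOnly) {
        # During 'RUN \Run\start.ps1 -installOnly ...' a re-thrown error makes the
        # docker build step fail instead of producing a broken image.
        throw
    }
    else {
        # Keep the container alive so the event log can be inspected and the
        # service tier reconfigured and restarted.
        Write-Host "Installation failed: $($_.Exception.Message)"
    }
}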

freddydk commented 3 years ago

BTW - it could still be relevant to check whether the common error here is "AuthorizationManager check failed." If that is the case, then it is probably a timing issue. "AuthorizationManager check failed" is thrown when a file is blocked or when Windows somehow has a lock on the file.

I have seen before that when copying stuff into a container, it sometimes takes a little while until things are available in the container.

I will make the first Import-Module (after copying) a little more resilient (wait 10 seconds and retry). This fix will be included in the next generic image.
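(A rough sketch of what such a retry could look like; the module path is made up and the real change in the generic image may differ.)

# Sketch: retry the first Import-Module after copying files into the container,
# since the copied files can be blocked/locked for a short while
# ("AuthorizationManager check failed").
$modulePath = 'C:\Run\SomeModule.psm1'   # hypothetical path
try {
    Import-Module $modulePath -DisableNameChecking
}
catch {
    Write-Host "Error: '$($_.Exception.Message)', Retrying in 10 seconds..."
    Start-Sleep -Seconds 10
    Import-Module $modulePath -DisableNameChecking
}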

freddydk commented 3 years ago

If you set this configuration setting:

{
    "genericImageName": "mcr.microsoft.com/businesscentral:{0}-dev"
}

Then you will get the generic 1.0.1.3 preview, likely to be shipped within the next few days.
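(For reference, one way to apply the setting is to write it to the BcContainerHelper configuration file before the module is imported; the path below is where the module normally looks for it, but verify it on your host, and note that this sketch overwrites any existing settings in the file.)

# Write the setting so that New-BcImage/New-BcContainer build on the preview generic image.
$configPath = 'C:\ProgramData\BcContainerHelper\BcContainerHelper.config.json'
@{ "genericImageName" = "mcr.microsoft.com/businesscentral:{0}-dev" } |
    ConvertTo-Json | Set-Content -Path $configPath -Encoding UTF8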

jwikman commented 3 years ago

Thanks Freddy

Since there was no error in the image rebuild pipeline when hitting this issue, it will be quite hard to find the last occurrences of this... When we have hit strange issues when creating a container from an image, we have just dropped the image and recreated it. But this time I realized why we had faulty images and reported it here instead...

Since we are creating images in parallel, it's likely a timing issue where the different pipelines are interfering with each other.

I suggest that we wait until this starts to throw errors instead, and then we'll report specific issues when we get any. Ok?

freddydk commented 3 years ago

Ok, great - and if you are saved by the delay you will see: Error: 'AuthorizationManager check failed.', Retrying in 10 seconds...

jwikman commented 3 years ago

I realized that it was not that hard to find those occurrences after all. The image rebuild jobs that have this issue take about 5 to 7 minutes, while the jobs that succeed take 10+ minutes.

So I just looked through the last month of runs of this pipeline, and it turned out that this issue was more common than I first thought, but it only hit localizations that we weren't using while they were faulty. I saw this 10+ times in the last month...

But all of them had the "AuthorizationManager check failed." error, so hopefully your fix will save us from this!

freddydk commented 3 years ago

gr8, and if the retrying doesn't work, we will have a look at why later...

freddydk commented 3 years ago

BTW - let me know when you have run a few pipelines on the new -dev image, thanks. I have run all my tests; they work, and the changes aren't that big.

jwikman commented 3 years ago

Last night I deleted all images and ran the pipeline - it created 30+ images without issues. And it was using Generic Tag: 1.0.1.3

But I could not find any retries after "AuthorizationManager check failed", so we probably need to wait a week or so until that shows up in the logs...

The nextminor images have been used in a lot of pipelines in our scheduled builds this morning, and they seem fine.

I saw your code change, nice and small. It should be fine...

freddydk commented 3 years ago

I will roll out 1.0.1.3 to public this week, thanks.