microsoft / navcontainerhelper

Official Microsoft repository for BcContainerHelper, a PowerShell module, which makes it easier to work with Business Central Containers on Docker.
MIT License

New-BCContainer fails to create container if using artifact (instead of image) with Transparent network #1189

Closed phamhainguyen68 closed 4 years ago

phamhainguyen68 commented 4 years ago

Issue: New-BCContainer fails midway, when creating container with Transparent network using artifact

Troubleshooting steps:

  1. Running New-BCContainer with default NAT network: no issue
  2. Running New-BCContainer using image, instead of artifact with both Transparent and NAT networks: works in both scenarios
  3. The output of the scripts suggests it fails at restoring database (New-NAVDatabase at line 182 of C:\run\navinstall.ps1 script)
    Determining Database Collation from c:\dl\sandbox\16.3.14085.15298\us\BusinessCentral-US.bak
    Restoring CRONUS Demo Database
    The database restore operation failed due to the following error returned by SQL Server:
    A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (
    provider: SQL Network Interfaces, error: 26 - Error Locating Server/Instance Specified)
    at <ScriptBlock>, C:\Run\navinstall.ps1: line 182
    at <ScriptBlock>, C:\Run\start.ps1: line 202
    at <ScriptBlock>, <No file>: line 1
  4. With that information, I logged into the container (using docker exec) and manually ran the New-NAVDatabase command:
    • Exactly the same error:
      
      New-NAVDatabase -DatabaseServer "localhost" -DatabaseInstance "SQLEXPRESS" -DatabaseName "CRONUS" -FilePath "c:\dl\sandbox\16.3.14085.15298\us\BusinessCentral-US.bak" -DestinationPath "c:\databases" -Timeout 300

      New-NAVDatabase : The database restore operation failed due to the following error returned by SQL Server:
      A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: SQL Network Interfaces, error: 26 - Error Locating Server/Instance Specified)
      At line:1 char:1

      DatabaseServer       DatabaseInstance DatabaseName
      --------------       ---------------- ------------
      localhost\SQLEXPRESS                  CRONUS


**Conclusion** In a Transparent network, the New-NAVDatabase command can only connect to SQL Server when the server name and instance are both specified in the DatabaseServer parameter, as **localhost\SQLEXPRESS**.

**Additional information** With a bit of further testing, I found that this issue also applies to the image method. We never caught it because the image method does not restore the database through the New-NAVDatabase command.

**Scripts used to create container and cause the issue**

$artifactUrl = Get-BCArtifactUrl -type sandbox -country us -select Latest
Write-Host -ForegroundColor Yellow "Using this artifact $artifactUrl to create container"

New-BCContainer -accept_eula -accept_outdated -containerName TEST -memoryLimit 8G -artifactUrl $artifactUrl -alwaysPull -doNotCheckHealth -auth Windows -restart unless-stopped -shortcuts Desktop -doNotExportObjectsToText -additionalParameters @('--network=TransparentNet --security-opt "credentialspec=file://gMSAnavnst.json"')


**Full output of scripts (please look at the last 15 lines)**

Using this artifact https://bcartifacts.azureedge.net/sandbox/16.3.14085.15298/us to create container
NavContainerHelper is version 0.7.0.23
NavContainerHelper is running as administrator
Host is Microsoft Windows Server 2019 Standard - ltsc2019
Docker Client Version is 19.03.5
Docker Server Version is 19.03.5
Fetching all docker images
Pulling image mcr.microsoft.com/dynamicsnav:10.0.17763.1339-generic
10.0.17763.1339-generic: Pulling from dynamicsnav
Using image mcr.microsoft.com/dynamicsnav:10.0.17763.1339-generic
Disabling Health Check (always report healthy)
Creating Container TEST
Version: 16.3.14085.15298-US
Style: sandbox
Platform: 16.0.14073.15236
Generic Tag: 0.1.0.13
Container OS Version: 10.0.17763.1339 (ltsc2019)
Host OS Version: 10.0.17763.1339 (ltsc2019)
Using locale en-US
Using process isolation
Disabling the standard eventlog dump to container log every 2 seconds (use -dumpEventLog to enable)
Additional Parameters:
--network=TransparentNet --security-opt "credentialspec=file://gMSAnavnst.json"
Files in C:\ProgramData\NavContainerHelper\Extensions\TEST\my:

freddydk commented 4 years ago

Could you try to add

-imageName myimg

to the command to make containerhelper create an image on the fly and run that.

phamhainguyen68 commented 4 years ago

Thank you for the fast response, Fred. Using imageName fixes the issue.

I guess this is because during the creation of the image, the temp container is spun up using NAT network and is able to restore the database. After the image has been created, the command builds the real container using the normal image method, hence no issue.

This does not fix the root cause (New-NAVDatabase) though. Could you escalate this to your internal team to see if they have any idea?

freddydk commented 4 years ago

Could it be that the issue is not the network but the gMSA setting? This is kind of the same problem as is reported here: https://github.com/microsoft/nav-docker/issues/478

phamhainguyen68 commented 4 years ago

Hi Freddy, It is indeed related to the use of gMSA. If I do not use that switch, the database restore step runs fine.

However, my situation is a bit different from the one in microsoft/nav-docker#478. The SQL service in my container does not stop after the collation change. It runs all the time.

In my case, the root issue is in the New-NAVDatabase command. The command is only able to connect to the SQL server using -DatabaseServer localhost\SQLEXPRESS (default setting in navinstall.ps1 script is -DatabaseServer localhost)

phamhainguyen68 commented 4 years ago

Update: New-NAVDatabase command also works with -DatabaseServer "localhost,1433" -DatabaseInstance "SQLEXPRESS" (i.e. specifying the default SQL port, 1433).

phamhainguyen68 commented 4 years ago

Update: using the -Verbose switch with the New-NAVDatabase command, I found that if we pass localhost to the -DatabaseServer switch, the command somehow converts it to the gMSA account name when executing:

New-NAVDatabase -DatabaseServer "localhost" -DatabaseInstance "SQLEXPRESS" -Verbose
VERBOSE: Performing the operation "New-NAVDatabase" on target "DatabaseServer = gMSAnavnst, DatabaseInstance = SQLEXPRESS, DatabaseName = CRONUS".

That is not the case for localhost,1433 or localhost\SQLEXPRESS

New-NAVDatabase -DatabaseServer "localhost,1433" -DatabaseInstance "SQLEXPRESS" -Verbose
VERBOSE: Performing the operation "New-NAVDatabase" on target "DatabaseServer = localhost,1433, DatabaseInstance = SQLEXPRESS, DatabaseName = CRONUS".
VERBOSE: Restoring database 'CRONUS' from backup file: 'c:\dl\sandbox\16.3.14085.15298\us\BusinessCentral-US.bak'...
VERBOSE: Restore of database 'CRONUS' from file 'c:\dl\sandbox\16.3.14085.15298\us\BusinessCentral-US.bak' completed successfully.
New-NAVDatabase -DatabaseServer "localhost\SQLEXPRESS" -Verbose
VERBOSE: Performing the operation "New-NAVDatabase" on target "DatabaseServer = localhost\SQLEXPRESS, DatabaseInstance = , DatabaseName = CRONUS".
VERBOSE: Restoring database 'CRONUS' from backup file: 'c:\dl\sandbox\16.3.14085.15298\us\BusinessCentral-US.bak'...
VERBOSE: Restore of database 'CRONUS' from file 'c:\dl\sandbox\16.3.14085.15298\us\BusinessCentral-US.bak' completed successfully.

Edit: I think the New-NAVDatabase command is hardcoded to convert "localhost" to the DNS/host name of the server. When I use gMSA, the container uses the gMSA account name as its network name. Given this finding, the quickest ways to work around the problem are either to modify the New-NAVDatabase call to use localhost\SQLEXPRESS, or to use the imageName switch as you suggested.

C:\Windows\System32\drivers\etc> ipconfig /all

Windows IP Configuration

   Host Name . . . . . . . . . . . . : gMSAnavnst
   Primary Dns Suffix  . . . . . . . : ourdomain.com
   Node Type . . . . . . . . . . . . : Hybrid
   IP Routing Enabled. . . . . . . . : No
   WINS Proxy Enabled. . . . . . . . : No
freddydk commented 4 years ago

That is really interesting. If you do not use gMSA, then I assume that New-NavDatabase won't use the hostname, right? My assumption is that this is a problem with gMSA and not with New-NavDatabase, right?

phamhainguyen68 commented 4 years ago

Hi Freddy,

Regardless of whether gMSA is used, New-NAVDatabase always uses the host name if you specify localhost in the DatabaseServer switch.

The command works without gMSA because, in that situation, the host name is the same as the container's name, not the gMSA account name.

In the example below, I created a container named TEST without gMSA. In this case, the host name of the container is also TEST. When I ping this host name from within the container, it responds, since it is internally equivalent to localhost (127.0.0.1).

New-NAVDatabase -DatabaseServer "localhost" -DatabaseInstance "SQLEXPRESS" -DatabaseName "CRONUS" -Verbose
VERBOSE: Performing the operation "New-NAVDatabase" on target "DatabaseServer = TEST, DatabaseInstance = SQLEXPRESS, DatabaseName = CRONUS".
C:\Windows\system32> docker exec -it TEST powershell
Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.
   Host Name . . . . . . . . . . . . : TEST
PS C:\Program Files\Microsoft Dynamics NAV\160\Service> ping TEST

Pinging TEST [fe80::e4cd:d17f:aa3e:f4d7%46] with 32 bytes of data:
Reply from fe80::e4cd:d17f:aa3e:f4d7%46: time<1ms
Reply from fe80::e4cd:d17f:aa3e:f4d7%46: time<1ms
Reply from fe80::e4cd:d17f:aa3e:f4d7%46: time<1ms
Reply from fe80::e4cd:d17f:aa3e:f4d7%46: time<1ms

When using gMSA, the host name inside the container is gMSAnavnst (the gMSA account name). This host name is just a virtual one, (I think) used by the container to establish a relationship with our domain. When you try to ping it from within the container, it cannot be resolved. This is why the New-NAVDatabase command fails.

New-NAVDatabase -DatabaseServer "localhost" -DatabaseInstance "SQLEXPRESS" -DatabaseName "CRONUS" -Verbose
VERBOSE: Performing the operation "New-NAVDatabase" on target "DatabaseServer = gMSAnavnst, DatabaseInstance = SQLEXPRESS, DatabaseName = CRONUS".
PS C:\Program Files\Microsoft Dynamics NAV\160\Service> ipconfig /all

Windows IP Configuration

   Host Name . . . . . . . . . . . . : gMSAnavnst
PS C:\Program Files\Microsoft Dynamics NAV\160\Service> ping gMSAnavnst
Ping request could not find host gMSAnavnst. Please check the name and try again.
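
The failure mode described in this comment (the name that localhost is rewritten to does not resolve inside the container) can be checked without ping. The following is a minimal sketch, not part of BcContainerHelper or the NAV cmdlets: `Test-NameResolves` is a hypothetical helper name, and it only tests name resolution, not SQL Server connectivity.

```powershell
# Hypothetical helper: returns $true if the name resolves to at least one address.
function Test-NameResolves {
    param(
        [Parameter(Mandatory = $true)]
        [string] $Name
    )
    try {
        $addresses = [System.Net.Dns]::GetHostAddresses($Name)
        return ($addresses.Count -gt 0)
    }
    catch {
        # Resolution failures surface as exceptions from the .NET call.
        return $false
    }
}

# Inside the container, compare the names discussed above.
foreach ($name in @('localhost', $env:COMPUTERNAME)) {
    if ($name) {
        Write-Host ("{0}: resolves = {1}" -f $name, (Test-NameResolves -Name $name))
    }
}
```

If the second name reports `False`, that would be consistent with the ping result above, where the gMSA account name could not be found.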
freddydk commented 4 years ago

I looked in the source code for New-NavDatabase and you are right: the switch happens there. However, it happens deep down in helper files used by lots of other things, and New-NavDatabase behaves differently if it determines that the database is on a local server. This means there might be other places where this switch happens, and changing it might have other consequences.

Could you try this:

I assume, that if you run

[System.Net.Dns]::GetHostEntry('localhost')

inside the container, it returns your gMSAnavnst.

Could you insert this (before the call to new-navdatabase):

set-content -Path 'c:\windows\system32\drivers\etc\hosts' -Value "127.0.0.1 $([System.Net.Dns]::GetHostEntry('localhost').HostName)"

Just as a test to see whether this fixes the issue.

This should cause all commands to know that this is localhost - and it would also cause other commands using the same low level function to resolve.

phamhainguyen68 commented 4 years ago

Hi Freddy,

Thank you for your suggestion. Modifying the hosts file was my first thought as well, but I do not know how to do that during container creation, specifically before the database restore step. I can do it manually by running PowerShell inside the container, and it fixes the issue with New-NAVDatabase. However, I have to wait for the New-BCContainer command to fail and exit first.

I even tried creating a DNS record pointing gMSAnavnst to 127.0.0.1 on our internal DNS server, but the container did not pick it up (it prioritizes its own internal record, I guess).

Also, New-NAVDatabase gets the "localhost" name from the same source as the ipconfig /all command, not from hostname or [System.Net.Dns]::GetHostEntry('localhost'). In the normal scenario (i.e. no gMSA), all of these return the same result, TEST. However, in the gMSA setup, ipconfig /all returns gMSAnavnst, while hostname and [System.Net.Dns]::GetHostEntry('localhost') return the container name (TEST2 in the output below):

PS C:\> ipconfig /all

Windows IP Configuration

   Host Name . . . . . . . . . . . . : gMSAnavnst
   Primary Dns Suffix  . . . . . . . : ourdomain.com

PS C:\> hostname
TEST2

PS C:\> [System.Net.Dns]::GetHostEntry('localhost')

HostName Aliases AddressList
-------- ------- -----------
TEST2    {}      {::1, 127.0.0.1}

Update: $env:computername also returns the same result as ipconfig /all in the gMSA scenario. So I guess your command can be modified to set-content -Path 'c:\windows\system32\drivers\etc\hosts' -Value "127.0.0.1 $env:computername"

freddydk commented 4 years ago

Download https://raw.githubusercontent.com/microsoft/nav-docker/master/generic/Run/150-new/navinstall.ps1 and place it on your machine (e.g. c:\temp\navinstall.ps1). Insert the line above at the beginning. Run new-bcContainer with -myscripts @("c:\temp\navinstall.ps1"). You could insert a Write-Host to see that your file gets executed.
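
The "insert the line at the beginning" step can be scripted once the file has been downloaded. This is a minimal sketch under stated assumptions: `Add-LinesToScriptStart` is a hypothetical helper name, and c:\temp\navinstall.ps1 is the example location from the comment above.

```powershell
# Hypothetical helper: writes $Lines at the top of the script file at $Path.
function Add-LinesToScriptStart {
    param(
        [Parameter(Mandatory = $true)] [string] $Path,
        [Parameter(Mandatory = $true)] [string[]] $Lines
    )
    # Read the whole file first, then rewrite it with the new lines in front.
    $existing = @(Get-Content -Path $Path)
    Set-Content -Path $Path -Value ($Lines + $existing)
}

$scriptPath = 'c:\temp\navinstall.ps1'   # example location, adjust as needed
if (Test-Path $scriptPath) {
    Add-LinesToScriptStart -Path $scriptPath -Lines @(
        # Marker so the container log shows that the custom script ran.
        'Write-Host "Custom navinstall.ps1 is running"',
        # Single quotes keep $env:computername literal, so it is evaluated inside the container.
        'Set-Content -Path ''c:\windows\system32\drivers\etc\hosts'' -Value "127.0.0.1 $env:computername"'
    )
}

# The modified file is then passed to the container creation command:
# New-BCContainer ... -myscripts @($scriptPath)
```

Running the helper twice prepends the lines twice, so it is best applied to a freshly downloaded copy.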

phamhainguyen68 commented 4 years ago

Awesome! I will try this.

Thank you for the suggestion.

freddydk commented 4 years ago

Let me know if this worked, then I will add this line in the generic image if the host name starts with gmsa.

phamhainguyen68 commented 4 years ago

Hi Freddy,

It works! I put the command (to modify the hosts file) right after lines 10 and 11 of the navinstall.ps1 file. This is the command I used:

set-content -Path 'c:\windows\system32\drivers\etc\hosts' -Value "127.0.0.1 $env:computername"

You should not rely on a "host name starts with gMSA" condition, simply because people can name their gMSA account however they want. It does not have to start with gMSA. Example: BCgMSAaccount, which makes the host name start with BC, not gMSA.

I think you can just hardcode this command without any condition (i.e. it always runs), because it gets the host name from $env:computername, which:

- returns the gMSA account name if you use gMSA.

PS C:> hostname
TEST

- returns the normal host name if you do not use gMSA.

PS C:> $env:computername
TEST

PS C:> hostname
TEST

In other words, it works in both situations and does not have any negative consequences.

freddydk commented 4 years ago

I don't want to just always add the line - it will probably break something we don't think about right here.

But I could test whether computername is different from hostname - and if that is the case - then add computername to hosts like you do here.

Thanks for the investigations.
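
The conditional approach proposed in this comment could look roughly like the sketch below. It is an illustration only, not the code that was later added to the generic image: `Get-LoopbackHostsEntry` is a hypothetical function name, and appending with Add-Content (instead of overwriting with Set-Content as in the test command earlier in the thread) is an assumption.

```powershell
# Hypothetical helper: returns the hosts line to add, or $null when no entry is needed.
function Get-LoopbackHostsEntry {
    param(
        [string] $ComputerName,
        [string] $HostName
    )
    if ([string]::IsNullOrWhiteSpace($ComputerName)) { return $null }
    # String comparison with -ne is case-insensitive in PowerShell.
    if ($ComputerName -ne $HostName) {
        return "127.0.0.1 $ComputerName"
    }
    return $null
}

# Compare the two names that were shown to differ under gMSA.
$dnsName = [System.Net.Dns]::GetHostEntry('localhost').HostName
$entry = Get-LoopbackHostsEntry -ComputerName $env:COMPUTERNAME -HostName $dnsName
if ($entry) {
    Write-Host "Would add to hosts file: $entry"
    # Assumption: append rather than overwrite, to keep any existing entries.
    # Add-Content -Path 'c:\windows\system32\drivers\etc\hosts' -Value $entry
}
```

Keeping the decision in a small function makes the condition easy to check outside a container, for example `Get-LoopbackHostsEntry -ComputerName 'gMSAnavnst' -HostName 'TEST'` yields a hosts line while matching names yield nothing.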

phamhainguyen68 commented 4 years ago

That is a good point!

Thank you very much for all the prompt and informative responses. Please let me know when you have implemented the change.

Have a good weekend!

freddydk commented 4 years ago

Generic image 0.1.0.14 has been released and the code is added to start.ps1 (not navinstall.ps1)

phamhainguyen68 commented 4 years ago

Hi Freddy,

Could you also seed the change to onprem artifacts? It only works with sandbox at the moment.

Thank you in advance!

freddydk commented 4 years ago

This bug fix was not in the artifacts - it was in the generic image. If something is wrong in onprem containers, either you have a cached image or something else is wrong. Please copy/paste the full output of the onprem container generation, along with the script.

phamhainguyen68 commented 4 years ago

@freddydk I just realized that as well. It turns out I am now having the exact same issue as this one: https://github.com/microsoft/nav-docker/issues/478. The SQL service just crashed after changing the collation. This only occurs for the onprem image. I will follow up with you through issue 478 from now on.

Determining Database Collation from c:\dl\onprem\16.4.14693.15445\na\database\Demo Database NAV (16-0).bak
Changing Database Server Collation to Latin1_General_100_CI_AS
Restoring CRONUS Demo Database
The database restore operation failed due to the following error returned by SQL Server:
A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - The remote computer refused the network connection.)
at <ScriptBlock>, C:\Run\navinstall.ps1: line 182
at <ScriptBlock>, C:\Run\start.ps1: line 213
at <ScriptBlock>, <No file>: line 1
PS C:\> Get-Service *SQL*

Status   Name               DisplayName
------   ----               -----------
Stopped  MSSQL$SQLEXPRESS   SQL Server (SQLEXPRESS)
freddydk commented 4 years ago

It will probably happen if you use a sandbox image, which uses a different collation than the original.