microsoft / SDN

This repo includes PowerShell scripts and VMM service templates for setting up the Microsoft Software Defined Networking (SDN) Stack using Windows Server 2016

Failed to start service "Software load balancer host agent" (SlbHostAgent) #9

Closed deep-sky closed 8 years ago

deep-sky commented 8 years ago

This happens at step 6 of the SDNExpress script. What could be causing this behavior? Sometimes the same thing happens with the NC host agent service. I am using a single-host configuration with a single NIC. Thanks.

grcusanz commented 8 years ago

A single host with a single NIC generally works fine. When the SLB host agent service won't start, it's usually because the information used to build c:\windows\system32\slbhpcfg.xml on the host was not correct. Parameters to check in the config file:

deep-sky commented 8 years ago

grcusanz, thank you for the prompt response and help. I have checked the data against all the items you outlined above, and it seems that slbhpcfg.xml was missing the IP in the SlbmVipEndpoint element. This is how it was generated:

<?xml version="1.0" encoding="utf-8"?>
...
    <SlbManager>
        ...
        <SlbmVipEndpoints>
            <SlbmVipEndpoint>:8570</SlbmVipEndpoint>
        </SlbmVipEndpoints>
        ...
    </SlbManager>
    ...
</SlbHostPluginConfiguration>
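A quick way to spot this condition is to parse the file and flag any SlbmVipEndpoint whose text has no host part before the port. The following is a minimal Python sketch, not part of SDNExpress: the sample document is reduced to the elements shown above, and the function name `endpoints_missing_host` is hypothetical. It matches on the local element name so that it still works if the real file declares an XML namespace (an assumption, since the full file is not shown here).

```python
import xml.etree.ElementTree as ET

# Reduced sample built only from the fragment quoted in this thread.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<SlbHostPluginConfiguration>
    <SlbManager>
        <SlbmVipEndpoints>
            <SlbmVipEndpoint>:8570</SlbmVipEndpoint>
        </SlbmVipEndpoints>
    </SlbManager>
</SlbHostPluginConfiguration>
"""


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def endpoints_missing_host(xml_bytes):
    """Return the text of every SlbmVipEndpoint that lacks a host or a port."""
    root = ET.fromstring(xml_bytes)
    bad = []
    for element in root.iter():
        if local_name(element.tag) != "SlbmVipEndpoint":
            continue
        text = (element.text or "").strip()
        # Split on the last colon: "host:port" -> ("host", ":", "port").
        host, _, port = text.rpartition(":")
        if not host or not port:
            bad.append(text)
    return bad


# Usage on a host (path taken from the thread):
#   with open(r"c:\windows\system32\slbhpcfg.xml", "rb") as f:
#       print(endpoints_missing_host(f.read()))
```

An empty list means every endpoint has both a host and a port; it does not prove the address itself is correct.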

However, what is really interesting is why the SLBMVIP was not populated in the XML. In the previous step the SLBMVIP was successfully assigned (the subnet is public and has a valid IP pool, where 192.168.1.120 is the first IP in the pool's range, as expected):

VERBOSE: [NC-08]: [[Script]ConfigureSLBManager] Found public subnet.
VERBOSE: [NC-08]: [[Script]ConfigureSLBManager] SLBMVIP is 192.168.1.120.
VERBOSE: [NC-08]: [[Script]ConfigureSLBManager] Checking f8f67956-3906-4303-94c5-09cf91e7e33.
VERBOSE: [NC-08]: [[Script]ConfigureSLBManager] subnet 192.168.1.0/24.
VERBOSE: [NC-08]: [[Script]ConfigureSLBManager] Payload follows:
VERBOSE: [NC-08]: [[Script]ConfigureSLBManager] {
    "resourceId":  "config",
    "properties":  {
        "vipIpPools":  [
        {
         "resourceRef": "/logicalnetworks/f8f67956-3906-4303-94c5-09cf91e7e311/subnets/852930a1-7712-45c1-9d20-ac6163c300d2/ipPools/0595b40e-ba70-41ef-b497-4dbb516cb459"
        }
        ],
    "OutboundNatIPExemptions":  [
        "192.168.1.120/32"
    ],
    "loadbalancermanageripaddress":  "192.168.1.120"
    }
}
VERBOSE: [NC-08]: [[Script]ConfigureSLBManager] PUT https://nc-08rest.sdn.local/Networking/v1/loadbalancermanager/config with -1-byte payload
VERBOSE: [NC-08]: [[Script]ConfigureSLBManager] received 595-byte response of content type application/json; charset=utf-8
VERBOSE: [NC-08]: [[Script]ConfigureSLBManager] JSON Get [/loadbalancermanager/config]
VERBOSE: [NC-08]: [[Script]ConfigureSLBManager] GET https://nc-08rest.sdn.local/Networking/v1/loadbalancermanager/config with 0-byte payload
VERBOSE: [NC-08]: [[Script]ConfigureSLBManager] received 595-byte response of content type application/json; charset=utf-8
VERBOSE: [NC-08]: [[Script]ConfigureSLBManager] Finished configuring SLB Manager
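The consistency of a payload like the one logged above can be checked offline. The sketch below is a hedged Python illustration, not something the SDNExpress script does: it copies the JSON from the log, then confirms that the load balancer manager address lies inside the logged subnet (192.168.1.0/24) and is covered by an outbound NAT exemption. The function name `check_slbm_vip` is hypothetical.

```python
import ipaddress
import json

# Payload copied from the verbose log in this thread.
PAYLOAD = """
{
    "resourceId": "config",
    "properties": {
        "vipIpPools": [
            {
                "resourceRef": "/logicalnetworks/f8f67956-3906-4303-94c5-09cf91e7e311/subnets/852930a1-7712-45c1-9d20-ac6163c300d2/ipPools/0595b40e-ba70-41ef-b497-4dbb516cb459"
            }
        ],
        "OutboundNatIPExemptions": ["192.168.1.120/32"],
        "loadbalancermanageripaddress": "192.168.1.120"
    }
}
"""


def check_slbm_vip(payload_text, subnet):
    """Report whether the SLBM VIP is inside the subnet and NAT-exempted."""
    properties = json.loads(payload_text)["properties"]
    vip = ipaddress.ip_address(properties["loadbalancermanageripaddress"])
    network = ipaddress.ip_network(subnet)
    exemptions = [
        ipaddress.ip_network(entry)
        for entry in properties["OutboundNatIPExemptions"]
    ]
    return {
        "vip_in_subnet": vip in network,
        "vip_exempted": any(vip in entry for entry in exemptions),
    }


print(check_slbm_vip(PAYLOAD, "192.168.1.0/24"))
```

Both checks pass for the logged payload, which supports the observation that the VIP assignment itself succeeded and the problem lies later in the deployment.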

The reason the SLBMVIP was not populated in the XML is the [EnableVFP] routine, which is responsible for port and feature settings. Once it has run (with no errors), the forwarding extension on the Hyper-V switch is enabled and the REST name no longer resolves to an IP. I purposely disabled the forwarding extension to see whether the REST name would resolve again, and it does: with the extension disabled, I can make requests both from a browser and from the console.

Could you please advise on that matter?
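To make the before/after comparison repeatable, the name-resolution check can be scripted and run once with the forwarding extension enabled and once with it disabled. This is a small Python sketch under stated assumptions: the helper `resolves` is hypothetical, and the resolver is passed in as a parameter only so the function can be exercised without a live DNS server.

```python
import socket


def resolves(name, resolver=socket.getaddrinfo):
    """Return the sorted addresses that name resolves to, or [] on failure."""
    try:
        infos = resolver(name, None)
    except socket.gaierror:
        # Resolution failed, e.g. the DNS server is unreachable.
        return []
    # Each getaddrinfo entry ends with a sockaddr tuple; item 0 is the address.
    return sorted({info[4][0] for info in infos})


# Usage on the host (REST name taken from the log above):
#   print(resolves("nc-08rest.sdn.local"))
```

An empty list with the extension enabled and a non-empty list with it disabled would reproduce the behavior described here.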


One more interesting thing... Before the [EnableVFP] call, the [SetNCConnection] and [HostAgent] routines are performed. Although they report no errors, the NC Host Agent service is not running (it started and then terminated unexpectedly):

VERBOSE: [HOME-WIN2016]: LCM:  [ Start  Resource ]  [[Script]SetNCConnection]
VERBOSE: [HOME-WIN2016]: LCM:  [ Start  Test     ]  [[Script]SetNCConnection]
VERBOSE: [HOME-WIN2016]: LCM:  [ End    Test     ]  [[Script]SetNCConnection]  in 0.0000 seconds.
VERBOSE: [HOME-WIN2016]: LCM:  [ Start  Set      ]  [[Script]SetNCConnection]
VERBOSE: [HOME-WIN2016]:                            [[Script]SetNCConnection] Performing the operation "Set-TargetResource" on target "Executing the SetScript with the user supplied credential".
VERBOSE: [HOME-WIN2016]: LCM:  [ End    Set      ]  [[Script]SetNCConnection]  in 0.0780 seconds.
VERBOSE: [HOME-WIN2016]: LCM:  [ End    Resource ]  [[Script]SetNCConnection]
VERBOSE: [HOME-WIN2016]: LCM:  [ Start  Resource ]  [[Script]HostAgent]
VERBOSE: [HOME-WIN2016]: LCM:  [ Start  Test     ]  [[Script]HostAgent]
VERBOSE: [HOME-WIN2016]: LCM:  [ End    Test     ]  [[Script]HostAgent]  in 0.0150 seconds.
VERBOSE: [HOME-WIN2016]: LCM:  [ Start  Set      ]  [[Script]HostAgent]
VERBOSE: [HOME-WIN2016]:                            [[Script]HostAgent] Performing the operation "Set-TargetResource" on target "Executing the SetScript with the user supplied credential".
VERBOSE: [HOME-WIN2016]: LCM:  [ End    Set      ]  [[Script]HostAgent]  in 0.7350 seconds.
VERBOSE: [HOME-WIN2016]: LCM:  [ End    Resource ]  [[Script]HostAgent]
...
VERBOSE: [HOME-WIN2016]: LCM:  [ Start  Set      ]  [[Script]EnableVFP]
...
grcusanz commented 8 years ago

Thanks for the detailed information. The host agent is what unblocks the ports after the VFP extension is enabled. Since the host agent is not running it can't unblock the ports.

On your Hyper-V host open up regedit and navigate to HKLM:\SYSTEM\CurrentControlSet\Services\NcHostAgent\Parameters. Check that:

deep-sky commented 8 years ago

Hello, grcusanz. Thanks for the suggestions.

>>>> On your Hyper-V host open up regedit and navigate to HKLM:\SYSTEM\CurrentControlSet\Services\NcHostAgent\Parameters. Check that:

Here are my keys/values (seem to be good):

Connections                  |    ssl:192.168.1.99:6640 pssl:6640:
HostAgentCertificateCName    |    home-win2016.sdn.local
PeerCertificateCName         |    NC-08REST.SDN.LOCAL
ServiceDll                   |    %SystemRoot%\System32\NcHostAgent.dll
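When comparing Connections values between hosts or builds, it can help to split the string into its parts so that small differences (such as a trailing colon) stand out. The Python sketch below only tokenizes the value; it makes no claim about what the correct format is for any given build, and the function name `parse_connections` is hypothetical.

```python
def parse_connections(value):
    """Split a Connections string into (scheme, fields) pairs.

    Entries are assumed to be whitespace-separated, with colon-separated
    fields after the scheme. An empty trailing field means the entry
    ended with a colon.
    """
    entries = []
    for token in value.split():
        scheme, _, rest = token.partition(":")
        fields = rest.split(":") if rest else []
        entries.append((scheme, fields))
    return entries


# Value copied from the registry listing above.
for scheme, fields in parse_connections("ssl:192.168.1.99:6640 pssl:6640:"):
    print(scheme, fields)
```

For the value above, the second entry ends with an empty field, which makes the trailing colon after 6640 visible at a glance.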

Also, I remember a case with HostAgent-related routine called [Firewall-HostAgent]

Script Firewall-HostAgent
{                                      
    SetScript = {
        Enable-netfirewallrule "Microsoft-Windows-Hyper-V-HostAgent"
        Enable-netfirewallrule "Microsoft-Windows-Hyper-V-HostAgent-WCF"
        Enable-netfirewallrule "Microsoft-Windows-Hyper-V-HostAgent-WCF-TLS"
        ...

The script was erroring out, stating that the rule for WCF-TLS (the last one) was not found on the system and thus could not be enabled. I am not sure why it was not there (a fresh WIN2016TP3 build 10514 is used). I then inspected the details of the second rule (-WCF) and added a similar new one for TLS (port 6640) with the following parameters (omitting captions, etc. here):

.... -Enabled True -direction Inbound -LocalPort 6640 -action Allow -protocol "TCP"

The group was specified as "@%systemroot%\system32\NcHostAgent.dll,-502"

Below are the output lines from executing the following cmdlet:

 Get-NetFirewallRule Microsoft-Windows-Hyper-V-HostAgent-WCF-TLS

 Name                  : Microsoft-Windows-Hyper-V-HostAgent-WCF-TLS
 DisplayName           : Network Controller Host Agent WCF TLS (TCP-In)
 Description           : Allow inbound port 6640 TCP traffic to Network Controller Host Agent for WCF (TLS)
 DisplayGroup          : Network Controller Host Agent Firewall Group
 Group                 : @%systemroot%\system32\NcHostAgent.dll,-502
 Enabled               : True
 Profile               : Any
 Platform              : {}
 Direction             : Inbound
 Action                : Allow
 ....

By the way, the following routines worked without issues as well:

Firewall-REST
Firewall-OVSDB

Certificates were created and distributed successfully [by the SDNExpress script] across the infrastructure. No certificate-related issues so far. Appreciate your efforts spent on it.

deep-sky commented 8 years ago

Attaching some details on host agent crash obtained from the System Event Log, just in case:

The NC Host Agent service entered the running state.

[ in 1 second or less ]

The NC Host Agent service terminated unexpectedly.

Details:

Faulting application name: svchost.exe_NcHostAgent, version: 10.0.10514.0, time stamp: 0x55c6a396
Faulting module name: ucrtbase.dll, version: 10.0.10514.0, time stamp: 0x55c6a2ee
Exception code: 0xc0000409
Fault offset: 0x000000000006145e
.... 
Faulting module path: C:\WINDOWS\SYSTEM32\ucrtbase.dll
....
grcusanz commented 8 years ago

You mention you're using "WIN2016TP3 build 10514". I suspect that is your problem, because the format of the string in the Connections parameter changed between TP3 and TP4. This version of the SDNExpress scripts is designed for TP4 only. Please download TP4 from here, update your host OS and the VHDs for the VMs created by the script, and try again.

deep-sky commented 8 years ago

grcusanz, thanks for the tip! I'll try TP4.

deep-sky commented 8 years ago

The TP4 worked well! The topology was created successfully, both NC Host and SLB Host agent services are healthy now.

The tenant VMs with the web app sample were also deployed, and the config script (IIS/RDP) executed smoothly. The /loadBalancers/_TenantName__SLB structure for the load balancer is populated with data and available via the REST service. I can also see front-end and back-end IP configurations for the SLB that correspond to the data in the tenant config. Very nice so far.

Now, should requests be balanced across the tenant VMs simply by accessing the balancer via its front-end IP? Over https? I ask because that does not work for me in the browser (front-end and back-end ports are both 80). Could you please clarify?

More questions in my head... Should the tenant VMs be accessible via their IPs from the host? I think a "direct server response" routing model is used by the forwarding switch extension, right? Or a two-arm bridge mode? These details would be great to know.


p.s. there are some typos in the SDN/SDNExpress/TenantApps/DBTier.ps1 file:

line 1: it should be "powershell" instead of "powreshell"
line 40: DbTier instead of WebTier

grcusanz commented 8 years ago

Very good! When you run the SDNExpressTenant script, it will create the VMs and place them into a virtual network. That virtual network is completely isolated from the outside world by default. The script then adds a VIP that provides inbound access to the WebTier only on port 80. You are correct about direct server response being used. The response from the WebTier VMs will go directly out to the client machine from the host without going back through the mux. This is because the forwarding extension is able to perform the SNAT on the outbound path automatically, which gives it the advantage of appearing to go back through the mux without actually doing so. The script also configures outbound NAT for the WebTier VMs, so they should also be able to reach the outside world.

To verify that everything in your environment is working, first disable the Windows Firewall in the guest VMs and ping from the dbtier to the webtier. Then enable the web server role on the webtier VMs. This is not currently being done automatically. Note that the web tier VMs will not be able to ping each other, because the script includes an example ACL that prevents them from seeing each other. If pings within the virtual network are successful, next verify that the BGP peering is successful and that the VIP is advertised to your router via BGP. If that looks good, you should be able to reach the VIP with HTTP on port 80 from the outside. Also be aware that the VIP will not respond to ping, so don't be concerned if you try to ping it and don't get a response.
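Because the VIP does not answer ping, a TCP connection attempt on port 80 is a more useful reachability test for the last step. The following is a minimal Python sketch of such a check; the helper name `tcp_reachable` and the 3-second default timeout are assumptions for illustration, not part of the SDNExpress scripts.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError on refusal, timeout, or no route.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Usage from a machine outside the virtual network, with your own VIP:
#   print(tcp_reachable("<VIP address>", 80))
```

A True result only shows that something accepted the connection on that port; fetching a page over HTTP is still needed to confirm that the web tier is actually serving content through the VIP.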

To enable https, you will need to modify the SDNExpressTenant script to add an additional load balancing rule for port 443. It is our hope that you will modify the tenant script as necessary to meet the needs of your applications. Please provide feedback on how it works out for you.

Thanks!

deep-sky commented 8 years ago

Hello grcusanz, thank you for the detailed response. I have several questions, though, as the definitions of these final steps are still quite hard to find in the TechNet docs on SDN. I reviewed the SDNExpressTenant script and found a hard-coded DNS server IP on line 372. It says "10.60.34.9", while it should be $Network.DNSServers[0] as far as I can see. Right?

>>>> you should first disable the windows firewall in the guest VMs

Did you mean all guest VMs, including the mux, management gateway, and network controller? Should I keep the firewall on the host machine? I suppose not, as it uses ACL-based routing, correct?

>>>> Then enable the web server role on the webtier VMs. 
>>>> This is not currently being done automatically. 

Is this supposed to be enabled via the Roles and Features panel in Server Manager, or must it be done by invoking the script from the config folder in the root? I am asking because running the script enables IIS/ASPNET/RDP, which are only part of the web server role's capabilities available via Server Manager. IIS/ASPNET seem sufficient at this point, but it would be good to know which way is intended (I ran the script).

>>>> Note that the web tier VMs will not be able to ping each other due to an example of 
>>>> an ACL that is included that prevents them from seeing each other. 

This is not true for me: with the firewall disabled on these VMs, I can ping between VMs of the web tier subnet. It seems like ACLs are not in effect.

So, currently I have separate subnets for:

 Web tier VMs: 192.168.2.10 and 192.168.2.11 (and can ping between them)
 DB Tier VM: 192.168.3.10 (and can ping VMs from the web tier)
 SLB Manager at 192.168.1.120 and Tenant_SLB itself at 192.168.1.125 
 --- like I described in my messages before
 Host: I can't access web tier VMs via http:80 (ok), but can't access them via the Tenant_SLB:80 either.

I know it's really hard to say for sure, as an SDN deployment has too many edge cases, but am I missing something at this point? Much appreciated!

deep-sky commented 8 years ago

Feel free to ask for any related data if you need it. Thanks.

deep-sky commented 8 years ago

I'd appreciate any thoughts on having ping working between tenant web tier VMs and SLBMVIP not working on the host at the same time. Thanks!

AlexisLessard commented 8 years ago

Hello! I know this issue was for TP3, but I'm having the same issue on TP5 after using the VMM template. The file c:\windows\system32\slbhpcfg.xml is not present, so I'm guessing that changed between TP3 and TP5. Here's the output of the Get-NetFirewallRule Microsoft-Windows-Hyper-V-HostAgent-WCF-TLS cmdlet:

Get-NetFirewallRule Microsoft-Windows-Hyper-V-HostAgent-WCF-TLS

Name                  : Microsoft-Windows-Hyper-V-HostAgent-WCF-TLS
DisplayName           : Network Controller Host Agent WCF over TLS (TCP-In)
Description           : Allow inbound port 443 HTTPS traffic to Network Controller Host Agent for WCF
DisplayGroup          : Network Controller Host Agent Firewall Group
Group                 : @%systemroot%\system32\NcHostAgent.dll,-502
Enabled               : True
Profile               : Any
Platform              : {}
Direction             : Inbound
Action                : Allow
EdgeTraversalPolicy   : Block
LooseSourceMapping    : False
LocalOnlyMapping      : False
Owner                 :
PrimaryStatus         : OK
Status                : The rule was parsed successfully from the store. (65536)
EnforcementStatus     : NotApplicable
PolicyStoreSource     : PersistentStore
PolicyStoreSourceType : Local

Here are the values from HKLM:\SYSTEM\CurrentControlSet\Services\NcHostAgent\Parameters

Connections: ssl:10.12.104.14:6640 pssl:6640
HostAgentCertificateCName: ul-stw-ts-hst01.ul.ca
PeerCertificateCName: ul-stw-ts-nc1.ul.ca
ServiceDLL: %SystemRoot%\System32\NcHostAgent.dll

That all seems good. However, the firewall rule doesn't seem correct, as its description says "Allow inbound port 443 HTTPS traffic to Network Controller Host Agent for WCF", while the connection port is written as pssl. I'll try to either create a new rule or just shut down the firewall entirely. For test purposes it'll do. But do you know of any other issues that could have caused this?

AlexisLessard commented 8 years ago

Here's another output:

PS C:\Users\*******> Start-Service SlbHostAgent

Start-Service : Service 'Software Load Balancer Host Agent (SlbHostAgent)' cannot be started due to the following
error: Cannot open SlbHostAgent service on computer '.'.
At line:1 char:1
+ Start-Service SlbHostAgent
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OpenError: (System.ServiceProcess.ServiceController:ServiceController) [Start-Service],
   ServiceCommandException
    + FullyQualifiedErrorId : CouldNotStartService,Microsoft.PowerShell.Commands.StartServiceCommand

PS C:\Users\******> $error[0]|Format-List -Force

writeErrorStream      : True
Exception             : Microsoft.PowerShell.Commands.ServiceCommandException: Service 'Software Load Balancer Host
                        Agent (SlbHostAgent)' cannot be started due to the following error: Cannot open SlbHostAgent
                        service on computer '.'. ---> System.InvalidOperationException: Cannot open SlbHostAgent
                        service on computer '.'. ---> System.ComponentModel.Win32Exception: Access is denied
                           --- End of inner exception stack trace ---
                           at System.ServiceProcess.ServiceController.GetServiceHandle(Int32 desiredAccess)
                           at System.ServiceProcess.ServiceController.Start(String[] args)
                           at
                        Microsoft.PowerShell.Commands.ServiceOperationBaseCommand.DoStartService(ServiceController
                        serviceController)
                           --- End of inner exception stack trace ---
TargetObject          : SlbHostAgent
CategoryInfo          : OpenError: (System.ServiceProcess.ServiceController:ServiceController) [Start-Service],
                        ServiceCommandException
FullyQualifiedErrorId : CouldNotStartService,Microsoft.PowerShell.Commands.StartServiceCommand
ErrorDetails          :
InvocationInfo        : System.Management.Automation.InvocationInfo
ScriptStackTrace      : at <ScriptBlock>, <No file>: line 1
PipelineIterationInfo : {0, 1}
PSMessageDetails      :

Hmm, access is denied. I am running this as administrator on my Hyper-V host. Maybe I should try running it as a different user?

JMesser81 commented 8 years ago

@AlexisLessard - could you please report this issue on our TechNet forum: https://social.technet.microsoft.com/Forums/en-US/home?forum=WinServerPreview

I will reach out to the SCVMM team to have someone contact you.

Thanks

AlexisLessard commented 8 years ago

@JMesser81 Reported the issue here: https://social.technet.microsoft.com/Forums/en-US/3b328b97-9aad-458a-9c20-e7cff86e86de/issue-with-sdn-templates-software-load-balancer-host-agent-is-not-running-on-the-computer?forum=WinServerPreview

benpiper commented 7 years ago

I'm having this issue as well with RTM. The SlbmVipEndpoint in the XML file just has :8570 without the VIP. Prior to this, the script does assign the SLBMVIP an IP address from one of the IP pools, but the HTTP PUT for the loadbalancermanager yields a "Cannot parse the request" error.