microsoft / IIS.ServiceMonitor

An entrypoint process for running IIS in Windows containers
MIT License

ServiceMonitor causes microsoft/iis docker container to exit with error #34

Closed robertoandrade closed 5 years ago

robertoandrade commented 6 years ago

Containers die with the following log:

Service 'w3svc' has been stopped
APPCMD failed with error code 259
Failed to update IIS configuration

Not sure if this is related to #29 and #4, but I have been seeing this when running the latest tag of microsoft/iis on a Windows Server 2016 AMI on AWS EC2.

mcy94w commented 6 years ago

Hi @robertoandrade: Could you please share your image so I can take a look?

robertoandrade commented 6 years ago

I'm running microsoft/iis:latest no mods.

mcy94w commented 6 years ago

Did you hit this issue recently?

robertoandrade commented 6 years ago

Yes, I'm only starting to work with containers on Windows after a few years of Unix. I first saw this yesterday. Similar issues reported against the IIS docker repo were always attributed to a ServiceMonitor bug that had been fixed and incorporated into the latest images, so I thought this could be another one of those, but a different one since the error code is not the same.

michha commented 6 years ago

Maybe your latest image is an old one. As of today microsoft/iis:latest has an image id of b8f924611ebb (created 3 weeks ago). Try a docker pull microsoft/iis to ensure having the latest version. I had no problems starting a container from this image.

robertoandrade commented 6 years ago

That's exactly the same I have. A way to repro I guess is setting up an ECS cluster which launches EC2 instances on AWS with their ECS-Optimized AMI (which is preloaded with Docker and both windowsservercore and nanoserver images) and then pull and try to run the container via an ECS task.

When I run it with t2.medium instance types it seems to fail with that code, once I switch over to c5.large or m5.large the problem seems to go away.

vjgn commented 6 years ago

@robertoandrade, I am trying something similar to what you are doing. How did you configure your task in EC2? It looks like EC2 has additional dependencies before launching the container. Exposing port 80 definitely conflicts with the microsoft/iis image. From the AWS docs (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/windows_task_IAM_roles.html#windows_task_IAM_roles_bootstrap):

The IAM roles for the task credential provider use port 80 on the container instance, so if you enable IAM roles for tasks on your container instance, your containers cannot use port 80 for the host port in any port mappings

There are additional bootstrap steps that must be wired into the container for it to launch correctly in AWS ECS. Did you get further in your testing?

robertoandrade commented 6 years ago

Didn't do anything special, was of course exposing port 80 as a different one at the EC2 instance level to avoid that conflict, but interestingly enough, same task definition running on an ECS cluster with a different instance type works just fine, I only see problems when running the cluster with t2.medium/large instance types.

me-viper commented 6 years ago

I have the same issue: "APPCMD failed with error code 259." For me the problem reproduces only when containers are handled by AWS ECS. When I start the same container on the same EC2 instance manually (i.e. docker run ...), everything works just fine.

robertoandrade commented 6 years ago

I noticed the same @me-viper, but when I switch the instance types of my ECS cluster to something larger than a t2, ECS launches the container just fine without errors.

lugoues commented 6 years ago

I see the same thing as @me-viper, and it is inconsistent. The container will fail to launch repeatedly for a time (sometimes hours), and then start working.

I'm already running m5.large and m5.xlarge instances, so it may not be related?

I grabbed all the event logs from a dead container and nothing seemed out of the ordinary, I can provide them if that would help.

mebyz commented 6 years ago

i'm experiencing the same issue on our aws ECS infrastructure

here are the steps we followed :

  1. create an aws ecs windows cluster (m5.large)
  2. create a task definition linked to our asp .Net app (a docker image stored on dockerhub)

when the task starts, here is what we see in the log (cloudwatch) :

Service 'w3svc' has been stopped
APPCMD failed with error code 259
Failed to update IIS configuration

then, the tasks stops (obviously)

just to be sure the issue is not related to our app specifically, we also tried switching our task definition to this public dockerhub image : microsoft/aspnet

but the issue is the same, and the logs are showing the same error(s)

me-viper commented 6 years ago

At this point I have a strong feeling that it's not ServiceMonitor failing but there is weird interaction with Amazon ECS. Still trying different scenarios, but:

That kinda makes sense, because it explains why starting containers manually works fine: the ECS service does not care about them. I still have to confirm that it wasn't random luck, but at least it's something.

mebyz commented 6 years ago

you're perfectly right @me-viper !

🐰 First I want to confirm that the docker image containing my app is valid : when i docker run it manually, there is no problem, even when i try directly on the ECS host instance my app runs just fine.

that fact, plus your last comment ("weird interaction with Amazon ECS") lead me to consider that the problem only occurs when my app container is started BY the ECS agent

So I wondered how ECS agent could impact the execution of the container context, and i remembered that the agent actually uses the values I entered myself in the Task Definition panel to run my container such as : network mode, memory and cpu limits, etc etc

I eventually managed to reach a stable state by tweaking my task definition (memory, cpu limits) and now my ECS service runs just fine ! 👍👍

here you can see the values i used, which made the service become stable again :

[image: amazon_ecs]

mebyz commented 6 years ago

Now I'm pretty confident that the problem comes from a memory or resource issue, maybe resulting in an IIS startup failure in the container and the subsequent container death.

  1. I reached a steady state after tweaking my ECS task definition (i.e. my task's memory limits)

  2. That could explain the inconsistency @Lugoues experienced, as the ECS instance resources can vary over time (relative to the instance usage)

  3. That could also explain why @robertoandrade told us his app runs fine on c5/m5 instances while failing on t2 instances: t2 instances have less memory, CPU, etc.

If you guys confirm that tweaking your task definition solves the issue for you (as shown in my last comment), I'll open a ticket on the AWS support page

Maybe they can add something to their docs about it, or fix their task creation workflow

me-viper commented 6 years ago

Hopefully I've got good news (sorry for formatting - GitHub is not cooperating today). I was digging through the ECS Agent source code and found a reference to issue #1127. I've also observed that my ECS-launched containers have CpuPercent=1, which indeed seems like the root of the problem: the container just doesn't get enough computing power (and that is why it works on larger instances). Now, if you look into the actual "temporary fix" (agent version 1.17.1), there is a magic environment variable, ECS_ENABLE_CPU_UNBOUNDED_WINDOWS_WORKAROUND. I set it to true and restarted the ECS Agent, and all my container instances started working (docker inspect shows CpuPercent=0). So:

  1. Leave "Task CPU" in your task definition blank
  2. In EC2 Instance, hosting ECS, add environment variable ECS_ENABLE_CPU_UNBOUNDED_WINDOWS_WORKAROUND=true
  3. Restart ECS Agent service.

The drawback of this workaround is that your container instances will use all available CPUs.

me-viper commented 6 years ago

One more bit that makes picture complete. Task definition UI is pretty misleading, to say the least. We've got two things that affect CPU:

  1. Task definition itself:

    [image: taskcpu]
  2. Container definition:

    [image: containercpu]

From what I observe, if you set Task CPU > 0 (screen 1) you have to set CPU units to something > 0 (screen 2) otherwise you'll end up with CPUPercent=1. If you don't set Task CPU you need to set ECS_ENABLE_CPU_UNBOUNDED_WINDOWS_WORKAROUND=true or you, again, end up with CPUPercent=1.

I guess, that makes some sense but, for me personally, it was far from obvious.

mebyz commented 6 years ago

Brilliant @me-viper !

bariscaglar commented 6 years ago

@mcy94w please take a look at the error message produced by appcmd and see if service monitor can translate it to something more meaningful.

bariscaglar commented 6 years ago

From a code point of view, we were not able to find the lower limit one needs to run the IIS container. @shirhatti will experiment to find an empirical limit.

rymancl commented 6 years ago

@me-viper - For some reason the environment variable ECS_ENABLE_CPU_UNBOUNDED_WINDOWS_WORKAROUND=true isn't working for me. I set the environment variable, restarted the agent service, and even deployed new tasks.

I have ECS Agent version 1.18.0 on the latest ECS Optimized AMI for Windows. Instance is m4.xlarge.

I do not set Task CPU or Task Memory. If I set Container CPU and Container Memory, all is fine, tasks are perfect. If I set Container CPU to 0 or do not set it, my tasks continually go to STOPPED with the error mentioned in this issue.

I have inspected the STOPPED containers and confirmed CpuPercent is still being set to 1, not 0. If I set my Container CPU, CpuPercent get set correctly (ex: 12).

Any idea what may be wrong here?

me-viper commented 6 years ago

I've gone through ECS Agent Code and it seems ECS_ENABLE_CPU_UNBOUNDED_WINDOWS_WORKAROUND affects only Task CPU, you still have to set Container CPU to some meaningful value. Basically, if you use this flag your tasks will be allowed to use all resources specified by Container CPU.

Personally, I'd avoid using this flag. Just set both Task CPU and Container CPU and call it a day.

rymancl commented 6 years ago

Thank you so much for the reply. So if you set Task CPU and use the env variable, does Container CPU even matter? Does Task CPU override Container CPU? (If I only set Container CPU and not Task CPU, Container CPU is used to calculate resources). Here are 2 examples I just encountered:

  1. 4096 total CPU registered. Task CPU = 1024, Container CPU = 512, 2 tasks running. Results in 2048 CPU available.
  2. 4096 total CPU registered. Task CPU = 512, Container CPU = 512, 2 tasks running. Results in 3072 CPU available.

I guess also, does Task CPU even matter if it is supposedly unbounded? Thanks again for the advice and help.
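The two examples above are consistent with each running task reserving its Task CPU value out of the instance's registered CPU units. Here is a minimal Python sketch of that arithmetic, assuming this simple reservation model. The function name `available_cpu_units` is hypothetical and the model is inferred from the numbers in this comment, not taken from the ECS source:

```python
def available_cpu_units(registered: int, reserved_per_task: int, running_tasks: int) -> int:
    """Remaining CPU units, assuming every running task reserves a fixed amount."""
    return registered - reserved_per_task * running_tasks


# Example 1: Task CPU = 1024, 2 tasks running
print(available_cpu_units(4096, 1024, 2))
# Example 2: Task CPU = 512, 2 tasks running
print(available_cpu_units(4096, 512, 2))
```

Under this model the reservation follows Task CPU when it is set, which would explain why the two examples differ even though Container CPU is 512 in both.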

me-viper commented 6 years ago

Hmm. Disregard my previous answer. That is not quite what is happening. Sorry about that.

TaskCPU indeed does nothing for windows containers. The only thing that matters for windows containers is ContainerCPU Units and ECS_ENABLE_CPU_UNBOUNDED_WINDOWS_WORKAROUND pair.

According to the code:

  1. if Container CPU Units = 0 (or not set) and UNBOUNDED = true => actual container CpuPercent = 0
  2. if Container CPU Units = 0 (or not set) and UNBOUNDED = false => actual container CpuPercent = 1
  3. if Container CPU Units <= Physical CPUs Number * 1024 / 100 => actual container CpuPercent = 1 (UNBOUNDED is ignored)
  4. otherwise, actual container CpuPercent = Container CPU Units * 100 / (Physical CPUs Number * 1024) (UNBOUNDED is ignored)

For whatever reason option 1 is not working for you.
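The four cases above can be written as a small Python sketch. This is an illustration of the rules as listed in this comment, not the ECS Agent code. The function name `cpu_percent` is hypothetical, and truncating the result to an integer is an assumption (it matches the CpuPercent of 12 reported earlier for 512 CPU units on a 4-vCPU instance):

```python
def cpu_percent(container_cpu_units: int, physical_cpus: int, unbounded: bool) -> int:
    """Map ECS Container CPU units to a Windows container CpuPercent value."""
    if container_cpu_units == 0:
        # Cases 1 and 2: only here does the UNBOUNDED flag matter.
        return 0 if unbounded else 1
    if container_cpu_units <= physical_cpus * 1024 / 100:
        # Case 3: the request is below 1% of the host, so it is clamped to 1.
        return 1
    # Case 4: share of the host's total CPU units, truncated (assumption).
    return container_cpu_units * 100 // (physical_cpus * 1024)


print(cpu_percent(0, 4, unbounded=True))     # case 1
print(cpu_percent(0, 4, unbounded=False))    # case 2
print(cpu_percent(40, 4, unbounded=True))    # case 3
print(cpu_percent(512, 4, unbounded=False))  # case 4
```

A CpuPercent of 0 means no CPU cap is applied, while 1 restricts the container to roughly 1% of the host CPU, which is the starved state that makes IIS configuration time out.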

I've run several experiments and everything looks fine.

  1. Created new cluster (t2.small instance)
  2. Created new task definition (microsoft/iis image) with both TaskCPU and ContainerCPU empty
  3. Created service with this task definition.
  4. As expected containers were failing because of CpuPercent=1 (case 2)
  5. Added system environment variable ECS_ENABLE_CPU_UNBOUNDED_WINDOWS_WORKAROUND=true
  6. Restarted "Amazon ECS" windows service.
  7. Containers started fine, CpuPercent=0 (case 1)

Hope that helps

rymancl commented 6 years ago

@me-viper Thanks for the clarification. I was confused about which CPU units the UNBOUNDED env variable was actually affecting since the ECS console does say Task CPU isn't supported by Windows containers.

I've determined that my issue is that for whatever reason, none of my environment variables are being set via user data:

<powershell>
Import-Module ECSTools
[Environment]::SetEnvironmentVariable("ECS_CONTAINER_START_TIMEOUT", "30m", [System.EnvironmentVariableTarget]::Machine)
[Environment]::SetEnvironmentVariable("ECS_DISABLE_METRICS", "false", [System.EnvironmentVariableTarget]::Machine)
[Environment]::SetEnvironmentVariable("ECS_RESERVED_MEMORY", "4096", [System.EnvironmentVariableTarget]::Machine)
[Environment]::SetEnvironmentVariable("ECS_ENABLE_CPU_UNBOUNDED_WINDOWS_WORKAROUND", "true", [System.EnvironmentVariableTarget]::Machine)
Initialize-ECSAgent -Cluster 'Test-Cluster' -EnableTaskIAMRole
</powershell>

When I was testing earlier, I tried setting them on the instance directly via the UI and also with PowerShell, but without specifying the "Machine" level, which is likely why it didn't work.

BUT, when I run the exact commands from my user data above on the instance, then restart the agent service, my containers do become healthy and CpuUnits=0 as expected! Also my cluster reports 4096 registered, 4096 available with 2 tasks running.

I'm going to keep digging as to why these aren't getting set properly via user data.

Thanks again for all your help!!

jdebbink commented 6 years ago

@rymancl after you run those commands, also run them like this:

$env:ECS_ENABLE_CPU_UNBOUNDED_WINDOWS_WORKAROUND = "true"
$env:ECS_DISABLE_METRICS = "false"
...

PTC-JoshuaMatthews commented 6 years ago

My Windows Containers on service fabric are dying with the exact same log messages as the OP. I am running on a local 1 node cluster. Is there a resolution to this yet?

I can pull the same docker image locally and run it without errors.

EDIT: In my case the issue was that I was running docker images built for a kernel version incompatible with my service fabric VMs. I had to go hunting for a while to find a good image based on microsoft/windowsservercore

lugoues commented 6 years ago

I eventually gave up on using the built in service monitor and instead used a combination of this script Wait-Service.ps1 and the following code to handle IIS environment variables.

Write-Output "Setting Environment Variables for Default Web App..."
$exclusionList = @('ProgramFiles(x86)', 'CommonProgramFiles(x86)','TMP','TEMP','USERNAME','USERPROFILE','APPDATA','LOCALAPPDATA','PROGRAMDATA','PSMODULEPATH','PUBLIC','USERDOMAIN','ALLUSERSPROFILE','PATHEXT','PATH','COMPUTERNAME','COMSPEC','OS','PROCESSOR_IDENTIFIER','PROCESSOR_LEVEL','PROCESSOR_REVISION','PROGRAMFILES','PROGRAMFILES','PROGRAMW6432','SYSTEMDRIVE','WINDIR','NUMBER_OF_PROCESSORS','PROCESSOR_ARCHITECTURE','SYSTEMROOT','COMMONPROGRAMFILES','COMMONPROGRAMFILES','COMMONPROGRAMW6432','DRIVERDATA')
$envVars = $(gci env:* | where-object { $_.Name -notin $exclusionList}) | ForEach-Object { "/+`"[name='DefaultAppPool'].environmentVariables.[name='$($_.Name)',value='$($_.Value)']`"" }

& 'C:/windows/system32/inetsrv/appcmd' 'set' 'config' '-section:system.applicationHost/applicationPools' $envVars '/commit:apphost'
.\Wait-Service.ps1 -ServiceName W3SVC

This doesn't handle duplicates properly. I ditched fixing it since it wasn't a use case we needed to support, but I hope this provides someone with a way out of this ServiceMonitor headache.

peterngai commented 5 years ago

@Lugoues, in reference to your workaround instead of using ServiceMonitor, are you basically implementing something like the following (taken from the Dockerfile of IIS)? I'd like to give this a try myself, but am just seeking some clarity:

FROM microsoft/windowsservercore:1803

RUN powershell -Command Add-WindowsFeature Web-Server;

Write-Output "Setting Environment Variables for Default Web App..."
$exclusionList = @('ProgramFiles(x86)', 'CommonProgramFiles(x86)','TMP','TEMP','USERNAME','USERPROFILE','APPDATA','LOCALAPPDATA','PROGRAMDATA','PSMODULEPATH','PUBLIC','USERDOMAIN','ALLUSERSPROFILE','PATHEXT','PATH','COMPUTERNAME','COMSPEC','OS','PROCESSOR_IDENTIFIER','PROCESSOR_LEVEL','PROCESSOR_REVISION','PROGRAMFILES','PROGRAMFILES','PROGRAMW6432','SYSTEMDRIVE','WINDIR','NUMBER_OF_PROCESSORS','PROCESSOR_ARCHITECTURE','SYSTEMROOT','COMMONPROGRAMFILES','COMMONPROGRAMFILES','COMMONPROGRAMW6432','DRIVERDATA')
$envVars = $(gci env:* | where-object { $_.Name -notin $exclusionList}) | ForEach-Object { "/+`"[name='DefaultAppPool'].environmentVariables.[name='$($_.Name)',value='$($_.Value)']`"" }

& 'C:/windows/system32/inetsrv/appcmd' 'set' 'config' '-section:system.applicationHost/applicationPools' $envVars '/commit:apphost'
.\Wait-Service.ps1 -ServiceName W3SVC

EXPOSE 80

lugoues commented 5 years ago

@peterngai yes but I do it in a run script. Below are my Dockerfile and run.ps1. My file structure has a rootfs directory, which contains run.ps1 and Wait-Service.ps1, next to the Dockerfile that gets copied in its entirety to the root of the image. It also expects the application code to be in an app directory next to the Dockerfile, but you can easily change that.

./Dockerfile

FROM microsoft/aspnet:4.7.1-windowsservercore-ltsc2016

SHELL ["powershell", "-Command", "$ErrorActionPreference = 'Stop'; $ProgressPreference = 'SilentlyContinue';"]

######################### install splunk
COPY installers /installers

RUN \
# Install IIS Rewrite module
    & 'c:/installers/rewrite_amd64_en-US.msi' /qn /quiet /norestart; Get-Process -Name "msiexec" | Wait-Process; \

# Cleanup
    Remove-Item -Path /installers -Recurse;\
# setup IIS
    Install-WindowsFeature NET-Framework-45-ASPNET ; \
    Install-WindowsFeature Web-Asp-Net45; \
    Remove-WebSite -Name 'Default Web Site';\
    mkdir 'c:\app'; \
    New-Website -Name 'Default Web Site' -Port 80 -PhysicalPath 'c:\app' -ApplicationPool 'DefaultAppPool';

COPY rootfs /
COPY app /app

ENTRYPOINT ["powershell", "c:/run.ps1"]

./rootfs/run.ps1

if( Test-Path env:TZ) {
  Write-Output "Setting TimeZone: $env:TZ"
  Set-TimeZone "$env:TZ"
}

# ToDo: This should remove already set environmentVariables otherwise it will fail setting duplicates
Write-Output "Setting Environment Variables for Default Web App..."
$exclusionList = @('ProgramFiles(x86)', 'CommonProgramFiles(x86)','TMP','TEMP','USERNAME','USERPROFILE','APPDATA','LOCALAPPDATA','PROGRAMDATA','PSMODULEPATH','PUBLIC','USERDOMAIN','ALLUSERSPROFILE','PATHEXT','PATH','COMPUTERNAME','COMSPEC','OS','PROCESSOR_IDENTIFIER','PROCESSOR_LEVEL','PROCESSOR_REVISION','PROGRAMFILES','PROGRAMFILES','PROGRAMW6432','SYSTEMDRIVE','WINDIR','NUMBER_OF_PROCESSORS','PROCESSOR_ARCHITECTURE','SYSTEMROOT','COMMONPROGRAMFILES','COMMONPROGRAMFILES','COMMONPROGRAMW6432','DRIVERDATA')
$envVars = $(gci env:* | where-object { $_.Name -notin $exclusionList}) | ForEach-Object { "/+`"[name='DefaultAppPool'].environmentVariables.[name='$($_.Name)',value='$($_.Value)']`"" }

& 'C:/windows/system32/inetsrv/appcmd' 'set' 'config' '-section:system.applicationHost/applicationPools' $envVars '/commit:apphost'

Write-Output "Post Config"
$varOut = $(& 'C:/windows/system32/inetsrv/appcmd' @('list', 'config', '-section:system.applicationHost/applicationPools'))
Write-Output $varOut

Write-Output "Starting Service Monitor..."
.\Wait-Service.ps1 -ServiceName W3SVC

ishu3101 commented 5 years ago

I'm getting the following error.

$ docker pull microsoft/iis
Using default tag: latest
latest: Pulling from microsoft/iis
Digest: sha256:7164927df4caa4064f291263b692d3bb842f5ca8ab9515757b5e1da6b5656112
Status: Image is up to date for microsoft/iis:latest

$ docker run -it microsoft/iis

 Service 'w3svc' has been stopped

 Service 'w3svc' started

CTRL signal received. The process will now terminate.

It gets stuck there until you press Ctrl+C, at which point you see the "CTRL signal received. The process will now terminate." message.

shirhatti commented 5 years ago

@ishu3101 That's expected behavior. If you don't want ServiceMonitor to block, I'd recommend running docker run -d microsoft/iis

ishu3101 commented 5 years ago

How do u run in interactive mode though?

shirhatti commented 5 years ago

@ishu3101 Please file a new issue as this thread is tracking a separate issue

ishu3101 commented 5 years ago

@shirhatti Managed to figure it out.

bariscaglar commented 5 years ago

It looks like there were multiple issues being discussed but the most prominent one due to the bug in Amazon ECS is resolved with workarounds suggested. If there are other unresolved issues please file them separately.

JoseFMP commented 5 years ago

For me this issue keeps happening. No AWS here. Just bare metal.

slavah commented 5 years ago

Issue closed ... but same issue asp.net:4.8 docker on ECS. Any solution except hacky workarounds?

MattJeanes commented 3 years ago

I had exactly the same exception, but for me it was caused by CPU starvation by setting the request/limits too low in Kubernetes.

chrisjohnson00 commented 3 years ago

Sorry to bump a closed issue... But @MattJeanes - what requests/limits did you settle on? Even without limits defined I get this error.

MattJeanes commented 3 years ago

@chrisjohnson00 for me, using an ASP.NET MVC app, I settled on about 200 millicores and 400 megabytes of memory. If you don't have limits and your node has plenty of capacity, maybe your issue is something else, but try giving it at least those values in the requests and see what happens. Good luck!

caractacus commented 3 years ago

[SOLVED] After getting stuck with this, I decided to clone the repo and make some adjustments in order to determine what causes this.

In IISConfigUtil.cpp line 231 there is a 5-second timeout for APPCMD to complete. Containers that run only IIS may achieve this, but in the real world this time limit is somewhat optimistic.

I increased the timeout to 30 seconds, which seems to be necessary when more than 2 processors are available to the container. Counter-intuitive, maybe, but the load on my container on startup is very high, as a legacy app was ported and runs multiple Tomcat8 servers.

https://github.com/microsoft/IIS.ServiceMonitor

IISConfigUtil.cpp 231: WaitForSingleObject(pi.hProcess, 30000);
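As background, 259 is the Win32 STILL_ACTIVE value, which is what an exit-code query returns for a process that has not finished yet. That fits a wait that gave up before APPCMD completed. A minimal Python sketch of the same pattern, under the assumption that ServiceMonitor reads the exit code right after the timed wait (the helper `run_with_timeout` and the sleeping child process are hypothetical stand-ins, not ServiceMonitor code):

```python
import subprocess
import sys

STILL_ACTIVE = 259  # Win32 exit-code value for a process that is still running


def run_with_timeout(args, timeout_s):
    """Start a child process and wait at most timeout_s seconds for it.

    Returns the child's exit code, or STILL_ACTIVE if the wait timed out.
    """
    proc = subprocess.Popen(args)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # clean up the child in this sketch
        proc.wait()
        return STILL_ACTIVE


# Stand-in for a slow APPCMD run on a CPU-starved container.
slow_child = [sys.executable, "-c", "import time; time.sleep(1)"]

print(run_with_timeout(slow_child, 0.1))  # wait too short: reports 259
print(run_with_timeout(slow_child, 10))   # wait long enough: reports 0
```

The child succeeds in both runs. Only the length of the wait decides whether the caller sees a failure, which is why both raising the timeout and giving the container more CPU make the error go away.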

Yes, docker compose would be the way to design this, but the original legacy product requires all components on a single host.

siobam commented 3 years ago

Try to change NT service startup command type from auto to manual.

Saibamen commented 3 years ago

This should be done in official dotnet Docker images...

houssem11957 commented 2 years ago

I faced this problem when I tried to run the website hosted on IIS. It was my mistake: I was just using docker run ImageName, but instead I had to use docker run --name containerName -d -p 5006:80 imagename. I had to map the website to a specific port. Thanks.