microsoft / azure-container-apps

Roadmap and issues for Azure Container Apps

Intermittent "Connection refused" to MSI endpoint #568

Open malthe opened 1 year ago

malthe commented 1 year ago

Issue description

Running a container app, we're seeing intermittent connectivity issues when executing az login --identity.

ERROR: MSI: Failed to retrieve a token from 'http://localhost:42356/msi/token/?resource=https://management.core.windows.net/&api-version=2017-09-01' with an error of 'HTTPConnectionPool(host='localhost', port=42356): Max retries exceeded with url: /msi/token/?resource=https://management.core.windows.net/&api-version=2017-09-01 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f63f27ac2b0>: Failed to establish a new connection: [Errno 111] Connection refused'))'.

That is, "[Errno 111] Connection refused", suggesting that the service has somehow not been brought up yet.

Steps to reproduce

  1. Run a container app where the first action is to login using managed identity, e.g.

    command: ['/bin/bash']
    args: ['-c', 'az login -o none --identity && ./entrypoint.sh']
  2. Observe intermittent connectivity issues.

Expected behavior

The MSI endpoint should be ready immediately.

Actual behavior

The MSI endpoint is not always available.

samavos commented 1 year ago

We've found the same behaviour - we have a container app that fetches secrets on startup, and we sometimes see that fail. Our process then gets started a few seconds later, and it all works. Overall this increases our startup time a lot, causing real issues for us.

For some further insight on this, I added a crude timing loop as follows:

    // Repeatedly dial the local managed identity endpoint on port 42356 until it
    // accepts a connection, logging every 20 failed attempts and the total elapsed
    // time once it succeeds.
    log.Info().Msg("Timing MSI startup - connecting to port 42356")
    connectTimeouts := 0
    startupTime := time.Now()
    for {
        d := net.Dialer{Timeout: time.Millisecond * 50}
        conn, err := d.Dial("tcp", "localhost:42356")
        if err != nil {
            time.Sleep(time.Millisecond * 50)
            connectTimeouts++
            if connectTimeouts%20 == 0 {
                log.Warn().Msgf("Timing MSI startup - still waiting - %d", connectTimeouts)
            }
            continue
        }
        log.Info().
            Msgf("Timing MSI startup - done - %d - in %d ms", connectTimeouts, time.Since(startupTime).Milliseconds())
        conn.Close()
        break
    }

Which resulted in the following log:

    3/27/2023, 12:56:46.702 PM Timing MSI startup - connecting to port 42356
    3/27/2023, 12:56:47.716 PM Timing MSI startup - still waiting - 20
    3/27/2023, 12:56:48.728 PM Timing MSI startup - still waiting - 40
    3/27/2023, 12:56:49.742 PM Timing MSI startup - still waiting - 60
    3/27/2023, 12:56:50.755 PM Timing MSI startup - still waiting - 80
    3/27/2023, 12:56:51.211 PM Timing MSI startup - done - 89 - in 4509 ms

So admittedly just one sample, but taking approximately 4.5 seconds to start up does not seem ideal.

vturecek commented 1 year ago

Just a quick update: we are testing a fix for this that waits for the managed identity endpoint to be ready before your containers are started, so connection failures should be very rare even when you use the endpoint on startup.

mmigala commented 1 year ago

Having the same issue.

We just started using Azure Container Apps, connecting to Azure App Configuration with a managed identity.

We're seeing the error below. The app works for some time and then stops, and restarting doesn't help. Then after about an hour it starts working again.

Unhandled exception. Azure.Identity.AuthenticationFailedException: ManagedIdentityCredential authentication failed: Retry failed after 4 tries. Retry settings can be adjusted in ClientOptions.Retry or by configuring a custom retry policy in ClientOptions.RetryPolicy. (Connection refused (localhost:42356)) (Connection refused (localhost:42356)) (Connection refused (localhost:42356)) (Connection refused (localhost:42356))
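
The retry budget that error message points at can be widened on the credential. A rough sketch of the analogous knob in Go's azidentity (using Go to match the code earlier in the thread; the retry values and token scope below are illustrative):

    package main

    import (
        "context"
        "log"
        "time"

        "github.com/Azure/azure-sdk-for-go/sdk/azcore"
        "github.com/Azure/azure-sdk-for-go/sdk/azcore/policy"
        "github.com/Azure/azure-sdk-for-go/sdk/azidentity"
    )

    func main() {
        // Widen the credential's built-in retry policy so transient
        // "connection refused" errors at startup are retried for longer
        // before they surface. Values here are illustrative.
        cred, err := azidentity.NewManagedIdentityCredential(&azidentity.ManagedIdentityCredentialOptions{
            ClientOptions: azcore.ClientOptions{
                Retry: policy.RetryOptions{
                    MaxRetries: 6,
                    RetryDelay: 2 * time.Second,
                },
            },
        })
        if err != nil {
            log.Fatal(err)
        }

        tok, err := cred.GetToken(context.Background(), policy.TokenRequestOptions{
            Scopes: []string{"https://management.azure.com/.default"},
        })
        if err != nil {
            log.Fatal(err)
        }
        log.Printf("token acquired, expires %s", tok.ExpiresOn)
    }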


malthe commented 1 year ago

A workaround using bash:

    timeout 10s bash -c "until az login --identity 2>/dev/null; do sleep 1; done" || exit 1
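
For apps that use an SDK rather than the az CLI, the same wait can be done in code before anything asks for a token. A minimal sketch in Go, assuming the endpoint is published through an IDENTITY_ENDPOINT (or legacy MSI_ENDPOINT) environment variable:

    package main

    import (
        "fmt"
        "log"
        "net"
        "net/url"
        "os"
        "time"
    )

    // waitForIdentityEndpoint blocks until the local managed identity endpoint
    // accepts TCP connections, or gives up after the timeout.
    func waitForIdentityEndpoint(timeout time.Duration) error {
        // Assumed App Service-style variables; adjust to whatever the platform sets.
        endpoint := os.Getenv("IDENTITY_ENDPOINT")
        if endpoint == "" {
            endpoint = os.Getenv("MSI_ENDPOINT")
        }
        u, err := url.Parse(endpoint)
        if err != nil || u.Host == "" {
            return fmt.Errorf("no identity endpoint found in environment")
        }
        deadline := time.Now().Add(timeout)
        for time.Now().Before(deadline) {
            conn, err := net.DialTimeout("tcp", u.Host, 50*time.Millisecond)
            if err == nil {
                conn.Close()
                return nil
            }
            time.Sleep(100 * time.Millisecond)
        }
        return fmt.Errorf("identity endpoint %s not reachable after %s", u.Host, timeout)
    }

    func main() {
        if err := waitForIdentityEndpoint(30 * time.Second); err != nil {
            log.Fatal(err)
        }
        log.Println("identity endpoint is up; continuing startup")
    }
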
mvromer commented 1 year ago

I see the same error as mmigala in the exact same scenario. I have a .NET 7 minimal API app that connects to an Azure App Config when the app first starts, and I see a number of connection refused errors when the App Config provider tries to acquire a token through the MSI endpoint using a ManagedIdentityCredential. I have been seeing this occur more frequently over the past 2-3 weeks (I can't recall seeing this at all in the 6 months prior).

mmigala commented 1 year ago

I found a reason why this was happening for me.

The problem was that the minimum replica count wasn't set to 1.

The app was scaling to zero and then couldn't start back up.

From docs:

Make sure you create a scale rule or set minReplicas to 1 or more if you don't enable ingress. If ingress is disabled and you don't define a minReplicas or a custom scale rule, then your container app will scale to zero and have no way of starting back up.

Hope this helps others.

Hjortsberg commented 1 year ago

We are affected by this as well. Ingress is enabled and minimum scale is set to 1, but that does not remedy the problem. 😕

davidbarratt commented 1 year ago

I noticed this happening, and using a readiness probe doesn't fix the problem. I'm requesting an access token on each request (it's a PHP application), and intermittently the endpoint isn't accessible, or it randomly returns a 403.

mario-d-s commented 3 months ago

@vturecek any news about that fix that was being tested one year ago?

waynebrantley commented 3 months ago

@vturecek This happens multiple times every week. Was reported 1.5 years ago and a fix was being tested over a year ago. Please advise.

vturecek commented 3 months ago

@mario-d-s, @waynebrantley - sorry for the delay. We made a couple updates to help with this, depending on the type of environment you're running on:

In a Workload Profile Consumption environment, we now support managed identity in init containers. By default, managed identity starts up during the init phase of your application. Containers that run during the main phase should be able to access the local managed identity endpoint immediately, because we wait for the init phase to complete before switching to main. However, init containers may start before managed identity is available and may need to perform retries (see the sketch below).

In all other environments, we don't yet support managed identity for init containers. However, for those environments, we don't start your container until the local managed identity endpoint is available.
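
For init containers in that first case, wrapping token acquisition in a retry loop covers the window before the endpoint is reachable. A minimal sketch in Go with azidentity (the attempt count, backoff, and token scope are illustrative):

    package main

    import (
        "context"
        "log"
        "time"

        "github.com/Azure/azure-sdk-for-go/sdk/azcore"
        "github.com/Azure/azure-sdk-for-go/sdk/azcore/policy"
        "github.com/Azure/azure-sdk-for-go/sdk/azidentity"
    )

    // getTokenWithRetry keeps requesting a token until it succeeds or the
    // attempt budget runs out, covering the window in which the local
    // managed identity endpoint is not reachable yet.
    func getTokenWithRetry(ctx context.Context, scope string, attempts int) (azcore.AccessToken, error) {
        cred, err := azidentity.NewManagedIdentityCredential(nil)
        if err != nil {
            return azcore.AccessToken{}, err
        }
        var lastErr error
        for i := 0; i < attempts; i++ {
            tok, err := cred.GetToken(ctx, policy.TokenRequestOptions{Scopes: []string{scope}})
            if err == nil {
                return tok, nil
            }
            lastErr = err
            log.Printf("token attempt %d/%d failed: %v", i+1, attempts, err)
            time.Sleep(2 * time.Second) // simple fixed backoff; tune as needed
        }
        return azcore.AccessToken{}, lastErr
    }

    func main() {
        tok, err := getTokenWithRetry(context.Background(), "https://management.azure.com/.default", 10)
        if err != nil {
            log.Fatal(err)
        }
        log.Printf("token acquired, expires %s", tok.ExpiresOn)
    }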

@waynebrantley are you seeing connection refused errors when your container starts? If so, in what phase (main or init) and what kind of environment are you on (Workload profile or consumption-only)?

mario-d-s commented 3 months ago

@vturecek we are on a Dedicated Workload profile and do not use init containers. So if I understand correctly, the following from your response applies:

for those environments, we don't start your container until the local managed identity endpoint is available.

That is not what we're seeing. Multiple times a day we are getting errors from different containers that look like this:

Azure.Identity.AuthenticationFailedException: ManagedIdentityCredential authentication failed: Retry failed after 6 tries. Retry settings can be adjusted in ClientOptions.Retry or by configuring a custom retry policy in ClientOptions.RetryPolicy. (Connection refused (localhost:42356)) (Connection refused (localhost:42356)) (Connection refused (localhost:42356)) (Connection refused (localhost:42356)) (Connection refused (localhost:42356)) (Connection refused (localhost:42356))

This is tripping up our monitoring. We will look into increasing the threshold for container restarts at which we get notified, but that is just a workaround; it should simply not be happening.

waynebrantley commented 2 months ago

@vturecek sorry for the delayed reply. We are technically using a 'consumption' profile, but due to networking issues the Azure team has us on some kind of dedicated workload profile!

We are seeing those errors when the containers try to start in the main phase. We do not have init containers at this time.

This happens quite often.

waynebrantley commented 3 weeks ago

These errors must have been fixed as we are not seeing them anymore.