microsoft / azure-container-apps

Roadmap and issues for Azure Container Apps
MIT License

22s cold start for hello world image #997

Open MaksymShuldiner opened 9 months ago

MaksymShuldiner commented 9 months ago


Issue description

After creating an Azure Container App from the default Hello World image, the cold start takes 22 seconds before a response comes back. That is far too long; I saw the same delay with my own web app and assumed it was caused by my container.

(screenshot attached)

Steps to reproduce

  1. Create Azure Container App from default image
  2. Track time for the cold start

Expected behavior - Cold start for such a small container should be 1-2 seconds

Actual behavior - 22 seconds
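
For anyone who wants to reproduce the timing in step 2 above, here is a minimal measurement sketch (TypeScript on Node 18+, which provides a global fetch). The URL is a placeholder for your own app's FQDN, and it assumes the app has already been idle long enough to scale to zero:

```typescript
// measure-cold-start.ts -- sketch: time the first response from an idle (scaled-to-zero) app.
// APP_URL is a placeholder; replace it with your Container App's FQDN.
const APP_URL = "https://<your-app>.azurecontainerapps.io/";

async function measureColdStart(): Promise<void> {
  const start = performance.now();
  const response = await fetch(APP_URL);
  await response.text(); // drain the body so the full response is timed
  const seconds = (performance.now() - start) / 1000;
  console.log(`status=${response.status} time-to-first-response=${seconds.toFixed(1)}s`);
}

measureColdStart().catch((err) => {
  console.error("request failed:", err);
  process.exit(1);
});
```

Running it once against an idle app and again immediately afterwards gives a rough cold-versus-warm comparison; the system logs (ContainerAppSystemLogs_CL) show when the app was last deactivated.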

MaksymShuldiner commented 9 months ago

Also, I've tried launching a 70 MB Node.js container and the delay is still 20s. What could be the reason?
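
To separate application startup from platform overhead, a bare-bones server like the sketch below (TypeScript, Node's built-in http module, no framework) typically goes from module load to listening in well under 100 ms; if the end-to-end cold start is still ~20s, the rest has to be spent in the platform's scale-from-zero path. The port fallback and log format are only illustrative:

```typescript
// server.ts -- minimal hello-world server; logs how long it took from module load to listening.
import { createServer } from "node:http";

const bootStart = performance.now();
const port = Number(process.env.PORT ?? 80);

const server = createServer((_req, res) => {
  res.writeHead(200, { "Content-Type": "text/plain" });
  res.end("hello world\n");
});

server.listen(port, () => {
  const bootMs = performance.now() - bootStart;
  console.log(`listening on port ${port} (module load to listen: ${bootMs.toFixed(0)} ms)`);
});
```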

MaksymShuldiner commented 9 months ago

(screenshot attached) However, the logs show minimal delay in container pull and start (5 seconds)

srgplus commented 9 months ago

+1 for that one.

guibranco commented 9 months ago

Also having this issue with .NET 7 APIs

MaksymShuldiner commented 9 months ago

> Also having this issue with .NET 7 APIs

Actually, I've redeployed the container to Cloud Run and got a cold start of 2.5 seconds, so the problem is definitely not the container itself.

howang-ms commented 9 months ago

We are investigating this.

pollaktamas commented 9 months ago

Experiencing the same issue with a Node image. I am exposing an http endpoint with the default scaler.

It takes around 20 seconds for the server to respond. According to the system logs, container creation + startup is 1s in total, pulling the image from the registry is 5s, and the only log covering the remaining 14s is: Scaled apps/v1.Deployment k8se-apps/<my-app> from 0 to 1

edgarhuichen commented 9 months ago

Same here. I have a simple Python app and it takes around 30s to start the container and return a simple response. One thing I noticed is that the health probe often fails, so the service retries, which adds delay.
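
One thing that can shave off the probe-retry portion of the delay (though not the platform-side provisioning) is to bind the listener before doing any heavy initialization, so the probes succeed on their first attempt. A rough sketch, assuming the default probes only need the target port to accept connections; slowInit() and /healthz are placeholders for whatever the app actually does:

```typescript
// Sketch: accept probes immediately, finish slow initialization in the background.
import { createServer } from "node:http";

let ready = false;

async function slowInit(): Promise<void> {
  // placeholder for whatever takes long at startup (loading models, warming caches, ...)
  await new Promise((resolve) => setTimeout(resolve, 5_000));
  ready = true;
}

const server = createServer((req, res) => {
  if (req.url === "/healthz") {
    // probe target: the process is up even if init hasn't finished yet
    res.writeHead(200).end("ok");
    return;
  }
  if (!ready) {
    res.writeHead(503, { "Retry-After": "2" }).end("warming up");
    return;
  }
  res.writeHead(200).end("hello\n");
});

// Bind the port first so probes stop failing, then kick off the slow work.
server.listen(Number(process.env.PORT ?? 80), () => {
  console.log("listening; starting background init");
  void slowInit();
});
```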

howang-ms commented 9 months ago

A quick update: we are working on the improvement and will post an update as soon as the fix rolls out.

adeturner commented 8 months ago

Looking forward to the fix. My 10-second example below is on the consumption plan with the HTTP scaler: 13:13:37 to 13:13:47. The app had been started previously, so the assumption is that the image was already cached locally?

ContainerAppConsoleLogs_CL
| where RevisionName_s == "aca-weu-example--q5iwjne"

2023-12-20T13:13:47.8338291Z listening on port 80

ContainerAppSystemLogs_CL
| where RevisionName_s == "aca-weu-example--q5iwjne"

2023-12-20 13:19:07 +0000 UTC Deactivated apps/v1.Deployment k8se-apps/aca-weu-example--q5iwjne from 1 to 0
2023-12-20 13:19:07 +0000 UTC Stopping container examplecontainerapp
2023-12-20 13:19:07 +0000 UTC Pulling image "mcr.microsoft.com/azuredocs/containerapps-helloworld:latest"
2023-12-20 13:13:40 +0000 UTC Successfully pulled image "mcr.microsoft.com/azuredocs/containerapps-helloworld:latest" in 54.761604ms (54.768045ms including waiting)
2023-12-20 13:13:40 +0000 UTC Created container examplecontainerapp
2023-12-20 13:13:40 +0000 UTC Started container examplecontainerapp
2023-12-20 13:13:40 +0000 UTC Replica 'aca-weu-example--q5iwjne-7fb6fdd858-wdr69' has been scheduled to run on a node.
2023-12-20 13:13:37 +0000 UTC readiness probe failed: connection refused

omni-htg commented 7 months ago

Hello team! Has there been any good news regarding this fix?

Thanks!

fengwusheng commented 7 months ago

Google Cloud Run can run a container in its Gen1 execution environment to make cold starts fast: the container is ready to use in about 2 seconds. We do not always deploy a newer version, so Azure should not have to pull the image every time; make this an option such as a "manual version pull" to make cold starts easier and faster, or an option like Gen1/Gen2. The most important thing is faster cold starts: a container with a small image should be ready in 2-3 seconds.

mluiten commented 6 months ago

Hi @howang-ms -- any updates on the improvement two months later? Seeing the same: a small 50 MB native Docker image, pull time of 1-2 seconds, but total time to the first response somewhere between 15-25 seconds. I love the premise of "scale to zero", but this makes it practically unusable, even with small, fast images and runtimes.

tbaroti commented 5 months ago

Hi @howang-ms, @SophCarp, are there any updates here? As @mluiten mentioned before, the ability to scale to zero (one of the main advertised features of ACA) is not very useful if a cold start of 20-30 seconds is expected even for the smallest workloads. Even if we can't expect a quick solution, some update would be highly appreciated. Thank you!

03eltond commented 5 months ago

This is a fairly serious concern for us as well. With scale to zero, it can actually take well over a minute for us, since our web server, once it finally starts, calls out to other backend containers that then have to spin up too. Add the 2-3 minutes of node startup time on dedicated workloads and it's a pretty serious concern. Even for production loads, waiting 20-30 seconds for containers while scaling is hurting our load-test results badly compared to on-prem. We don't expect parity with on-prem, since those servers are always on, but we do expect to meet demand in a reasonably timely manner when traffic bursts.

zyofeng commented 4 months ago

The newly introduced workload profiles unfortunately do not address this, which is seriously restricting the adoption of Container Apps to basically asynchronous workloads (messaging, background jobs, etc.).

mluiten commented 3 months ago

@howang-ms so, a few more months have passed without any additional information -- please let us know if you've put this on the back burner, or if no further improvements are possible, so we can start looking for other ways to improve the situation (for example, we're looking at whether we can call a fire-and-forget "ping" endpoint from the loading screen of our app). Either way, I would love some communication about this; the silence is deafening.
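
For anyone weighing the same workaround: the "ping from the loading screen" idea is just a fire-and-forget request sent as early as possible, so the replica is already warming while the rest of the page loads. A browser-side sketch; API_BASE and /warmup are hypothetical placeholders for your own backend:

```typescript
// warmup.ts -- browser-side sketch: fire-and-forget ping to wake a scaled-to-zero backend early.
const API_BASE = "https://<your-app>.azurecontainerapps.io"; // placeholder

export function warmUpBackend(): void {
  // Deliberately not awaited: the goal is only to trigger the scale-out, not to use the response.
  fetch(`${API_BASE}/warmup`, { method: "GET", keepalive: true, cache: "no-store" }).catch(() => {
    /* best-effort; ignore failures */
  });
}

// e.g. call it from the loading screen, before the app requests any real data:
warmUpBackend();
```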

timbodeit commented 2 months ago

@torosent @microsoftopensource Hong Wang doesn't seem to have responded to the question of whether any update is available for several months now. Is there anyone else you could refer us to who would be able to provide an update on where this issue stands?

@howang-ms mentioned you were working on an improvement. However I couldn't find a corresponding entry on the roadmap.

Fully agree with @mluiten. Scale to zero is practically unusable right now for any service that users interact with directly. Not knowing what sort of performance improvements we can expect, or how long it will take for them to ship, makes it hard to make an informed business decision about whether Azure Container Apps is the right fit for our projects or whether we should focus on alternatives like Google Cloud Run instead.

cachai2 commented 2 months ago

Hi folks, thank you for your patience and for the continued feedback on this issue. We understand the importance of reducing cold start times for your applications and feel your pain. The changes that Hong had mentioned should have made a ~7s improvement, but we are investigating this issue and the remaining delta. We will provide an update once the investigation is complete. We do have more improvements in the pipeline for both cold start and the overall performance of the platform that we will be prioritizing. I do want to set expectations accordingly, though, and ask that you continue to be patient: the types of changes we need to make to improve cold start are not small work items and will take time to implement and roll out. We can't currently commit to timelines, and major improvements will most likely not occur in the near term.

In the near term, for workloads that need to be much more responsive when scaling out today, we recommend configuring the application's minimum replica count to 1 or greater. This ensures your apps are always ready to go, and when inactive these replicas are billed at much lower idle rates per our pricing page. Workarounds like the one @mluiten mentioned, pinging the app's endpoint, will also work. However, due to how billing works for Azure Container Apps, this may cost more than setting a minimum replica count: pinging the application makes it active (active pricing), whereas a minimum replica that hasn't received traffic is billed at the inactive rate.

As folks have expressed on this thread, we can do a better job as a team of keeping you up to date on the performance improvements we are making. I've set up a recurring reminder on my calendar to update this thread when we have made major improvements.

Finally, please continue to provide feedback here on your specific scenarios and the cold start times you are running into. The details from the specific tests done by @MaksymShuldiner, @pollaktamas, @adeturner, and others on the thread are valuable for our understanding of the types of scenarios and workloads that are impacted. We are aiming to make general cold start improvements, but if there are specific changes we can make for certain high-demand workloads, your feedback will be invaluable for prioritizing. It will also help keep us honest: if the improvements you see in your production applications don't match the scale of improvements we claim, we want to know. If folks want to provide more direct feedback and discuss, I'd also be happy to chat about performance on Azure Container Apps. Feel free to reach out to me at cachai [at] microsoft [dot] com with details on your scenarios.
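
For reference, the minimum-replica recommendation above can also be applied programmatically. A rough sketch, assuming the @azure/arm-appcontainers and @azure/identity packages; the subscription, resource group, and app names are placeholders, and the get-then-createOrUpdate round trip is just one way to change the scale settings:

```typescript
// set-min-replicas.ts -- sketch: keep at least one replica warm by raising minReplicas to 1.
import { DefaultAzureCredential } from "@azure/identity";
import { ContainerAppsAPIClient } from "@azure/arm-appcontainers";

const subscriptionId = "<subscription-id>"; // placeholder
const resourceGroup = "<resource-group>";   // placeholder
const appName = "<container-app-name>";     // placeholder

async function main(): Promise<void> {
  const client = new ContainerAppsAPIClient(new DefaultAzureCredential(), subscriptionId);

  // Fetch the current app definition, raise the scale floor, and write it back.
  const app = await client.containerApps.get(resourceGroup, appName);
  app.template = app.template ?? {};
  app.template.scale = { ...app.template.scale, minReplicas: 1 };

  await client.containerApps.beginCreateOrUpdateAndWait(resourceGroup, appName, app);
  console.log(`minReplicas set to 1 for ${appName}`);
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

Keep the billing trade-off from the comment above in mind: an idle minimum replica is billed at the lower idle rate, while pinging keeps the app in active pricing.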

cachai2 commented 2 months ago

Hi folks, providing a quick update as mentioned above. We've completed our investigations. The following numbers are for the quickstart image scenario without managed identity. We've seen a ~7s improvement at the P50 based on the changes Hong mentioned, from ~22s to ~15s. The P95 after those improvements was ~20s. In the last two weeks, we also identified a quick fix that we have since implemented, which has further improved these numbers to ~10s at the P50 and ~15s at the P95. Feel free to validate these numbers on your own and let us know if you are seeing any differences. Please note that these numbers are based on the quickstart image, so depending on your images, dependencies, etc., you will most likely see higher cold start numbers for your actual app workloads.

We do have further improvements we are working on. However, as mentioned in my previous message, these changes will be further out, and we don't currently have an ETA on when they will land. In the meantime, please continue to provide feedback on this area, and let us know what cold start numbers you are seeing for your scenarios. Thank you!
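
To sanity-check numbers like these against your own workload, repeated trials plus a simple percentile calculation are enough. A sketch below (TypeScript, Node 18+); the target URL, trial count, and idle wait are placeholders you would tune so the app really scales back to zero between measurements:

```typescript
// cold-start-percentiles.ts -- sketch: repeat the cold-start measurement and report P50/P95.
const APP_URL = "https://<your-app>.azurecontainerapps.io/"; // placeholder
const TRIALS = 10;
const IDLE_MS = 10 * 60 * 1000; // assumed long enough for scale-to-zero between trials; tune it

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// nearest-rank percentile over an ascending-sorted array of samples
function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.max(0, Math.ceil((p / 100) * sorted.length) - 1));
  return sorted[idx];
}

async function main(): Promise<void> {
  const samples: number[] = [];
  for (let i = 0; i < TRIALS; i++) {
    const start = performance.now();
    const res = await fetch(APP_URL);
    await res.text();
    const seconds = (performance.now() - start) / 1000;
    samples.push(seconds);
    console.log(`trial ${i + 1}: ${seconds.toFixed(1)}s (status ${res.status})`);
    if (i < TRIALS - 1) await sleep(IDLE_MS); // let the app deactivate again
  }
  samples.sort((a, b) => a - b);
  console.log(`P50=${percentile(samples, 50).toFixed(1)}s  P95=${percentile(samples, 95).toFixed(1)}s`);
}

main().catch(console.error);
```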

GibreelAbdullah commented 1 month ago

The cold start time today is about 12s for a Node app. The same image is deployed in Cloud Run, where the cold start time remains around 2-3s. Reading all the comments, there has been significant progress (from 25s to 12s), but much more improvement is still needed.

ctigrisht commented 1 month ago

I wanted to stay on Azure, but this makes it impossible. GCP Cloud Run returns a response within 2 seconds on a cold start; the last time I checked, Azure took 14-18 seconds for a Node.js image and a bit more for an ASP.NET image.