microsoft / azure-container-apps

Roadmap and issues for Azure Container Apps
MIT License

ACA networking with workload profile consumption-only workload is incompatible with Microsoft Orleans #1119

Open onionhammer opened 3 months ago

onionhammer commented 3 months ago

Issue description

I have observed that when deploying Orleans to a cluster of 2 or more silos, client apps have a roughly 50/50 shot of being able to communicate with the target node.

Workload profiles:

Non-workload profiles

This seems to be related to https://github.com/microsoft/azure-container-apps/issues/721

Steps to reproduce

  1. See https://github.com/onionhammer/orleans-aca-repro

Expected behavior

Orleans should work by default.

Actual behavior

Orleans only works about 50% of the time; the rest of the time, grain/actor invocations time out and clients are unable to communicate with silos.

Screenshots

image

In the above screenshot:

Additional context

After running the repro for nearly a full day

Workload profiles: image

No workload profiles: image

Greedygre commented 3 months ago

Hi @onionhammer

Thanks for reporting this issue. I am investigating. May I ask what tool you are using in this picture? image

Thanks!

onionhammer commented 3 months ago

@Greedygre that's an Application Insights availability test hitting an ASP.NET health check
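
For readers without Application Insights, the same kind of measurement can be approximated with a small polling script. The following is only a minimal sketch, not the tool used in this thread: it assumes the web app exposes an HTTP health check endpoint, and the names `check_once`, `run_probe`, and `availability` are hypothetical helpers introduced here for illustration.

```python
import http.client
import time
import urllib.request
from typing import Callable, List, Sequence


def check_once(url: str, timeout: float = 5.0) -> bool:
    """Return True if a single GET to the health check URL yields a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError (non-2xx), timeouts and connection resets are all
        # OSError subclasses; treat every one of them as a failed check.
        return False


def run_probe(probe: Callable[[], bool], attempts: int, interval_s: float = 0.0) -> List[bool]:
    """Call `probe` `attempts` times, sleeping `interval_s` seconds between calls."""
    results: List[bool] = []
    for i in range(attempts):
        results.append(probe())
        if interval_s > 0 and i < attempts - 1:
            time.sleep(interval_s)
    return results


def availability(results: Sequence[bool]) -> float:
    """Fraction of successful checks (0.0 when there are no results)."""
    return sum(results) / len(results) if results else 0.0


# Example usage (the URL is a placeholder, not an endpoint from this thread):
# results = run_probe(lambda: check_once("https://example.invalid/health"),
#                     attempts=60, interval_s=60.0)
# print(f"availability: {availability(results):.0%}")
```

Because the failures reported in this issue can take several hours to appear, a probe like this would need to run over a long window before the success rate is meaningful.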

howang-ms commented 3 months ago

Hi @onionhammer, we tried but were not able to repro this issue with the same setup you used. However, we did find the ACA environment you used for the testing, and it seems the network between web and one of the silo pods is unhealthy. The connection between web and silo should in theory be a long-lived connection, but we see the connection getting interrupted many times. It seems to be a special case rather than a common issue in the Consumption workload profile. However, we will need more information (probably a network capture) to root-cause this.

Can you try to repro this issue again? If you can repro this issue, please keep the environment and app, then drop an email to "acasupport at microsoft dot com", we will investigate ASAP.

onionhammer commented 3 months ago

Hi @howang-ms

I've reproduced the issue and emailed acasupport - as you can see, the issue doesn't start happening immediately, but can take several hours to show up.

image

onionhammer commented 3 months ago

Hi @howang-ms any update?