microsoft / service-fabric-issues

This repo is for the reporting of issues found with Azure Service Fabric.

Reverse Proxy Health Probe Support #607

Closed lancelind closed 5 years ago

lancelind commented 7 years ago

The Service Fabric Reverse Proxy currently passes incoming requests to any instance of a stateless service listed in the Naming service, regardless of whether the instance is healthy. We need 'health probe' support in the Reverse Proxy to ensure that only healthy stateless instances receive forwarded traffic.

This feature is akin to other Azure technologies that proxy incoming traffic, like Azure Load Balancer, and can provide real benefit to overall availability.

I would envision the ability to opt-in to instance health checking, and one of the default ways the proxy could determine service instance health would be through the existing Service Fabric health model.

lukepuplett commented 5 years ago

Good call. We would also like this feature. We're currently writing ASP.NET middleware to respond 503 if the SF health for the recipient microservice is bad.
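
The middleware idea above can be sketched in a few lines. This is only an illustration in Python (WSGI), not the ASP.NET middleware the commenter describes; `health_gate` and `is_healthy` are hypothetical names, and how `is_healthy` obtains the Service Fabric health state is left as an assumption.

```python
from typing import Callable, Iterable


def health_gate(app, is_healthy: Callable[[], bool]):
    """Wrap a WSGI app so it answers 503 while is_healthy() returns False.

    is_healthy is a hypothetical callable; in the scenario from the thread it
    would reflect the health state reported for this service instance.
    """
    def wrapped(environ, start_response) -> Iterable[bytes]:
        if not is_healthy():
            body = b"Service Unavailable"
            start_response("503 Service Unavailable", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Healthy: hand the request to the real application unchanged.
        return app(environ, start_response)
    return wrapped
```

Because the check runs per request, the instance starts answering 503 as soon as the health callable flips, without restarting the service.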

aljo-microsoft commented 5 years ago

@lukepuplett and @lancelind

Lighting up a redundant feature of your LB wouldn't solve your problem.

Please review the following:
https://docs.microsoft.com/azure/service-fabric/service-fabric-diagnostics-event-generation-app
https://docs.microsoft.com/azure/service-fabric/service-fabric-health-introduction
https://docs.microsoft.com/azure/load-balancer/load-balancer-custom-probe-overview
https://azure.microsoft.com/resources/samples/service-fabric-watchdog-service/

As you've stated, this feature request is redundant with Azure Load Balancer probe functionality; use probes in Azure to ensure traffic is routed only to healthy services. If on-prem, configure your LB to route traffic only to healthy services.

Also, it sounds like you should review your telemetry and ensure you are accurately modeling your service's health, as it isn't correct for your service to be available but unhealthy.

For example, you can use a watchdog service to collect custom health telemetry from your services and report to your SF cluster that a service is unhealthy, while ensuring your LB is configured with appropriate rules to determine whether the service is available to accept traffic.

In other words, a simple LB rule that checks whether a service endpoint is discoverable or returns some payload isn't sufficient to catch implementation faults that don't cause the service to terminate or to return a payload that implies it is unhealthy. These experience probes are only one part of what you should be measuring. You also have infrastructure performance to measure and internal service implementation to monitor before your service can be determined available and healthy.

lukepuplett commented 5 years ago

So we have a custom watchdog and probes that run in the background of our service. One probe checks that a Mongo database is available and the other checks RabbitMQ. Without the DB or the MQ, the microservice instance is useless, so we mark it as Error.

We'd expect the Reverse Proxy to stop routing traffic to this instance, and if all instances or the whole service is in Error, give a 503.
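
The behaviour being asked for can be stated as a small routing rule. The sketch below is a model of the request, not of anything the Reverse Proxy does today; `Instance`, `pick_instance`, and `route` are hypothetical names, and the health strings simply mirror the Ok/Warning/Error states of the Service Fabric health model.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    address: str
    health: str  # "Ok", "Warning", or "Error"


def pick_instance(instances, rng=random):
    """Return one instance that is not in Error, or None if none qualifies."""
    candidates = [i for i in instances if i.health != "Error"]
    return rng.choice(candidates) if candidates else None


def route(instances):
    """Decide what to do with a request.

    Returns ("forward", address) when some instance can take traffic,
    or ("reject", 503) when every instance is in Error.
    """
    target = pick_instance(instances)
    if target is None:
        return ("reject", 503)
    return ("forward", target.address)
```

Treating Warning as routable is an assumption made here for illustration; a stricter policy could forward only to instances reporting Ok.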

It's important to note that a) we are on-prem and b) only a single service is exposed to our public internet load balancer; all other services communicate via the Reverse Proxy service.

It seems reasonable to us that the RP service should have the smart balancing behaviour described above; in fact, we were surprised when we discovered that it didn't.

Perhaps we're misunderstanding the design ethos of the health model. To be honest, it's very complicated when combined with the complexity of the various SF usage modes, so the documentation is necessarily lengthy; it tends to overwhelm my short-term memory and I have to re-read it over and over.

santo2 commented 5 years ago

Agree with Luke above. What is a reasonable solution for this? Is there a way to let the Reverse Proxy try the next node when one is not available? For example, for a Web API, on 404 and 503 it tries another node to see if it can get an answer there. Is there a similar solution for non-Web API services?
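
The "try the next node" idea can also be approximated on the caller's side. The following is a minimal sketch under assumptions: `call_with_failover` is a hypothetical helper, `send` stands in for whatever transport the caller uses and is assumed to return a `(status, body)` pair, and the retryable statuses are the 404 and 503 mentioned above.

```python
def call_with_failover(endpoints, send, retry_statuses=(404, 503)):
    """Try each endpoint in order and return the first non-retryable response.

    send(endpoint) is a hypothetical transport call returning (status, body).
    If every endpoint answers with a retryable status, the last such response
    is returned; with no endpoints at all, a synthetic 503 is returned.
    """
    last = None
    for endpoint in endpoints:
        status, body = send(endpoint)
        if status not in retry_statuses:
            return status, body
        last = (status, body)
    return last if last is not None else (503, b"")
```

This kind of retry is only safe for requests that can be repeated without side effects, so it would need care for non-idempotent calls.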

lukeholbertmsft commented 5 years ago

Anyone here get a proper response or solution for this issue, @lukepuplett, @lancelind, @santo2? I have opened a new issue because we are having this problem as well.

ojasp commented 5 years ago

@lukeholbertmsft Did you get any response or know if this can be solved any other way?

lukeholbertmsft commented 5 years ago

@ojasp The response I got was to utilize OnOpenAsync to perform any initialization logic for the service. The URL will not be registered, and therefore the service cannot be hit, until this method completes. I don't believe they have any immediate plans to support the reverse proxy health ask here.
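
The ordering described in that response, where initialization finishes before the address is published, can be modelled roughly as follows. This is an asyncio sketch in Python, not the Service Fabric API; `Listener`, `open_async`, `register`, and the `registry` list are all hypothetical stand-ins.

```python
import asyncio


class Listener:
    """Hypothetical listener whose open step runs initialization first."""

    def __init__(self, address, init_steps):
        self.address = address
        self.init_steps = init_steps  # async callables, e.g. dependency checks

    async def open_async(self):
        # Run every initialization step; any exception aborts the open.
        for step in self.init_steps:
            await step()
        return self.address


async def register(listener, registry):
    """Publish the address only after open_async has completed."""
    address = await listener.open_async()
    registry.append(address)
```

Note that this gates startup only. It would not take an instance out of rotation when a dependency fails later, which is the situation the earlier comments in the thread describe.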