microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.
https://docs.microsoft.com/en-us/azure/service-fabric/

Respect Service Health from Reverse Proxy #577

Open lukeholbertmsft opened 5 years ago

lukeholbertmsft commented 5 years ago

We are using the Service Fabric Reverse Proxy to make calls to stateless Service Fabric services, but we are seeing traffic routed to services during deployments while they are still in an error state, before they are marked as healthy. This queues up requests and prevents the services from starting successfully. We would like some clarity on why the Reverse Proxy does not respect the health of our services, and whether there is a workaround to route traffic only to healthy instances. Previous issues on this topic have been closed without a proper explanation: https://github.com/Azure/service-fabric-issues/issues/607

lukepuplett commented 5 years ago

Our team also finds this behaviour very strange. We fully expected the SF Reverse Proxy to be aware of SF health and intelligent in this regard.

The explanation alongside the closure in microsoft/service-fabric-issues#607 seems to miss the point.

It seems such an obvious feature that we would like to know if perhaps we're all missing something.

ojasp commented 5 years ago

Did you guys hear anything from the SF team?

santo2 commented 5 years ago

same guys in each post, wanting the same thing, but no answer :)

ojasp commented 5 years ago

@santo2 We have given up on SF Reverse Proxy and are going with Traefik, since it has a way to configure health checks, which is important for us.
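
For reference, a Traefik health check is set up in its dynamic configuration along these lines; a minimal sketch assuming the Traefik v2 file provider, with the service name, server URLs, and `/health` path purely illustrative:

```yaml
http:
  services:
    my-stateless-service:               # hypothetical service name
      loadBalancer:
        servers:
          - url: "http://10.0.0.4:8080" # illustrative instance endpoints
          - url: "http://10.0.0.5:8080"
        healthCheck:
          path: /health                 # hypothetical health endpoint on each instance
          interval: "10s"               # probe every 10 seconds
          timeout: "3s"                 # consider the probe failed after 3 seconds
```

Traefik takes a server out of rotation while its probe is failing, which is exactly the behaviour missing from the SF Reverse Proxy.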

masnider commented 4 years ago

Yes, today nothing in SF's naming/name resolution mechanism looks at health. Health is primarily used to drive the safety of upgrades.

There is a pending feature on our side to take health into consideration when returning naming addresses. The reason this does not exist to date is that historically either (a) calls were coming into services via something like the Azure NLB (or Traefik :) ), which had its own notion of health probes, or (b) the communication was local within the cluster, and people followed our guidance that a service that determines it cannot accept calls should take itself down (ReportFault), which triggers an address change and removes the old address from the address table.
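
For concreteness, that ReportFault guidance looks roughly like this in a .NET stateless service; a minimal sketch assuming the Microsoft.ServiceFabric.Services SDK, where the OnUnableToServeRequests trigger is hypothetical:

```csharp
using System.Fabric;                            // StatelessServiceContext, FaultType
using Microsoft.ServiceFabric.Services.Runtime; // StatelessService

internal sealed class MyStatelessService : StatelessService
{
    public MyStatelessService(StatelessServiceContext context)
        : base(context)
    {
    }

    // Hypothetical hook: imagine the service's own monitoring loop calls this
    // when it decides it can no longer accept calls.
    private void OnUnableToServeRequests()
    {
        // Reporting a transient fault removes this instance's endpoint from the
        // naming/address table and restarts the instance, so resolvers
        // (including the Reverse Proxy) stop handing out the stale address.
        this.Partition.ReportFault(FaultType.Transient);
    }
}
```

Because the fault removes the endpoint from the address table, anything that resolves through naming, including the Reverse Proxy, stops receiving the stale address on its next resolve.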

It is apparently more common now to leave services up and running, but degraded, and to expect the network or orchestrator to be smarter about deciding whether or not to expose that service. We understand this need and desire, but it's not how SF has ever worked, so we are taking our time to expose this capability properly at the lower levels, such that if a service wishes to remove itself from the address list it can do so. The capability would then propagate up automatically to other layers such as the Reverse Proxy, Service Proxy, DNS, etc.