lukeholbertmsft opened this issue 5 years ago
Our team also finds this behaviour very strange. We fully expected the SF Reverse Proxy to be aware of SF health and intelligent in this regard.
The explanation alongside the closure in microsoft/service-fabric-issues#607 seems to miss the point.
It seems such an obvious feature that we would like to know if perhaps we're all missing something.
Did you guys hear anything from the SF team?
same guys in each post, wanting the same thing, but no answer :)
@santo2 We have given up on the SF Reverse Proxy and are going with Traefik instead, since it has a way to configure health checks, which is important for us.
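For reference, this is the general shape of what an external proxy or load balancer (Traefik, the Azure NLB, etc.) probes: a health endpoint that only returns 200 once the service is actually ready to take traffic. This is a minimal sketch, not anyone's actual service code; the `/health` path, port, and readiness flag are illustrative assumptions.

```python
# Minimal sketch of a health endpoint an external proxy can probe.
# The /health path, port 8080, and the readiness flag are assumptions for the example.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

ready = threading.Event()  # set once startup/warm-up work has finished


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # 200 only when the service can actually take traffic; 503 otherwise,
            # so the proxy keeps this instance out of rotation during startup.
            status = 200 if ready.is_set() else 503
            self.send_response(status)
            self.end_headers()
            self.wfile.write(b"OK" if status == 200 else b"starting")
        else:
            self.send_response(404)
            self.end_headers()


def serve_health(port: int = 8080) -> None:
    HTTPServer(("", port), HealthHandler).serve_forever()


if __name__ == "__main__":
    threading.Thread(target=serve_health, daemon=True).start()
    # ... perform startup/warm-up work here, then signal readiness:
    ready.set()
    threading.Event().wait()  # keep the process alive for the sketch
```

The point is simply that the proxy, not the service, decides when to start routing, based on a probe the service controls.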
Yes, today nothing in SF's naming/name resolution mechanism looks at health. Health is primarily used to drive the safety of upgrades.
There is a pending feature on our side to take health into consideration when returning naming addresses. The reason this does not exist to date is that historically either a) calls came into services via something like the Azure NLB (or Traefik :) ), which had its own notion of health probes, or b) the communication was local within the cluster and people followed our guidance: if a service determined it could not accept calls, it should go down (ReportFault), which would trigger an address change and remove the old address from the address table.
It is apparently more common now to leave services up and running, but degraded, and to expect the network or orchestrator to be smarter about deciding whether or not to expose that service. We understand this need and desire, but it's not how SF has ever worked, so we are taking our time to expose this capability properly at the lower levels, such that if a service wishes to remove itself from the address list it can do so. The capability would then propagate up automatically to other layers such as the Reverse Proxy, Service Proxy, DNS, etc.
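As an illustration only (this is not how the SF Reverse Proxy is implemented), the kind of health-aware routing being asked for looks roughly like the sketch below: a background loop probes each resolved endpoint and the router only hands out endpoints that last probed healthy. The endpoint URLs, `/health` path, probe interval, and the fail-open fallback are assumptions for the example.

```python
# Illustrative sketch of health-aware endpoint selection at a routing layer.
import random
import threading
import time
import urllib.request


class HealthAwareEndpointPool:
    def __init__(self, endpoints, probe_path="/health", interval=10.0):
        self._endpoints = list(endpoints)
        self._healthy = set(self._endpoints)  # optimistic until the first probe completes
        self._lock = threading.Lock()
        self._probe_path = probe_path
        self._interval = interval
        threading.Thread(target=self._probe_loop, daemon=True).start()

    def _probe_once(self, endpoint: str) -> bool:
        try:
            with urllib.request.urlopen(endpoint + self._probe_path, timeout=3) as resp:
                return resp.status == 200
        except Exception:
            return False

    def _probe_loop(self) -> None:
        while True:
            results = {ep for ep in self._endpoints if self._probe_once(ep)}
            with self._lock:
                self._healthy = results
            time.sleep(self._interval)

    def pick(self) -> str:
        # Route only to endpoints that passed the last probe; fall back to the
        # full list if everything looks unhealthy (fail-open, a design choice).
        with self._lock:
            candidates = list(self._healthy) or self._endpoints
        return random.choice(candidates)


# Usage (hypothetical addresses):
# pool = HealthAwareEndpointPool(["http://10.0.0.4:8080", "http://10.0.0.5:8080"])
# target = pool.pick()
```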
We are using the Service Fabric Reverse Proxy to make calls to stateless Service Fabric services, but we are seeing that traffic comes into services during deployments while they are in an error state, before they are marked as healthy. This queues up requests and prevents the services from starting up successfully. We would like some clarity on why the Reverse Proxy does not respect the health of our services, and whether there is a workaround to route traffic only to healthy instances. Previous issues on this topic have been closed without proper explanation: https://github.com/Azure/service-fabric-issues/issues/607