microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.
https://docs.microsoft.com/en-us/azure/service-fabric/
MIT License

[BUG] Behaviour of InstanceCount poorly documented #1467

Open asos-robbell opened 1 year ago

asos-robbell commented 1 year ago

Describe the bug: It's very unclear from the documentation what the behaviour of InstanceCount is. If I set InstanceCount=1, does that mean 'at least one instance' or 'no more than one instance'? The documentation I've seen doesn't say, and I have a scenario where I want at most one instance of my application running at any one time.

Area/Component: Placement

To Reproduce (steps to reproduce the behavior):

  1. Set InstanceCount=1

Expected behavior: Unknown

Observed behavior: Unclear


Service Fabric Runtime Version: ex: 7.1., 7.2.

Environment:

If this is a regression, which version did it regress from?

Additional context: I've raised this on StackOverflow, where it was met with similar uncertainty. This behaviour needs to be documented:

https://stackoverflow.com/questions/77186214/does-instancecount-1-in-servicefabric-mean-at-least-one-instance-or-no-more-t/77254680


Assignees: /cc @microsoft/service-fabric-triage

flower7434 commented 1 year ago

"...and I have a scenario where I want at most one instance of my application running at any one time." You can never trust SF to have a single instance of anything. So, no, that design will never fly on SF.

mfmadsen commented 1 year ago

Yes, SF will do its best to honor your intention of a single instance, but for SF the availability of your service matters most. When a service instance is about to be moved (for example, because a node is being disabled for maintenance), SF will most likely spin up a new instance on another node BEFORE closing the instance on the node being deactivated. In that sense it means 'at least one instance' (and, under normal conditions, 'no more than one').
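The ordering described above (start the replacement first, close the old instance second) can be sketched as a toy model. This is not the Service Fabric API; `move_instance` and the node names are hypothetical, and the sketch only illustrates why InstanceCount=1 can briefly mean two live instances during a move.

```python
# Toy model only: NOT the Service Fabric API. It illustrates the
# "start the replacement before closing the old instance" ordering.

def move_instance(running, source, target):
    """Move an instance from source to target, make-before-break.

    Returns the number of live instances observed after each step.
    """
    counts = [len(running)]
    running.add(target)       # the new instance starts on the target node first
    counts.append(len(running))
    running.discard(source)   # only then is the old instance closed
    counts.append(len(running))
    return counts


nodes_running = {"node-1"}    # steady state with InstanceCount=1
observed = move_instance(nodes_running, "node-1", "node-2")
print(observed)               # [1, 2, 1]
```

The middle value is the overlap window: any code that assumes "never more than one instance" has to tolerate it.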

asos-robbell commented 1 year ago

Thanks, @mfmadsen and @FredrikDahlberg. Is there any way to guarantee essentially a singleton service in Service Fabric, e.g. never more than one instance?

flower7434 commented 1 year ago

"Thanks, @mfmadsen and @FredrikDahlberg. Is there any way to guarantee essentially a singleton service in Service Fabric, e.g. never more than one instance?"

I believe we have some locks in a stateful service that we use to block reentrancy. Maybe something similar can be used.
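A minimal sketch of that lock idea, assuming the lock is kept as a lease with an expiry time. Everything here is hypothetical: `Lease` is an in-memory stand-in for whatever durable state a stateful service would actually use, and `run_if_singleton` is an illustrative helper, not a Service Fabric call.

```python
import time


class Lease:
    """In-memory stand-in for a lock kept in durable state (hypothetical)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, owner):
        now = self.clock()
        # Free, expired, or already ours: take (or renew) the lease.
        if self.holder is None or now >= self.expires_at or self.holder == owner:
            self.holder = owner
            self.expires_at = now + self.ttl
            return True
        return False

    def release(self, owner):
        # Only the current holder may release the lease.
        if self.holder == owner:
            self.holder = None


def run_if_singleton(lease, owner, work):
    """Run work() only if owner obtains the lease; otherwise skip it."""
    if not lease.try_acquire(owner):
        return None
    try:
        return work()
    finally:
        lease.release(owner)


# Usage with a fake clock so the expiry can be shown deterministically.
fake_now = [0.0]
lease = Lease(ttl_seconds=30, clock=lambda: fake_now[0])
first = lease.try_acquire("instance-a")    # instance-a takes the lease
second = lease.try_acquire("instance-b")   # refused while instance-a holds it
fake_now[0] = 31.0                         # instance-a never renewed
third = lease.try_acquire("instance-b")    # lease expired, instance-b takes over
```

A lease narrows the overlap window rather than removing it: a holder that stalls past its expiry can still be mid-operation when another instance takes over.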

flower7434 commented 1 year ago

"You can NEVER guarantee a singleton service, and despite the documentation, you can't even be assured that an instance of an actor won't be running concurrently. If Service Fabric moves the actor service, it does not wait for the existing actors to shut down. They continue running but are no longer primary, so they won't be able to update state."

Correct, but even more importantly, Service Fabric does not guarantee that ANY instance of a service is running, no matter what you set InstanceCount to. If you have 0 instances, increasing it does not help; it will still be 0.
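The quoted point in this comment, that an instance which is no longer primary keeps running but cannot update state, resembles write fencing. The sketch below is a toy model under that assumption, not a description of how Service Fabric implements it; `FencedStore`, `StaleWriterError`, and the epoch numbers are all hypothetical.

```python
class StaleWriterError(Exception):
    """Raised when a writer from an older epoch tries to update state."""


class FencedStore:
    """Toy state store that accepts writes only from the current epoch."""

    def __init__(self):
        self.epoch = 0
        self.data = {}

    def promote_new_primary(self):
        # Each promotion bumps the epoch; older writers become stale.
        self.epoch += 1
        return self.epoch

    def write(self, writer_epoch, key, value):
        if writer_epoch != self.epoch:
            raise StaleWriterError(
                f"epoch {writer_epoch} is not current ({self.epoch})"
            )
        self.data[key] = value


store = FencedStore()
old_primary = store.promote_new_primary()   # epoch 1
store.write(old_primary, "count", 1)        # accepted
new_primary = store.promote_new_primary()   # epoch 2, e.g. after a move

try:
    store.write(old_primary, "count", 99)   # the old instance is still running
    rejected = False
except StaleWriterError:
    rejected = True

store.write(new_primary, "count", 2)        # only the new primary can write
```

Fencing of this kind protects the state, not external side effects: the stale instance can still call other systems while it runs.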

flower7434 commented 1 year ago

"Correct, but even more importantly, Service Fabric does not guarantee that ANY instance of a service is running, no matter what you set InstanceCount to. If you have 0 instances, increasing it does not help; it will still be 0."

"I'm not sure what you mean by this. In my experience, SF has always created the services you configure it to create. If it didn't, I would have abandoned it ages ago."

It is not often, but it happens a couple of times per year for us. Typically, SF tries to create an instance but fails, sometimes because it has not reloaded the config or something similar, and usually after a deployment. The rollout logic seems to be wrong: even if something fails, it just continues with the next nodes. I would be much happier if it were possible to restart a service, to deploy a service, or at least to deploy the same version of an app again. The workaround now is to rebuild the app and then deploy it, which may take over an hour. I have never managed to get a failed service running again without deploying a new version.

JohnNilsson commented 1 year ago

@FredrikDahlberg we've been running SF for six years now. Never saw that behaviour. My guess is you've missed something about how upgrades work.

Things to check: