microsoft / service-fabric

Service Fabric is a distributed systems platform for packaging, deploying, and managing stateless and stateful distributed applications and containers at large scale.
https://docs.microsoft.com/en-us/azure/service-fabric/
MIT License

Majority Actors all being loaded onto one node #487

Open BMGDigitalTech opened 4 years ago

BMGDigitalTech commented 4 years ago

I have an actor service running on 32 nodes. I keep noticing that one node is running at 100% CPU while the rest are all running under 10%. I created a SQL logging database to see which actors run on each node and how long they take to process, to try to figure out what is going on. What I found is that most of the actors are running on this one node. Why? Why are they not being rebalanced when the load on the node goes up? I thought this was one of Service Fabric's capabilities. All the nodes have 10 cores, 24GB of RAM, and 90GB of disk space on NVMe drives. Below is a snapshot of the actor spread over the last 30 minutes. Is this a bug? Is there some way I can force the actors to run on other nodes, instead of letting SF decide which node each actor should run on?

actor instances | node
--- | ---
1244 | SF01
1542 | SF02
1486 | SF03
1650 | SF04
1462 | SF05
1666 | SF06
2371 | SF07
2663 | SF08
2572 | SF09
1831 | SF10
1987 | SF11
1896 | SF12
1526 | SF13
1449 | SF14
1960 | SF15
2119 | SF16
2901 | SF17
1766 | SF18
1220 | SF19
1427 | SF20
1363 | SF21
1337 | SF22
2004 | SF23
1371 | SF24
1501 | SF25
1731 | SF26
875 | SF27
427 | SF28
1278 | SF29
1519 | SF30
27944 | SF31
687 | SF32
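To put a number on the skew, here is a minimal Python sketch that takes the counts from the table above and computes the busiest node and its share of all actor instances. The variable names are illustrative only; nothing here is a Service Fabric API.

```python
from statistics import median

# Actor-instance counts per node over 30 minutes, copied from the table above.
counts = {
    "SF01": 1244, "SF02": 1542, "SF03": 1486, "SF04": 1650,
    "SF05": 1462, "SF06": 1666, "SF07": 2371, "SF08": 2663,
    "SF09": 2572, "SF10": 1831, "SF11": 1987, "SF12": 1896,
    "SF13": 1526, "SF14": 1449, "SF15": 1960, "SF16": 2119,
    "SF17": 2901, "SF18": 1766, "SF19": 1220, "SF20": 1427,
    "SF21": 1363, "SF22": 1337, "SF23": 2004, "SF24": 1371,
    "SF25": 1501, "SF26": 1731, "SF27": 875, "SF28": 427,
    "SF29": 1278, "SF30": 1519, "SF31": 27944, "SF32": 687,
}

total = sum(counts.values())
hot_node = max(counts, key=counts.get)   # node with the most actor instances
hot_share = counts[hot_node] / total     # fraction of all instances on it
typical = median(counts.values())        # what a "normal" node carries

print(f"{hot_node}: {counts[hot_node]} of {total} instances ({hot_share:.1%})")
print(f"median per node: {typical}")
```

Comparing the busiest node against the median, rather than the mean, keeps the outlier from distorting the baseline.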

Expected Behavior

Actors would be balanced across all the nodes based on server load

Current Behavior

Actors being piled up on one node

Context (Environment)

Service Fabric Runtime and SDK Version :

6.5.658.9590 | 3.4.664

Operating System :

Windows 2016

Cluster Size :

32 nodes, 10 cores each, 24GB RAM, 90GB HD, on NVME sticks

Possible Workaround

Is there a way to force actors to run on other nodes, instead of letting SF decide which node each actor should run on?

BMGDigitalTech commented 4 years ago

One other thing: each actor executes a Node.js script through Process and waits for information to return. The Node.js process uses puppeteer for various tasks.

JustinKaffenberger commented 4 years ago

How many partitions did you deploy your Actor service with? Service Fabric doesn't balance actors, it balances partitions.

BMGDigitalTech commented 4 years ago

32

So if SF targets a node for an actor to run on, and that node's resources are at 100%, it will not move it?

thanks

JustinKaffenberger commented 4 years ago

I just looked at your list of actor ID/Node pairs and it would seem they are pretty well distributed. I'm actually now confused as to what the problem is, based on the data you posted. What I see is all of the actors ARE distributed across all of the nodes.

BMGDigitalTech commented 4 years ago

That's not an ActorID, that's the number of running actors (actor instances). My ActorIDs are GUIDs, so SF01 had 1244 different actor instances that ran within 30 minutes. SF31 had 27944, and its CPU was hammered at 100% almost the entire time, while the others never went over 10%.

actor instances | node
--- | ---
1244 | SF01
1542 | SF02
1486 | SF03
1650 | SF04
1462 | SF05
1519 | SF30
27944 | SF31
687 | SF32

BMGDigitalTech commented 4 years ago

P.S. I am not using the StateManager at all. I was planning on it, but right now I get what I need from the node process and return it to the calling service.

BMGDigitalTech commented 4 years ago

does anyone have any ideas on this?

JustinKaffenberger commented 4 years ago

I think you need to frame your problem in terms of partitions, rather than nodes. At the end of the day, it's the partitions that are distributed across the nodes. The actors are distributed across the partitions based on the ActorID. So what would be more useful is knowing which partitions are on which nodes, and then which actors are in which partition. Even if you aren't using state management, the partitioning matters.
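The actor -> partition -> node chain described above can be sketched in Python. This is an illustration under stated assumptions, not Service Fabric's implementation: it assumes the service's 32 partitions split the full Int64 key range evenly, `stand_in_key` is a hypothetical hash standing in for whatever Service Fabric actually uses to turn a GUID ActorID into a partition key, and the `placement` map of partitions to nodes is made up.

```python
import hashlib
import random
import uuid
from collections import Counter

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def partition_ranges(count):
    """Split the full Int64 key space into `count` contiguous ranges."""
    size, extra = divmod(INT64_MAX - INT64_MIN + 1, count)
    ranges, start = [], INT64_MIN
    for i in range(count):
        end = start + size + (1 if i < extra else 0) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def stand_in_key(actor_id):
    """Hypothetical GUID -> Int64 hash; NOT Service Fabric's real algorithm."""
    digest = hashlib.sha256(actor_id.bytes).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def partition_of(key, ranges):
    """Return the index of the range that contains `key`."""
    for index, (low, high) in enumerate(ranges):
        if low <= key <= high:
            return index
    raise ValueError("key outside every range")


def actors_per_node(actor_ids, ranges, placement):
    """Count actors per node by going actor -> partition -> node."""
    per_node = Counter()
    for actor_id in actor_ids:
        partition = partition_of(stand_in_key(actor_id), ranges)
        per_node[placement[partition]] += 1
    return per_node


rng = random.Random(0)  # fixed seed so the sketch is repeatable
actor_ids = [uuid.UUID(int=rng.getrandbits(128)) for _ in range(10_000)]
ranges = partition_ranges(32)

# Hypothetical placement: one partition per node, SF01..SF32.
placement = {p: f"SF{p + 1:02d}" for p in range(32)}
per_node = actors_per_node(actor_ids, ranges, placement)
print(per_node.most_common(3))
```

In this sketch a node's actor count is simply the sum over the partitions it hosts, so logging the partition alongside the node for each actor call would show whether one node hosts several partitions or one partition is receiving far more actors than the rest.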

BMGDigitalTech commented 4 years ago

Thanks for getting back to me..

I was off for a little while dealing with what's going on in the world. So what you are saying is that where actors execute has nothing to do with load, and SF will not move an actor, or start an actor on a different machine, if the load is high. It's all about the ActorID (in my case a GUID) and the partition manager.