nicolaka opened this issue 7 years ago
Thanks for the report @nicolaka. Can you please update it with the version you're using?
Also, can you please include docker node inspect output for each node?
Here's the info:
docker version
Client:
Version: 17.04.0-ce
API version: 1.28
Go version: go1.7.5
Git commit: 4845c56
Built: Mon Apr 3 18:07:42 2017
OS/Arch: linux/amd64
Server:
Version: 17.04.0-ce
API version: 1.28 (minimum version 1.12)
Go version: go1.7.5
Git commit: 4845c56
Built: Mon Apr 3 18:07:42 2017
OS/Arch: linux/amd64
Experimental: true
and for node inspect:
cat node-inspect.txt
[
{
"ID": "z2h81e0en4km17xj4fy3w7d1d",
"Version": {
"Index": 8715
},
"CreatedAt": "2017-04-12T05:55:54.189525855Z",
"UpdatedAt": "2017-05-01T10:46:00.671308998Z",
"Spec": {
"Labels": {
"com.docker.ucp.SANs": "10.20.2.220,127.0.0.1,localhost,ucp-worker-10-20-2-220-eu-west-1b,nzdq-es53-v5pr-posh-z2bi-ei3d-caea-45sm-lhtp-qa3z-q33e-a4u2",
"environment": "production",
"region": "eu-west-1",
"zone": "b"
},
"Role": "worker",
"Availability": "active"
},
"Description": {
"Hostname": "ucp-worker-10-20-2-220-eu-west-1b",
"Platform": {
"Architecture": "x86_64",
"OS": "linux"
},
"Resources": {
"NanoCPUs": 4000000000,
"MemoryBytes": 15768662016
},
"Engine": {
"EngineVersion": "17.04.0-ce",
"Plugins": [
{
"Type": "Network",
"Name": "bridge"
},
{
"Type": "Network",
"Name": "host"
},
{
"Type": "Network",
"Name": "ipvlan"
},
{
"Type": "Network",
"Name": "macvlan"
},
{
"Type": "Network",
"Name": "null"
},
{
"Type": "Network",
"Name": "overlay"
},
{
"Type": "Volume",
"Name": "local"
}
]
}
},
"Status": {
"State": "ready",
"Addr": "10.20.2.220"
}
}
]
[
{
"ID": "ttj80hvbbjo1ny5g1kye3g5le",
"Version": {
"Index": 8714
},
"CreatedAt": "2017-04-12T05:53:23.70300962Z",
"UpdatedAt": "2017-05-01T10:46:00.245186769Z",
"Spec": {
"Labels": {
"com.docker.ucp.SANs": "nzdq-es53-v5pr-posh-z2bi-ei3d-caea-45sm-lhtp-qa3z-q33e-a4u2,10.10.2.28,127.0.0.1,localhost,ucp-worker-10-10-2-28-us-west-2b",
"environment": "production",
"region": "us-west-2",
"zone": "b"
},
"Role": "worker",
"Availability": "active"
},
"Description": {
"Hostname": "ucp-worker-10-10-2-28-us-west-2b",
"Platform": {
"Architecture": "x86_64",
"OS": "linux"
},
"Resources": {
"NanoCPUs": 4000000000,
"MemoryBytes": 15768571904
},
"Engine": {
"EngineVersion": "17.04.0-ce",
"Plugins": [
{
"Type": "Network",
"Name": "bridge"
},
{
"Type": "Network",
"Name": "host"
},
{
"Type": "Network",
"Name": "ipvlan"
},
{
"Type": "Network",
"Name": "macvlan"
},
{
"Type": "Network",
"Name": "null"
},
{
"Type": "Network",
"Name": "overlay"
},
{
"Type": "Volume",
"Name": "local"
}
]
}
},
"Status": {
"State": "ready",
"Addr": "10.10.2.28"
}
}
]
[
{
"ID": "xn9koneufgqi3c903ghpa1v31",
"Version": {
"Index": 8714
},
"CreatedAt": "2017-04-12T05:53:55.475549039Z",
"UpdatedAt": "2017-05-01T10:46:00.245482053Z",
"Spec": {
"Labels": {
"com.docker.ucp.SANs": "127.0.0.1,localhost,ucp-worker-10-10-3-239-us-west-2c,nzdq-es53-v5pr-posh-z2bi-ei3d-caea-45sm-lhtp-qa3z-q33e-a4u2,10.10.3.239",
"environment": "production",
"region": "us-west-2",
"zone": "c"
},
"Role": "worker",
"Availability": "active"
},
"Description": {
"Hostname": "ucp-worker-10-10-3-239-us-west-2c",
"Platform": {
"Architecture": "x86_64",
"OS": "linux"
},
"Resources": {
"NanoCPUs": 4000000000,
"MemoryBytes": 15768571904
},
"Engine": {
"EngineVersion": "17.04.0-ce",
"Plugins": [
{
"Type": "Network",
"Name": "bridge"
},
{
"Type": "Network",
"Name": "host"
},
{
"Type": "Network",
"Name": "ipvlan"
},
{
"Type": "Network",
"Name": "macvlan"
},
{
"Type": "Network",
"Name": "null"
},
{
"Type": "Network",
"Name": "overlay"
},
{
"Type": "Volume",
"Name": "local"
}
]
}
},
"Status": {
"State": "ready",
"Addr": "10.10.3.239"
}
}
]
[
{
"ID": "zcnpu0icgwe12ki01bh06c489",
"Version": {
"Index": 8715
},
"CreatedAt": "2017-04-12T05:56:05.073664497Z",
"UpdatedAt": "2017-05-01T10:46:00.671398378Z",
"Spec": {
"Labels": {
"com.docker.ucp.SANs": "ucp-worker-10-20-1-10-eu-west-1a,nzdq-es53-v5pr-posh-z2bi-ei3d-caea-45sm-lhtp-qa3z-q33e-a4u2,10.20.1.10,127.0.0.1,localhost",
"environment": "production",
"region": "eu-west-1",
"zone": "a"
},
"Role": "worker",
"Availability": "active"
},
"Description": {
"Hostname": "ucp-worker-10-20-1-10-eu-west-1a",
"Platform": {
"Architecture": "x86_64",
"OS": "linux"
},
"Resources": {
"NanoCPUs": 4000000000,
"MemoryBytes": 15768662016
},
"Engine": {
"EngineVersion": "17.04.0-ce",
"Plugins": [
{
"Type": "Network",
"Name": "bridge"
},
{
"Type": "Network",
"Name": "host"
},
{
"Type": "Network",
"Name": "ipvlan"
},
{
"Type": "Network",
"Name": "macvlan"
},
{
"Type": "Network",
"Name": "null"
},
{
"Type": "Network",
"Name": "overlay"
},
{
"Type": "Volume",
"Name": "local"
}
]
}
},
"Status": {
"State": "ready",
"Addr": "10.20.1.10"
}
}
]
I think I understand what's going on. You have two placement preferences specified: one for region, then another for zone. This tells the scheduler that its first priority should be an even split between regions; then, within each region, tasks should be divided evenly between nodes in different zones.
There seem to be 2 different values for the region label:
"region": "us-west-2"
"region": "eu-west-1"
so we'd expect tasks to be roughly split between us-west-2 and eu-west-1.
as you can see zone eu-west-1b got 2/5 tasks
This makes sense, because the 5 tasks had to first be split between eu-west-1 and us-west-2. In this case, eu-west-1 got 3 of the 5 tasks. Then those tasks were split between zones a and b, and b got 2 of those 3.
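The two-level split described above can be sketched with a toy model. This is an illustrative assumption, not SwarmKit's actual code: tasks are round-robined across values of the first preference (region), then across values of the second (zone) within each region.

```python
from collections import defaultdict

def spread(tasks, buckets):
    """Round-robin task IDs over sorted bucket keys."""
    out = defaultdict(list)
    keys = sorted(buckets)
    for i, task in enumerate(tasks):
        out[keys[i % len(keys)]].append(task)
    return out

def schedule(num_tasks, node_labels):
    """node_labels: list of (region, zone) pairs, one per node."""
    zones_by_region = defaultdict(set)
    for region, zone in node_labels:
        zones_by_region[region].add(zone)
    placement = {}
    # first split across regions, then across zones inside each region
    for region, tasks in spread(range(num_tasks), zones_by_region).items():
        for zone, zone_tasks in spread(tasks, zones_by_region[region]).items():
            placement[(region, zone)] = len(zone_tasks)
    return placement

# topology from the report: eu-west-1 has 2 zones, us-west-2 has 3
nodes = [("eu-west-1", "a"), ("eu-west-1", "b"),
         ("us-west-2", "a"), ("us-west-2", "b"), ("us-west-2", "c")]
# one eu-west-1 zone ends up with 2 tasks; one us-west-2 zone gets none
print(schedule(5, nodes))
```

With 5 tasks, one region gets 3 and the other 2, so the region with 2 zones cannot avoid doubling up in one zone while the region with 3 zones leaves one zone empty.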
and zone us-west-2a got no tasks at all.
I don't see any nodes in region us-west-2 and zone a. They all seem to be in zones b or c.
If your goal is to split evenly over the different zones, without regard to region, you could give the nodes labels with values like eu-west-1b, and just specify one placement preference referencing that label.
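Sketching that workaround under the same round-robin assumption: with one combined region+zone label per node and a single spread preference, five tasks land one per zone.

```python
from collections import defaultdict

def spread(tasks, buckets):
    """Round-robin task IDs over sorted bucket keys."""
    out = defaultdict(list)
    keys = sorted(buckets)
    for i, task in enumerate(tasks):
        out[keys[i % len(keys)]].append(task)
    return out

# one combined label per node, one flat spread preference over it
labels = ["eu-west-1a", "eu-west-1b", "us-west-2a", "us-west-2b", "us-west-2c"]
placement = spread(range(5), labels)
print({label: len(tasks) for label, tasks in placement.items()})  # every zone gets exactly one task
```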
@aaronlehmann there are nodes labeled with zone=a; I just listed the nodes that this service was running on.
What's interesting is that if I launch the service with replicas=5 from the start, it will spread them across the 5 zones. But that is not what I get when I launch with 3 replicas and then scale to 5. Not sure if this is by design.
If it's possible to add node inspect output for every node, I can take a look and see if the behavior is consistent with the design.
There is some randomness involved - for example, when 5 tasks are split across two regions, one of the regions will get 2 tasks and the other will get 3. It may be that when you ran service create you got an outcome that happened to be what you wanted, but when you scaled the service later it happened to give different results.
@nicolaka: Does this still look like a bug? Are you able to provide node inspect output for every node?
@aaronlehmann, assuming your original understanding is correct, this should probably still be considered a bug.
This split-on-first-placement-pref then split-on-second-placement-pref seems like a faulty design. The scheduler should consider all nodes & all rules at once.
For example, the scheduler should be smart enough to determine that placing one more task in us-west-2 is better (satisfies more constraints, without compromise) than placing that same task in eu-west-1.
@Kent-H: I suppose that when the number of tasks does not split evenly, we could favor branches of the tree with the most nodes underneath them, instead of choosing arbitrary branches for the remaining tasks.
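One way that weighted variant could look, as a sketch of the idea rather than a committed design: hand leftover tasks to the branches with the most nodes underneath, so the later per-zone split can still reach one task per zone.

```python
def weighted_split(num_tasks, branch_sizes):
    """Split tasks across branches in proportion to branch size,
    giving any remainder to the largest branches first."""
    total = sum(branch_sizes.values())
    counts = {b: num_tasks * size // total for b, size in branch_sizes.items()}
    remainder = num_tasks - sum(counts.values())
    for branch in sorted(branch_sizes, key=branch_sizes.get, reverse=True):
        if remainder == 0:
            break
        counts[branch] += 1
        remainder -= 1
    return counts

# 5 tasks over a region with 2 zones and a region with 3 zones:
# the 3-zone region gets 3 tasks, so every zone can hold exactly one
print(weighted_split(5, {"eu-west-1": 2, "us-west-2": 3}))  # {'eu-west-1': 2, 'us-west-2': 3}
```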
It seems like topology aware scheduling doesn't properly schedule tasks evenly when scaling up. Take this example: I'm running a service with two placement preferences, "region" and "zone". I have 2 regions, one with 3 zones and the other with 2 zones (5 zones in total). When I launch the service with 3 replicas it works fine, but when I try to scale to 5, one of the regions gets two tasks instead of one.
Launching a service with three replicas:
Seeing they're spread over:
Scaling to 5 (expecting they will be spread across all 5 zones):
As you can see, zone eu-west-1b got 2/5 tasks and zone us-west-2a got no tasks at all.
cc @aaronlehmann