moby / swarmkit

A toolkit for orchestrating distributed systems at any scale. It includes primitives for node discovery, raft-based consensus, task scheduling and more.
Apache License 2.0

topology aware scheduling doesn't work when scaling up #2162

Open nicolaka opened 7 years ago

nicolaka commented 7 years ago

It seems like topology-aware scheduling doesn't schedule tasks evenly when scaling up. Take this example: I'm running a service with two placement preferences, "region" and "zone". I have 2 regions, one with 3 zones and the other with 2 (5 zones total). When I launch the service with 3 replicas it works fine, but when I scale to 5, one of the zones gets two tasks while another gets none.
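For context, region and zone are ordinary node labels, applied with something like this (node name taken from the output below):

$ docker node update \
    --label-add region=eu-west-1 \
    --label-add zone=b \
    ucp-worker-10-20-2-220-eu-west-1b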

Launching a service with three replicas:

$ docker service create \
>   --name pets \
>   --replicas 3 \
>   --mount type=bind,source=/etc/hostname,destination=/tmp/worker/hostname \
>   --publish mode=host,target=5000,published=32000,protocol=tcp \
>   --constraint 'node.role==worker' \
>   --placement-pref 'spread=node.labels.region' \
>   --placement-pref 'spread=node.labels.zone' \
>   dtr.us-west.dcus17.dckr.org/dockercon/pets:v2

Checking how they're spread:

$ docker service ps pets
ID                  NAME                IMAGE                                           NODE                                DESIRED STATE       CURRENT STATE           ERROR               PORTS
jlu6v7jrag3l        pets.1              dtr.us-west.dcus17.dckr.org/dockercon/pets:v2   ucp-worker-10-20-2-220-eu-west-1b   Running             Running 2 minutes ago                       *:32000->5000/tcp
6rbew6twygwj        pets.2              dtr.us-west.dcus17.dckr.org/dockercon/pets:v2   ucp-worker-10-20-1-73-eu-west-1a    Running             Running 2 minutes ago                       *:32000->5000/tcp
fz91gc68t7mg        pets.3              dtr.us-west.dcus17.dckr.org/dockercon/pets:v2   ucp-worker-10-10-2-28-us-west-2b    Running             Running 2 minutes ago                       *:32000->5000/tcp

Scaling to 5 (expecting them to be spread across all 5 zones):

$ docker service ps pets
ID                  NAME                IMAGE                                           NODE                                DESIRED STATE       CURRENT STATE                ERROR               PORTS
jlu6v7jrag3l        pets.1              dtr.us-west.dcus17.dckr.org/dockercon/pets:v2   ucp-worker-10-20-2-220-eu-west-1b   Running             Running about an hour ago                        *:32000->5000/tcp
0aauudjh9w3n        pets.2              dtr.us-west.dcus17.dckr.org/dockercon/pets:v2   ucp-worker-10-20-2-216-eu-west-1b   Running             Starting 26 seconds ago
fz91gc68t7mg        pets.3              dtr.us-west.dcus17.dckr.org/dockercon/pets:v2   ucp-worker-10-10-2-28-us-west-2b    Running             Running about an hour ago                        *:32000->5000/tcp
qpm9ecydauc6        pets.4              dtr.us-west.dcus17.dckr.org/dockercon/pets:v2   ucp-worker-10-10-3-239-us-west-2c   Running             Running about a minute ago                       *:32000->5000/tcp
irliapzsmm45        pets.5              dtr.us-west.dcus17.dckr.org/dockercon/pets:v2   ucp-worker-10-20-1-10-eu-west-1a    Running             Starting 26 seconds ago

As you can see, zone eu-west-1b got 2/5 tasks and zone us-west-2a got no tasks at all.

cc @aaronlehmann

aaronlehmann commented 7 years ago

Thanks for the report @nicolaka. Can you please update it with the version you're using?

aaronlehmann commented 7 years ago

Also, can you please include docker node inspect output for each node?

nicolaka commented 7 years ago

Here's the info:

$ docker version
Client:
 Version:      17.04.0-ce
 API version:  1.28
 Go version:   go1.7.5
 Git commit:   4845c56
 Built:        Mon Apr  3 18:07:42 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.04.0-ce
 API version:  1.28 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   4845c56
 Built:        Mon Apr  3 18:07:42 2017
 OS/Arch:      linux/amd64
 Experimental: true

And for node inspect:

$ cat node-inspect.txt
[
    {
        "ID": "z2h81e0en4km17xj4fy3w7d1d",
        "Version": {
            "Index": 8715
        },
        "CreatedAt": "2017-04-12T05:55:54.189525855Z",
        "UpdatedAt": "2017-05-01T10:46:00.671308998Z",
        "Spec": {
            "Labels": {
                "com.docker.ucp.SANs": "10.20.2.220,127.0.0.1,localhost,ucp-worker-10-20-2-220-eu-west-1b,nzdq-es53-v5pr-posh-z2bi-ei3d-caea-45sm-lhtp-qa3z-q33e-a4u2",
                "environment": "production",
                "region": "eu-west-1",
                "zone": "b"
            },
            "Role": "worker",
            "Availability": "active"
        },
        "Description": {
            "Hostname": "ucp-worker-10-20-2-220-eu-west-1b",
            "Platform": {
                "Architecture": "x86_64",
                "OS": "linux"
            },
            "Resources": {
                "NanoCPUs": 4000000000,
                "MemoryBytes": 15768662016
            },
            "Engine": {
                "EngineVersion": "17.04.0-ce",
                "Plugins": [
                    {
                        "Type": "Network",
                        "Name": "bridge"
                    },
                    {
                        "Type": "Network",
                        "Name": "host"
                    },
                    {
                        "Type": "Network",
                        "Name": "ipvlan"
                    },
                    {
                        "Type": "Network",
                        "Name": "macvlan"
                    },
                    {
                        "Type": "Network",
                        "Name": "null"
                    },
                    {
                        "Type": "Network",
                        "Name": "overlay"
                    },
                    {
                        "Type": "Volume",
                        "Name": "local"
                    }
                ]
            }
        },
        "Status": {
            "State": "ready",
            "Addr": "10.20.2.220"
        }
    }
]
[
    {
        "ID": "ttj80hvbbjo1ny5g1kye3g5le",
        "Version": {
            "Index": 8714
        },
        "CreatedAt": "2017-04-12T05:53:23.70300962Z",
        "UpdatedAt": "2017-05-01T10:46:00.245186769Z",
        "Spec": {
            "Labels": {
                "com.docker.ucp.SANs": "nzdq-es53-v5pr-posh-z2bi-ei3d-caea-45sm-lhtp-qa3z-q33e-a4u2,10.10.2.28,127.0.0.1,localhost,ucp-worker-10-10-2-28-us-west-2b",
                "environment": "production",
                "region": "us-west-2",
                "zone": "b"
            },
            "Role": "worker",
            "Availability": "active"
        },
        "Description": {
            "Hostname": "ucp-worker-10-10-2-28-us-west-2b",
            "Platform": {
                "Architecture": "x86_64",
                "OS": "linux"
            },
            "Resources": {
                "NanoCPUs": 4000000000,
                "MemoryBytes": 15768571904
            },
            "Engine": {
                "EngineVersion": "17.04.0-ce",
                "Plugins": [
                    {
                        "Type": "Network",
                        "Name": "bridge"
                    },
                    {
                        "Type": "Network",
                        "Name": "host"
                    },
                    {
                        "Type": "Network",
                        "Name": "ipvlan"
                    },
                    {
                        "Type": "Network",
                        "Name": "macvlan"
                    },
                    {
                        "Type": "Network",
                        "Name": "null"
                    },
                    {
                        "Type": "Network",
                        "Name": "overlay"
                    },
                    {
                        "Type": "Volume",
                        "Name": "local"
                    }
                ]
            }
        },
        "Status": {
            "State": "ready",
            "Addr": "10.10.2.28"
        }
    }
]
[
    {
        "ID": "xn9koneufgqi3c903ghpa1v31",
        "Version": {
            "Index": 8714
        },
        "CreatedAt": "2017-04-12T05:53:55.475549039Z",
        "UpdatedAt": "2017-05-01T10:46:00.245482053Z",
        "Spec": {
            "Labels": {
                "com.docker.ucp.SANs": "127.0.0.1,localhost,ucp-worker-10-10-3-239-us-west-2c,nzdq-es53-v5pr-posh-z2bi-ei3d-caea-45sm-lhtp-qa3z-q33e-a4u2,10.10.3.239",
                "environment": "production",
                "region": "us-west-2",
                "zone": "c"
            },
            "Role": "worker",
            "Availability": "active"
        },
        "Description": {
            "Hostname": "ucp-worker-10-10-3-239-us-west-2c",
            "Platform": {
                "Architecture": "x86_64",
                "OS": "linux"
            },
            "Resources": {
                "NanoCPUs": 4000000000,
                "MemoryBytes": 15768571904
            },
            "Engine": {
                "EngineVersion": "17.04.0-ce",
                "Plugins": [
                    {
                        "Type": "Network",
                        "Name": "bridge"
                    },
                    {
                        "Type": "Network",
                        "Name": "host"
                    },
                    {
                        "Type": "Network",
                        "Name": "ipvlan"
                    },
                    {
                        "Type": "Network",
                        "Name": "macvlan"
                    },
                    {
                        "Type": "Network",
                        "Name": "null"
                    },
                    {
                        "Type": "Network",
                        "Name": "overlay"
                    },
                    {
                        "Type": "Volume",
                        "Name": "local"
                    }
                ]
            }
        },
        "Status": {
            "State": "ready",
            "Addr": "10.10.3.239"
        }
    }
]
[
    {
        "ID": "zcnpu0icgwe12ki01bh06c489",
        "Version": {
            "Index": 8715
        },
        "CreatedAt": "2017-04-12T05:56:05.073664497Z",
        "UpdatedAt": "2017-05-01T10:46:00.671398378Z",
        "Spec": {
            "Labels": {
                "com.docker.ucp.SANs": "ucp-worker-10-20-1-10-eu-west-1a,nzdq-es53-v5pr-posh-z2bi-ei3d-caea-45sm-lhtp-qa3z-q33e-a4u2,10.20.1.10,127.0.0.1,localhost",
                "environment": "production",
                "region": "eu-west-1",
                "zone": "a"
            },
            "Role": "worker",
            "Availability": "active"
        },
        "Description": {
            "Hostname": "ucp-worker-10-20-1-10-eu-west-1a",
            "Platform": {
                "Architecture": "x86_64",
                "OS": "linux"
            },
            "Resources": {
                "NanoCPUs": 4000000000,
                "MemoryBytes": 15768662016
            },
            "Engine": {
                "EngineVersion": "17.04.0-ce",
                "Plugins": [
                    {
                        "Type": "Network",
                        "Name": "bridge"
                    },
                    {
                        "Type": "Network",
                        "Name": "host"
                    },
                    {
                        "Type": "Network",
                        "Name": "ipvlan"
                    },
                    {
                        "Type": "Network",
                        "Name": "macvlan"
                    },
                    {
                        "Type": "Network",
                        "Name": "null"
                    },
                    {
                        "Type": "Network",
                        "Name": "overlay"
                    },
                    {
                        "Type": "Volume",
                        "Name": "local"
                    }
                ]
            }
        },
        "Status": {
            "State": "ready",
            "Addr": "10.20.1.10"
        }
    }
]
aaronlehmann commented 7 years ago

I think I understand what's going on. You have two placement preferences specified: one for region, then another for zone. This tells the scheduler that its first priority should be getting an even split between regions, and then within regions, tasks should be divided evenly between nodes in different zones.

There seem to be 2 different values for the region label:

"region": "us-west-2"
"region": "eu-west-1"

so we'd expect tasks to be roughly split between us-west-2 and eu-west-1.

as you can see zone eu-west-1b got 2/5 tasks

This makes sense, because the 5 tasks had to first be split between eu-west-1 and us-west-2. In this case, eu-west-1 got 3 of the 5 tasks. Then those tasks were split between zones a and b, and b got 2 of those 3.
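A minimal sketch of that two-level split (my own illustration, not the actual scheduler code), using the four nodes from the inspect output above:

package main

import "fmt"

// spread divides n tasks as evenly as possible across keys. When n does
// not divide evenly, the extra tasks land on whichever keys come first
// here; the real scheduler picks those winners arbitrarily.
func spread(n int, keys []string) map[string]int {
	counts := make(map[string]int)
	for i := 0; i < n; i++ {
		counts[keys[i%len(keys)]]++
	}
	return counts
}

func main() {
	regions := []string{"eu-west-1", "us-west-2"}
	zones := map[string][]string{
		"eu-west-1": {"a", "b"},
		"us-west-2": {"b", "c"},
	}

	// First preference: 5 tasks split between 2 regions, so one region
	// gets 3 and the other 2. Which one gets the extra task is arbitrary.
	byRegion := spread(5, regions)

	// Second preference: each region's share is split across only that
	// region's zones. The region holding 3 tasks over 2 zones has to
	// double up one of them (eu-west-1b in the run above).
	for _, r := range regions {
		fmt.Printf("%s: %v\n", r, spread(byRegion[r], zones[r]))
	}
}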

and zone us-west-2a got no tasks at all.

I don't see any nodes in region us-west-2 and zone a. They all seem to be in zones b or c.

If your goal is to split evenly over the different zones, without regard to region, you could give the nodes labels with values like eu-west-1b, and just specify one placement preference referencing that label.
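For example (hypothetical commands sketching that approach; the zone label would need to be set this way on every node):

$ docker node update --label-add zone=eu-west-1b ucp-worker-10-20-2-220-eu-west-1b

$ docker service create \
    --name pets \
    --replicas 5 \
    --placement-pref 'spread=node.labels.zone' \
    dtr.us-west.dcus17.dckr.org/dockercon/pets:v2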

nicolaka commented 7 years ago

@aaronlehmann there are nodes labeled with zone=a; I just listed the nodes that this service was running on.

What's interesting is that if I launch the service with replicas=5 from the start, it spreads them across the 5 zones. But that is not what I get when I launch with 3 replicas and then scale to 5. Not sure if this is by design.

aaronlehmann commented 7 years ago

If it's possible to add node inspect output for every node, I can take a look and see if the behavior is consistent with the design.

There is some randomness involved - for example, when 5 tasks are split across two regions, one of the regions will get 2 tasks and the other will get 3. It may be that when you ran service create you got an outcome that happened to be what you wanted, but when you scaled the service later it happened to give different results.

aaronlehmann commented 7 years ago

@nicolaka: Does this still look like a bug? Are you able to provide node inspect output for every node?

kent-h commented 7 years ago

@aaronlehmann, assuming your original understanding is correct, this should probably still be considered a bug.

This split-on-first-placement-pref, then split-on-second-placement-pref approach seems like a faulty design. The scheduler should consider all nodes & all rules at once.

For example, the scheduler should be smart enough to determine that placing one more task in us-west-2 is better (satisfies more constraints, without compromise) than placing that same task in eu-west-1.

aaronlehmann commented 7 years ago

@Kent-H: I suppose that when the number of tasks does not split evenly, we could favor branches of the tree with the most nodes underneath them, instead of choosing arbitrary branches for the remaining tasks.
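For illustration, here's a sketch of a proportional split along those lines (a hypothetical largest-remainder allocation, not anything swarmkit currently does):

package main

import (
	"fmt"
	"sort"
)

// weightedSpread allocates n tasks across branches in proportion to the
// number of nodes under each branch, handing leftover tasks to the
// branches with the largest fractional shares. Bigger branches are thus
// favored when the count doesn't divide evenly.
func weightedSpread(n int, nodesPer map[string]int) map[string]int {
	total := 0
	for _, c := range nodesPer {
		total += c
	}
	type frac struct {
		key string
		rem float64
	}
	out := make(map[string]int)
	assigned := 0
	var fracs []frac
	for k, c := range nodesPer {
		share := float64(n) * float64(c) / float64(total)
		whole := int(share)
		out[k] = whole
		assigned += whole
		fracs = append(fracs, frac{k, share - float64(whole)})
	}
	sort.Slice(fracs, func(i, j int) bool { return fracs[i].rem > fracs[j].rem })
	for i := 0; i < n-assigned; i++ {
		out[fracs[i%len(fracs)].key]++
	}
	return out
}

func main() {
	// With one node per zone (2 nodes in eu-west-1, 3 in us-west-2), the
	// proportional split of 5 tasks is exact: us-west-2 deterministically
	// gets the odd task, and every zone can then hold exactly one task.
	fmt.Println(weightedSpread(5, map[string]int{"eu-west-1": 2, "us-west-2": 3}))
	// map[eu-west-1:2 us-west-2:3]
}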