hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

System Scheduler use new Update stanza and Deployments #4740

Open aaroncline opened 6 years ago

aaroncline commented 6 years ago

Nomad version

Nomad v0.8.4 (dbee1d7d051619e90a809c23cf7e55750900742a)

Operating system and Environment details

CentOS 7, Consul v1.0.7, fabiolb 1.5.6, 3 Nomad clients, 3 Nomad servers

Issue

When deploying Fabio using the system scheduler and the exec driver, Nomad does not seem to respect the Update section hierarchy between the job and group sections.

Also, it does not seem as though Nomad treats this as a "deployment". No deployment ID is available in the job submission evaluation.

Reproduction steps

Use the job file below to launch fabio into an environment. Alter the force_job_restart epoch ENV and redeploy and you should see all fabio executions stop at essentially the same time. There is also no deployment ID which is how we track successful deployments on our service scheduled tasks. If you then change the job Update section to match the Group Update section, the tasks will be staggered appropriately.
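The "alter the epoch and redeploy" step above can be scripted. This is a minimal sketch, assuming the job is kept as JSON in the `{"Job": {...}}` shape shown in this issue; `bump_restart_epoch` is a hypothetical helper name and not part of Nomad. It only rewrites the `force_job_restart` value so that resubmitting the job changes every task and forces an update.

```python
import copy
import time


def bump_restart_epoch(payload, epoch=None):
    """Return a copy of the job payload with force_job_restart refreshed.

    Assumption: `payload` has the {"Job": {"TaskGroups": [...]}} shape used
    in this issue. Hypothetical helper; it does not contact Nomad.
    """
    epoch = int(time.time()) if epoch is None else int(epoch)
    updated = copy.deepcopy(payload)  # leave the caller's payload untouched
    for group in updated["Job"].get("TaskGroups") or []:
        for task in group.get("Tasks") or []:
            env = task.get("Env") or {}  # "Env" may be null in the payload
            env["force_job_restart"] = str(epoch)
            task["Env"] = env
    return updated
```

The returned payload can then be resubmitted, for example by POSTing it to the `/v1/jobs` HTTP endpoint of a Nomad server, to observe whether the allocations stop one at a time or all at once.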

Job file (if appropriate)

{
    "Job": {
        "AllAtOnce": false,
        "Constraints": null,
        "CreateIndex": 635411,
        "Datacenters": [
            "us-east-1"
        ],
        "Dispatched": false,
        "ID": "fabio",
        "JobModifyIndex": 855141,
        "Meta": null,
        "Migrate": null,
        "ModifyIndex": 855141,
        "Name": "fabio",
        "Namespace": "default",
        "ParameterizedJob": null,
        "ParentID": "",
        "Payload": null,
        "Periodic": null,
        "Priority": 50,
        "Region": "aws",
        "Reschedule": null,
        "Stable": false,
        "Status": "running",
        "StatusDescription": "",
        "Stop": false,
        "SubmitTime": 1538421216716838569,
        "TaskGroups": [
            {
                "Constraints": null,
                "Count": 1,
                "EphemeralDisk": {
                    "Migrate": false,
                    "SizeMB": 300,
                    "Sticky": false
                },
                "Meta": null,
                "Migrate": null,
                "Name": "devops",
                "ReschedulePolicy": null,
                "RestartPolicy": {
                    "Attempts": 2,
                    "Delay": 15000000000,
                    "Interval": 1800000000000,
                    "Mode": "fail"
                },
                "Tasks": [
                    {
                        "Artifacts": [
                            {
                                "GetterMode": "any",
                                "GetterOptions": {
                                    "checksum": "sha256:2dfe26aaa74b659a0e595654eb8f9247d947cbf652cbebe03fd8133c2851cb4a"
                                },
                                "GetterSource": "https://github.com/fabiolb/fabio/releases/download/v1.5.6/fabio-1.5.6-go1.9.2-linux_amd64",
                                "RelativeDest": "local/"
                            }
                        ],
                        "Config": {
                            "command": "fabio-1.5.6-go1.9.2-linux_amd64"
                        },
                        "Constraints": null,
                        "DispatchPayload": null,
                        "Driver": "exec",
                        "Env": {
                            "force_job_restart": "1538421216"
                        },
                        "KillSignal": "",
                        "KillTimeout": 5000000000,
                        "Leader": false,
                        "LogConfig": {
                            "MaxFileSizeMB": 10,
                            "MaxFiles": 10
                        },
                        "Meta": null,
                        "Name": "devops_fabio_exec",
                        "Resources": {
                            "CPU": 200,
                            "DiskMB": 0,
                            "IOPS": 0,
                            "MemoryMB": 512,
                            "Networks": [
                                {
                                    "CIDR": "",
                                    "Device": "",
                                    "DynamicPorts": null,
                                    "IP": "",
                                    "MBits": 10,
                                    "ReservedPorts": [
                                        {
                                            "Label": "fabio_9999",
                                            "Value": 9999
                                        },
                                        {
                                            "Label": "fabio_9998",
                                            "Value": 9998
                                        }
                                    ]
                                }
                            ]
                        },
                        "Services": [
                            {
                                "AddressMode": "auto",
                                "CanaryTags": null,
                                "CheckRestart": null,
                                "Checks": [
                                    {
                                        "AddressMode": "",
                                        "Args": null,
                                        "CheckRestart": null,
                                        "Command": "",
                                        "GRPCService": "",
                                        "GRPCUseTLS": false,
                                        "Header": null,
                                        "Id": "",
                                        "InitialStatus": "",
                                        "Interval": 10000000000,
                                        "Method": "",
                                        "Name": "service: \"fabio\" check",
                                        "Path": "",
                                        "PortLabel": "fabio_9999",
                                        "Protocol": "",
                                        "TLSSkipVerify": true,
                                        "Timeout": 5000000000,
                                        "Type": "tcp"
                                    }
                                ],
                                "Id": "",
                                "Name": "fabio",
                                "PortLabel": "fabio_9999",
                                "Tags": null
                            }
                        ],
                        "ShutdownDelay": 0,
                        "Templates": null,
                        "User": "",
                        "Vault": null
                    }
                ],
                "Update": {
                    "MaxParallel": 1,
                    "Stagger": 30000000000
                }
            }
        ],
        "Type": "system",
        "Update": {
            "MaxParallel": 0,
            "Stagger": 0
        },
        "VaultToken": "",
        "Version": 7
    }
}

aaroncline commented 6 years ago

I misreported initially and have made some edits. This actually appears to be a bug in the precedence between the Group and Job Update stanzas. According to your docs, the Group stanza should take precedence. https://www.nomadproject.io/docs/job-specification/update.html
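To make the expected precedence concrete, here is a minimal sketch of the merge the reporter has in mind. It assumes a simple field-by-field override in which any value set at the group level wins over the job level; `effective_update` is a hypothetical name, and this models the expectation described in this issue, not what the system scheduler actually did in v0.8.4.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000  # durations in the job JSON are in ns


def effective_update(job_update, group_update):
    """Merge a job-level Update block with a group-level one.

    Assumption: a key set (not None) in the group block overrides the job
    block; keys missing from the group block fall back to the job level.
    """
    merged = dict(job_update or {})
    for key, value in (group_update or {}).items():
        if value is not None:
            merged[key] = value
    return merged


# Values taken from the job file in this issue.
job_level = {"MaxParallel": 0, "Stagger": 0}
group_level = {"MaxParallel": 1, "Stagger": 30 * NANOSECONDS_PER_SECOND}

result = effective_update(job_level, group_level)
print(result["MaxParallel"], result["Stagger"] // NANOSECONDS_PER_SECOND)  # prints "1 30"
```

Under this reading, the job above would be expected to update one allocation at a time with a 30 second stagger, which matches the behaviour the reporter only saw after copying the group values to the job level.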

dadgar commented 6 years ago

@aaroncline Hey Aaron,

The system job currently doesn't support the new update system using deployments. You can see the callout here: https://www.nomadproject.io/docs/job-specification/update.html

I am going to rename the issue to reflect this

ricbartm commented 5 years ago

Hello @dadgar. We have a use case where we want to deploy a custom job on every node in a pool of nodes distributed around the globe, and we thought the system scheduler was the best fit. Nevertheless, given that the new deployments and update stanza configuration are not honoured, we may need to work around it by using the service scheduler, some job constraints to avoid multiple copies of the same job on the same node, and some automation to adjust the overall job count to match our cluster size as it grows or shrinks. This is doable, but far from ideal.

That said, this issue was opened a long time ago and has had very little activity. Is there any chance you could share what the plans are for this? I'd like to set some expectations (even if the answer is "we don't have plans for this") so we can make the most informed decision about it.

Finally, a shout-out to other folks, and especially to @aaroncline: how did you end up working around this issue for your use case?
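The count automation in the workaround described above could look roughly like the following. This is a sketch under stated assumptions: the node records are taken to be dicts with `Status`, `SchedulingEligibility` and `Datacenter` keys (as in a Nomad node listing), `desired_count` is a hypothetical helper, and the service job is assumed to carry a `distinct_hosts` constraint so that no node runs two copies.

```python
def desired_count(nodes, datacenters=None):
    """Count the nodes a one-per-host service job should cover.

    Assumption: each node is a dict with "Status", "SchedulingEligibility"
    and "Datacenter" keys. Nodes that are not ready, or that are marked
    ineligible (for example while draining), are skipped.
    """
    count = 0
    for node in nodes:
        if node.get("Status") != "ready":
            continue
        if node.get("SchedulingEligibility", "eligible") != "eligible":
            continue
        if datacenters and node.get("Datacenter") not in datacenters:
            continue
        count += 1
    return count


# Illustrative data only; these node names are made up.
example_nodes = [
    {"Name": "client-1", "Status": "ready", "SchedulingEligibility": "eligible", "Datacenter": "us-east-1"},
    {"Name": "client-2", "Status": "down", "SchedulingEligibility": "eligible", "Datacenter": "us-east-1"},
    {"Name": "client-3", "Status": "ready", "SchedulingEligibility": "ineligible", "Datacenter": "us-east-1"},
]
print(desired_count(example_nodes, datacenters=["us-east-1"]))  # prints 1
```

The result would be written into the task group's `Count` before each submission. Unlike a system job, this has to be re-run whenever nodes join or leave, which is the maintenance burden the comment above calls far from ideal.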

calavera commented 4 years ago

@dadgar we're investigating using Nomad at Netlify for a large heterogeneous deployment. Solving this issue would help us tremendously to decide whether to use Nomad. Is there anything we can do to help it move forward? The documentation says that this will be fixed in "future releases", but it'd be great to know whether you have more specific plans to address it.

schmichael commented 4 years ago

That's super exciting @calavera! nomadproject.io itself uses Netlify, so it would be exciting to be "self-hosted" in a way.

Unfortunately this feature is not planned for the upcoming 0.11.x or 0.12.0 releases. It is absolutely in our queue for prioritization after 0.12.0, but I don't want to make any promises at this time. Would it be possible to elaborate on your use case in case there's a workaround we could help provide?

I'll try to update this issue when it's prioritized on our roadmap.

apkrymov commented 2 years ago

@schmichael Any updates? We really need this feature. Our use case is the same as the one @ricbartm mentioned: we need to deploy a service based on host constraints, not on a fixed count of replicas in the cluster. The system scheduler fits perfectly, but we cannot control the deployment process because of the current Update stanza limitations.

ebarriosjr commented 2 years ago

@schmichael any updates? We could also really use this feature. Thanks!

schmichael commented 2 years ago

Unfortunately no updates. Sorry for letting this slip. Definitely still something we want to do, but I don't want to keep overpromising and underdelivering on timelines. :grimacing:

axsuul commented 2 years ago

Same here, we make heavy use of system jobs and really need a way to do rolling updates for them.

hyungjic commented 6 months ago

@schmichael Any updates on this issue?😄