hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Low throughput of batch jobs allocations on clients #14349

Open cqueinnec opened 1 year ago

cqueinnec commented 1 year ago

Nomad version

Operating system and Environment details

Issue

We've been running Nomad for a few years with small volumes of batch jobs. Now I'm analyzing how Nomad behaves when running a large number of batch jobs of different kinds.

My first experiment is to see how many tasks can be dispatched to a node and how fast they are processed. For this, I'm running one very simple job with count=1000. (My first attempt, creating 1000 or more separate jobs, didn't go well because of memory management and state replication, but that's another story.) The problem: I'd expect those tasks to be processed in a few seconds, since the client has more than enough resources to run them concurrently. Instead, they are processed in small batches of 2 to 6 tasks in parallel.

I've tried fiddling with the GC configuration (on both the client and server side) and looked at c2m and other resources on the Internet, but to no avail. I also want to mention that I've read https://github.com/hashicorp/nomad/issues/13933, which is a gold mine of information on how to dispatch huge numbers of batch jobs. I'd love to see an advanced section in the docs explaining how to properly configure servers and clients to achieve high processing volumes :)

Any help or guidelines would be appreciated. Thanks!

Reproduction steps

Run the job file below and watch the processing in the UI.

Expected Result

Considering how small the task is (executing an echo "Hello World!" with very small resources), I'd expect allocs to be dispatched by the hundreds.

Actual Result

Allocations are created and processed little by little. I see between 2 and 6 allocs being executed concurrently on the client:

2022-08-26_11h47_19

To me, the problem doesn't seem to be in the scheduler, because the UI shows that the client has already been determined a few seconds after the job starts:

image

Job file (if appropriate)

Job file, started with Levant (levant.exe deploy -force -force-count .\nomad-job-hello-world--one-group.hcl):

job "hello-world-[[ timeNowUTC ]]-[[ uuidv4 ]]" {
    datacenters = ["dc1"]
    type = "batch"

    constraint {
        attribute = "${node.unique.name}"
        operator  = "="
        value     = "my-nomad-client"
    }

    group "group" {
        count = 1000
        restart {
            attempts = 0
            mode = "fail"
        }
        reschedule {
            attempts  = 0
            unlimited = false
        }

        task "ping" {
            driver = "raw_exec"
            config {
                command = "C:\\Windows\\System32\\cmd.exe"
                args = ["/c echo 'Hello World!'"]
            }

            resources {
                cpu    = 20
                memory = 20
            }

            logs {
                max_files     = 1
                max_file_size = 5
            }
        }

        ephemeral_disk {
            size    = 20
        }
    }
}

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

lgfa29 commented 1 year ago

Hi @cqueinnec 👋

Given that all allocations were created and placed in the pending state I would say that the servers are performing as expected, so the client may be the bottleneck.

When you click on those pending allocations, and then on the task (ping in your sample job), what task events do you see? There should be a table like this:

image

This could show what those tasks are waiting for.
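The same timeline can also be read from the allocation's JSON rather than the UI. Below is a minimal Python sketch, assuming the allocation object has the shape `CreateTime` plus `TaskStates -> <task> -> Events`, with each event carrying a `Type` and a `Time` in Unix nanoseconds. The sample payload and the helper name `event_offsets` are made up for illustration and are not output from this cluster.

```python
import json

# Hypothetical allocation payload; the timestamps are invented for illustration.
SAMPLE_ALLOC = json.loads("""
{
  "CreateTime": 1662533400000000000,
  "TaskStates": {
    "ping": {
      "Events": [
        {"Type": "Received",   "Time": 1662533520000000000},
        {"Type": "Task Setup", "Time": 1662533520500000000},
        {"Type": "Started",    "Time": 1662533521000000000},
        {"Type": "Terminated", "Time": 1662533523000000000}
      ]
    }
  }
}
""")

NANOS_PER_SECOND = 1_000_000_000


def event_offsets(alloc, task_name):
    """Return (event type, seconds since allocation creation) for each task event.

    Assumption: TaskStates may be missing or null while the allocation is
    still pending, in which case an empty list is returned.
    """
    created = alloc["CreateTime"]
    task_state = (alloc.get("TaskStates") or {}).get(task_name)
    if not task_state:
        return []
    return [
        (event["Type"], (event["Time"] - created) / NANOS_PER_SECOND)
        for event in task_state.get("Events") or []
    ]


for event_type, offset in event_offsets(SAMPLE_ALLOC, "ping"):
    print(f"{event_type:<12} +{offset:.1f}s")
```

A large gap between creation and the first event would point at the client picking up the allocation late, while a large gap between later events would point at task setup or the driver.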

cqueinnec commented 1 year ago

Hello,

Thanks for your feedback! Here's a video showing that for a task received 2 minutes ago (for which the client seems to have already been determined), there is no task for the allocation as long as the status is pending. Once it starts, the events table shows that the whole process completes in about 3 seconds.

2022-09-07_06h50_08

It really feels like the client takes some time to grab the task, even though there are plenty of resources left. Let me know if there's a way to get more details. Thanks!
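As a rough sanity check, the figures reported in this thread (1000 allocations, 2 to 6 running at once, about 3 seconds each once started) can be turned into an estimate of how long the whole job takes to drain. This is only a back-of-the-envelope Python sketch: the function name is hypothetical, and it assumes constant concurrency and a constant per-allocation duration, which real runs will not match exactly.

```python
import math


def drain_time_seconds(total_allocs, concurrency, seconds_per_alloc):
    """Estimate the time to run all allocations in fixed-size waves.

    Simplifying assumption: exactly `concurrency` allocations run at a time
    and each takes `seconds_per_alloc` seconds from start to finish.
    """
    waves = math.ceil(total_allocs / concurrency)
    return waves * seconds_per_alloc


for concurrency in (2, 6):
    seconds = drain_time_seconds(1000, concurrency, 3)
    print(f"concurrency={concurrency}: ~{seconds} s (~{seconds / 60:.0f} min)")
```

Under these assumptions the job needs somewhere between roughly 8 and 25 minutes, compared with the few seconds expected if hundreds of allocations ran in parallel.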

cqueinnec commented 1 year ago

Could this be related to the number of available threads on the machine? Or is client execution unrelated to available threads (since it's Go, it might just use goroutines)?

tgross commented 1 year ago

@cqueinnec can you provide a still screenshot (or CLI output) of the Task Events, rather than a gif?

cqueinnec commented 1 year ago

Here's the Recent Events table for a single task. As mentioned, once the task starts, the events table shows that the whole process completes in just 2 or 3 seconds.

image