hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/
Other
14.76k stars 1.94k forks source link

[question] For custom task drivers e.g. LXC, firecracker, can "interval" for TaskStats be configured? #7786

Closed shishir-a412ed closed 4 years ago

shishir-a412ed commented 4 years ago

We are authoring a custom task driver for nomad using https://www.nomadproject.io/docs/internals/plugins/task-drivers/

When you launch a task using the plugin, the following happens.

1) A call to StartTask which is responsible for launching the task. 2) A call to TaskStats which will return a channel, and the driver will constantly write stats to this channel. 3) A call to WaitTask which will return a channel. The driver will write exit results to this channel when the task exits or close the channel if the context is canceled.

I have a question regarding (2) TaskStats

When (2) is called, a goroutine handleStats is launched which reports stats every interval seconds.

Right now the default value is 1 or 2 seconds. Not sure which one is correct. My logs are showing 1 second however the value defined in the proto is 2 seconds https://github.com/hashicorp/nomad/blob/master/plugins/drivers/proto/driver.proto#L248

Can this be configured? Using the nomad plugin stanza or some other option? Reporting task stats every second is very CPU intensive and not every driver would like to do it that often.

Since this is a driver level implementation, driver should at least have a way to configure this value.

A very simple solution would be to just override it (interval = 5) after you receive the parameter in handleStats. However, I wanted to check if there is a better (right) way to configure this value?

tgross commented 4 years ago

Hi @shishir-a412ed!

I took a look at where this interval is defined and the loop is every 1s, but backs-off up to 5s. At that point the communication channel is established. See the stats_hook. But then the driver can decide to send stats on that channel on an interval of its choosing. Take a look at the shared code for the Linux executor (used by exec and java task drivers) for an example. So I think it's a matter of making sure the stats channel isn't being torn down after stats are being sent?

shishir-a412ed commented 4 years ago

@tgross If you look at the description of stats_hook it looks like the 5-second backoff is for retrying until the channel is established.

My question is more on once the channel is established, driver will constantly flood the channel with task stats (every 1 second).

We have followed the same code as Linux executor which is used by exec and java task drivers. Now to corroborate my claim, I added a logger (print) statement before timer.Reset(interval) and it is printing every 1 second.

Consider a use case where a custom task driver launches 100 tasks on a nomad client node. The driver will be fetching task stats for these 100 tasks every second and reporting it back to stats channels.

Is there an option to add an interval field in the plugin stanza of the nomad client HCL?

plugin "my plugin" {
  config {
    enabled = true
    interval = 5
  }
}

and then task driver can read this value interval = 5 and use it for timer.Reset(interval)

tgross commented 4 years ago

Ah yeah, I see what you're saying... that interval is being threaded all the way down from the stats hook configuration.

It's being added to the stats hook when the task runner hooks are initialized, which should be configured by the client.telemetry.collection_interval field in the Nomad client configuration. So this is configurable on a per-client basis.

If you wanted to have it available on a per-plugin basis for a custom plugin, you could add the field in the plugin's configuration schema, and then have your the plugin's config override the interval it gets from the client configuration.

shishir-a412ed commented 4 years ago

@tgross Maybe I am missing something trivial, but this is what I am doing:

1) Added "stats_interval": hclspec.NewAttr("stats_interval", "string", false), Somewhere here: lxc_path

2) Added a new field StatsInterval stringcodec:"stats_interval"` in myConfigstruct. Somewhere here: [LXCPath`](https://github.com/hashicorp/nomad-driver-lxc/blob/master/lxc/driver.go#L131)

3) Tried to access StatsInterval in StartTask

      d.logger.Info(fmt.Sprintf("Stats interval: %s", d.config.StatsInterval))

Similar to lxcPath()

I was hoping that parsing would happen automatically, and I will be able to access StatsInterval in (3). However, it's printing no value (empty) for me.

Any idea what's going on?

shishir-a412ed commented 4 years ago

Setting the value here:

$ cat win_client.hcl

plugin "win_iis" {
  config {
    enabled = true
    stats_interval = "55s"
  }
}
notnoop commented 4 years ago

@shishir-a412ed This looks correct, so I'm a bit surprised it's not working either. My hunch is a binary name mismatch - the plugin binary name must be win_iss (or win_iss.exe) for that plugin stanza to work.

For further debugging, I might suggest adding logging in driver.SetConfig function, and/or setting a default value like:

"stats_interval": hclspec.NewDefault(
    hclspec.NewAttr("stats_interval", "string", false),
    hclspec.NewLiteral(`"55s"`),
)

If this is an open source project, we can dig into it a bit more as well.

shishir-a412ed commented 4 years ago

@notnoop Thanks for taking a look. You are right! binary name mismatch was causing the problem. Changing the binary name to win_iis.exe fixed it, and I am able to read the value now.

The plugin is not open source right now, but we do plan to open source it at some point.

notnoop commented 4 years ago

Glad to hear it helped. I created https://github.com/hashicorp/nomad/issues/7807 to ease debugging or noticing this issue!

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.