hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/
Other
14.78k stars 1.94k forks source link

does nomad must honor collection_interval when collect telemetry by allocations? #6961

Closed tantra35 closed 4 years ago

tantra35 commented 4 years ago

Nomad version

Nomad v0.10.2 (40edb4d3dd7145d2c535160fa6f7f3eb0cb4b8f7+CHANGES)

Is it normal that nomad doesn't honor collection_interval (https://www.nomadproject.io/docs/configuration/telemetry.html#collection_interval) when collecting statistic by allocations

For example on nomad client we have flow telemetry configuration:

telemetry
{
    disable_hostname                       = true
    publish_allocation_metrics             = true
    publish_node_metrics                   = true
    collection_interval                    = "60s"
    disable_dispatched_job_summary_metrics = true
    statsd_address                         = "localhost:8125"
    prefix_filter                          = [
        "-nomad.runtime",
    ]                                                                                                                                                                                                                                                                            
}

And i think that telemetry for allocation will be collected every 60 seconds, but this is not true, and statistics for allocation collected every 1 second, at localhost:8125 we have custom agregated statsd proxy and can intercept all metrics, so from logs from this daemon we have follow frequency on metrics:

[2020-01-20 15:36:55.355] DEBUG Got metric: nomad.client.allocs.memory.kernel_usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:55.355] DEBUG Got metric: nomad.client.allocs.memory.kernel_max_usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:55.355] DEBUG Got metric: nomad.client.allocs.memory.allocated.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:55.355] DEBUG Got metric: nomad.client.allocs.cpu.total_percent.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:55.355] DEBUG Got metric: nomad.client.allocs.cpu.system.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:55.361] DEBUG Got metric: nomad.client.allocs.cpu.user.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:55.361] DEBUG Got metric: nomad.client.allocs.cpu.throttled_time.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:55.361] DEBUG Got metric: nomad.client.allocs.cpu.throttled_periods.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:55.361] DEBUG Got metric: nomad.client.allocs.cpu.total_ticks.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.356] DEBUG Got metric: nomad.client.allocs.memory.rss.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.356] DEBUG Got metric: nomad.client.allocs.memory.cache.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.356] DEBUG Got metric: nomad.client.allocs.memory.swap.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.356] DEBUG Got metric: nomad.client.allocs.memory.usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.356] DEBUG Got metric: nomad.client.allocs.memory.max_usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.356] DEBUG Got metric: nomad.client.allocs.memory.kernel_usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.356] DEBUG Got metric: nomad.client.allocs.memory.kernel_max_usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.356] DEBUG Got metric: nomad.client.allocs.memory.allocated.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.356] DEBUG Got metric: nomad.client.allocs.cpu.total_percent.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.356] DEBUG Got metric: nomad.client.allocs.cpu.system.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.361] DEBUG Got metric: nomad.client.allocs.cpu.user.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.361] DEBUG Got metric: nomad.client.allocs.cpu.throttled_time.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.361] DEBUG Got metric: nomad.client.allocs.cpu.throttled_periods.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:56.361] DEBUG Got metric: nomad.client.allocs.cpu.total_ticks.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.364] DEBUG Got metric: nomad.client.allocs.memory.rss.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.364] DEBUG Got metric: nomad.client.allocs.memory.cache.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.364] DEBUG Got metric: nomad.client.allocs.memory.swap.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.365] DEBUG Got metric: nomad.client.allocs.memory.usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.365] DEBUG Got metric: nomad.client.allocs.memory.max_usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.365] DEBUG Got metric: nomad.client.allocs.memory.kernel_usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.365] DEBUG Got metric: nomad.client.allocs.memory.kernel_max_usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.365] DEBUG Got metric: nomad.client.allocs.memory.allocated.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.365] DEBUG Got metric: nomad.client.allocs.cpu.total_percent.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.366] DEBUG Got metric: nomad.client.allocs.cpu.system.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.461] DEBUG Got metric: nomad.client.allocs.cpu.user.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.461] DEBUG Got metric: nomad.client.allocs.cpu.throttled_time.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.461] DEBUG Got metric: nomad.client.allocs.cpu.throttled_periods.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:57.461] DEBUG Got metric: nomad.client.allocs.cpu.total_ticks.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:58.365] DEBUG Got metric: nomad.client.allocs.memory.rss.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:58.365] DEBUG Got metric: nomad.client.allocs.memory.cache.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:58.365] DEBUG Got metric: nomad.client.allocs.memory.swap.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:58.365] DEBUG Got metric: nomad.client.allocs.memory.usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default
[2020-01-20 15:36:58.365] DEBUG Got metric: nomad.client.allocs.memory.max_usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default

for example metric nomad.client.allocs.memory.kernel_max_usage.vault_debug-00.test.5f9fb00c-93da-2a3c-80ba-21322f989a6a.vault_debug_task.default collected evry 1 second

tgross commented 4 years ago

Hi @tantra35! It looks like client stats are collected at the specified interval but task resource usage stats collection is a tight loop around the collection here: stats_hook.go#L87-L112. Based on the more detailed treatment of the configuration in the telemetry metrics docs I'm not sure whether or not that's intentional and just a documentation gap.

tgross commented 4 years ago

My colleague @notnoop has pointed out to me that we do pass the interval to the task driver's Stats() method (see here) so that the drivers can run the stats collection on that interval. The exact implementation is up to the driver though. So either we're missing a spot to thread through the interval from the config or the specific task driver isn't implementing it correctly. Which task driver are you using here, @tantra35 ?

tantra35 commented 4 years ago

@tgross on test stand for simplicity purposes we use exec driver

tantra35 commented 4 years ago

@tgross thanks for clarify as code works. After some debugging i found that value of interval passed to exec.(*Driver).TaskStats is 1s:

(dlv) break exec.(*Driver).TaskStats
(dlv) c
> github.com/hashicorp/nomad/drivers/exec.(*Driver).TaskStats() /opt/gopath/src/github.com/hashicorp/nomad/drivers/exec/driver.go:482 (hits goroutine(2011):1 total:3) (PC: 0x141bbe3)
Warning: debugging optimized function
(dlv) bt
0  0x000000000141bbe3 in github.com/hashicorp/nomad/drivers/exec.(*Driver).TaskStats
   at /opt/gopath/src/github.com/hashicorp/nomad/drivers/exec/driver.go:482
1  0x0000000001311332 in github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*DriverHandle).Stats
   at /opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/driver_handle.go:49
2  0x0000000001311332 in github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*LazyHandle).Stats
   at /opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/lazy_handle.go:141
3  0x0000000001316e4d in github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*statsHook).callStatsWithRetry
   at /opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/stats_hook.go:126
4  0x0000000001316d65 in github.com/hashicorp/nomad/client/allocrunner/taskrunner.(*statsHook).collectResourceUsageStats
   at /opt/gopath/src/github.com/hashicorp/nomad/client/allocrunner/taskrunner/stats_hook.go:92
5  0x00000000004625e1 in runtime.goexit
   at /usr/local/go/src/runtime/asm_amd64.s:1337
(dlv) print interval
github.com/hashicorp/nomad/vendor/github.com/gorilla/websocket.writeWait (1000000000)

so problem is in interval value passing code, and it looks like all drivers doensn't honor collection_interval

tgross commented 4 years ago

Thanks @tantra35! We'll look into this!

tantra35 commented 4 years ago

@tgross ah sorry, just found and this is our misconfiguration in provisioning, so bug doesn't realy present

tantra35 commented 4 years ago

@tgross after some investigations i mast reopen this issue, due problem actualy exist at least in exec driver. interval value passed correctly(we made wrong conclusions due mistake in provisioning(there was not set collection_interval value)), other info menitioned in the beginning of this issue are correct

tgross commented 4 years ago

No problem. 😀 We'll take a look.

tgross commented 4 years ago

Repro from our testing for 0.10.4's upcoming release candidate.

Run an exec job and find its allocs: ``` ▶ nomad job run e2e/metrics/input/helloworld.nomad ▶ nomad job status helloworld ... Allocations ID Node ID Task Group Version Desired Status Created Modified 5ebd84ce d442d255 hello 0 run running 8s ago 8s ago 602b99d1 0427d6f3 hello 0 run running 8s ago 7s ago feaf805d d442d255 hello 0 run running 8s ago 7s ago ``` ssh into the node `d442d255`. Update the telemetry config at `/etc/nomad.d/base.hcl` to increase the collection interval and emit stats: ```diff ▶ diff e2e/terraform/shared/nomad/base.hcl base.hcl 14c14,15 < collection_interval = "1s" --- > collection_interval = "60s" > statsd_address = "localhost:8125" ``` Create a statsd config: ```js { graphitePort: 2003 , port: 8125 , backends: [ "./backends/console" ] , dumpMessages: true } ``` Run a statsd collector on the node, using that configuration file: ``` $ sudo docker run -d --name statsd \ -v "$(pwd)/config.js:/usr/src/app/config.js" \ --network=host statsd/statsd:latest ``` Restart the Nomad client, wait a while, and then check that statsd is getting logs at the expected rate. We'll receive debug messages every 60 -- these are Nomad's collection intervals. The lines between (every 10 seconds) are statsd's own flushing of all the metrics it knows to console. ``` $ sudo systemctl restart nomad $ sudo docker logs -f statsd | grep 'cpu\.total_ticks\.helloworld\.hello\.feaf' 5 Feb 19:52:00 - DEBUG: nomad.client.allocs.cpu.total_ticks.helloworld.hello.feaf805d-c518-65b9-ee41-c4d301a04c33.hello.default:0.383332|g 'nomad.client.allocs.cpu.total_ticks.helloworld.hello.feaf805d-c518-65b9-ee41-c4d301a04c33.hello.default': 0.383332, 'nomad.client.allocs.cpu.total_ticks.helloworld.hello.feaf805d-c518-65b9-ee41-c4d301a04c33.hello.default': 0.383332, 'nomad.client.allocs.cpu.total_ticks.helloworld.hello.feaf805d-c518-65b9-ee41-c4d301a04c33.hello.default': 0.383332, 'nomad.client.allocs.cpu.total_ticks.helloworld.hello.feaf805d-c518-65b9-ee41-c4d301a04c33.hello.default': 0.383332, 'nomad.client.allocs.cpu.total_ticks.helloworld.hello.feaf805d-c518-65b9-ee41-c4d301a04c33.hello.default': 0.383332, 'nomad.client.allocs.cpu.total_ticks.helloworld.hello.feaf805d-c518-65b9-ee41-c4d301a04c33.hello.default': 0.383332, 5 Feb 19:53:00 - DEBUG: nomad.client.allocs.cpu.total_ticks.helloworld.hello.feaf805d-c518-65b9-ee41-c4d301a04c33.hello.default:0.383332|g 'nomad.client.allocs.cpu.total_ticks.helloworld.hello.feaf805d-c518-65b9-ee41-c4d301a04c33.hello.default': 0.383332, ```
github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.