Closed: @jippi closed this issue 7 years ago
@jippi Fairly positive you hit this bug: https://github.com/hashicorp/nomad/pull/1762.
A short-term fix would be to use `exec` instead of `raw_exec`. I also suggest reserving some CPU and memory on the nodes; otherwise you are allowing Nomad to allocate the whole machine's memory.
@dadgar Okay, I've reserved some CPU/RAM for Nomad now (2.5 GHz and 512 MB).
Any ETA for a release containing that fix? Also, any suggestion on how I could verify that it's indeed that issue?
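For reference, a reservation like the one described above can be expressed in the `client` stanza of the Nomad agent config. This is only a sketch: the values mirror the 2.5 GHz / 512 MB figures mentioned, and the `reserved` block takes CPU in MHz and memory in MB.

```hcl
# Sketch of a Nomad client config reserving resources for the host and agent.
# Values mirror the figures mentioned above; adjust per node.
client {
  enabled = true

  reserved {
    cpu    = 2500  # MHz reserved for the OS / Nomad agent
    memory = 512   # MB reserved for the OS / Nomad agent
  }
}
```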
@jippi Did you end up verifying? Hopefully in 1-2 weeks.
@dadgar The error happened again today, even though I configured the allocation limit on Nomad.
I'm honestly not good enough at Go to be confident a custom build would be production-grade. If you have time, it would be amazing to get an amd64 Linux build with the cherry-picked commit that I can test out, or guidance on how to make a production-grade build for amd64 :)
The super odd thing is that only two out of seven boxes have the issue. Same kernel version and everything; the only difference is physical hardware vs. virtual KVM servers.
@dadgar Since I cherry-picked the commit you suggested from #1762, I haven't observed the issue! :)
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
Hi,
Running 0.4.1, I'm seeing some weird behavior from Nomad. It happens roughly daily on our web servers.
The result is that the server becomes unresponsive for ~10 minutes while things OOM, and then slowly recovers again. No allocations or changes were made during these outages; the one in the logs happened on a Saturday with no one working or logged into the systems.
I'm also observing that the node is in `ready` mode, but `system` jobs will not actually restart on the node (Job + Allocation). Restarting Nomad makes the allocation succeed again: https://gist.github.com/jippi/046840d5c6c65b4e0e1ea32ea2424242

Log (with debug on): https://gist.github.com/jippi/95b88ef66fd592206406ba9d312ca228
Interestingly enough, the two clients that this behavior happens on are physical servers, whereas the $x other clients in the cluster, running inside KVM, don't act up like this.
They are provisioned identically with `puppet`, and their only major differences are physical vs. virtual machines, plus the fact that the web boxes (which see this issue) also have active Docker jobs running, whereas the other servers have Docker running but nothing allocated on it.

Allocation executor logs:
https://gist.github.com/jippi/83a32fce9d409a32fa6175b5793d7c2c
config.hcl
nomad agent-info
node as seen from /v1/node/:id
Example allocation from the server
Observed from datadog
Observed from newrelic (1)
Observed from newrelic (2)
From NewRelic, the data includes both the `nomad agent` and the different `nomad executor` instances; I'm unable to split them apart.