hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Allocation Logs Immediately Removed on Allocation Replacement #18034

Closed thetoothpick closed 1 year ago

thetoothpick commented 1 year ago

Nomad version

# nomad -version
Nomad v1.5.6
BuildDate 2023-05-19T18:26:13Z
Revision 8af70885c02ab921dedbdf6bc406a1e886866f80

Operating system and Environment details

# uname -a
Linux ip-10-3-6-38.ec2.internal 5.15.0-67-generic #74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Issue

When our Nomad service allocations finish (either successfully or unsuccessfully) and are replaced by another allocation from the same task group on the same client, all of the allocation's logs on the filesystem are immediately removed. This makes it difficult to capture output that is only written to standard out or standard error, especially if the allocation crashed or failed. It appears that the entire allocation directory is garbage collected immediately when the allocation is replaced, but we haven't been able to find any logs indicating why (or confirming that this is actually what's happening).

These are our GC-related settings:

server {
  job_gc_threshold              = "24h"
  eval_gc_threshold             = "24h"
  deployment_gc_threshold       = "24h"
  csi_volume_claim_gc_threshold = "3m"
}

We do have sticky ephemeral disks enabled, and the alloc/data directory is transferred to the new allocation successfully.
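For reference, the relevant part of our group configuration looks roughly like this (the group name and size are placeholders, not our real values):

group "example" {
  ephemeral_disk {
    sticky = true  # best-effort placement on the same node, carrying alloc/data over to the replacement
    size   = 500   # MB; placeholder value
  }
}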

On some hosts that run fewer allocations, we're sometimes able to see logs for an extended period after the previous allocations stop and are replaced, so the problem isn't universal. We've also checked disk usage on the hosts when this happens, and it's consistently very low (usually 2-5% used).

To work around this and help debug errors in our jobs, we developed a "poststop" task that saves the allocation logs after the allocation finishes. However, sometimes even this poststop task fails to start, with this error:

logmon: rpc error: code = Unknown desc = failed to create stdout logfile for "save-nomad-logs.stdout": open /.../alloc/logs: no such file or directory.

This seems to indicate that the allocation (log) directory is being deleted before the allocation even completes.

We're not sure what might be causing this.
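For reference, the poststop task mentioned above is roughly this shape (a simplified sketch; the real task has a few more details, and the archive path is illustrative):

task "save-nomad-logs" {
  lifecycle {
    hook = "poststop"
  }

  driver = "raw_exec"

  config {
    command = "tar"
    # Archive the allocation's shared logs directory; NOMAD_ALLOC_DIR and
    # NOMAD_SHORT_ALLOC_ID are interpolated from the task environment.
    args = [
      "-czvf", "/tmp/alloc-logs-${NOMAD_SHORT_ALLOC_ID}.tar.gz",
      "-C", "${NOMAD_ALLOC_DIR}", "logs"
    ]
  }

  restart {
    attempts = 0
  }
}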

Reproduction steps

  1. Configure Nomad with the following garbage collection (GC) related settings:
    server {
      job_gc_threshold              = "24h"
      eval_gc_threshold             = "24h"
      deployment_gc_threshold       = "24h"
      csi_volume_claim_gc_threshold = "3m"
    }
  2. Enable sticky ephemeral disks.
  3. Start Nomad service and deploy multiple allocations for a job on the same client.
  4. Observe that upon replacement of an allocation, all allocation logs on the filesystem are immediately removed.
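A minimal job along the lines of these steps might look like this (illustrative only; the driver, command, and names are placeholders). Deploy it, force a replacement (for example by changing an argument and re-running the job), and then check the old allocation's <data_dir>/alloc/<alloc_id>/alloc/logs directory on the client:

job "repro" {
  datacenters = ["dc1"]

  group "app" {
    count = 2

    ephemeral_disk {
      sticky = true
    }

    task "app" {
      driver = "raw_exec"

      config {
        command = "/bin/sh"
        args    = ["-c", "echo started; sleep 3600"]
      }
    }
  }
}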

Expected Result

When an allocation finishes and is replaced, the allocation logs should not be immediately removed, allowing for the capture of logs that may be useful for debugging purposes.

Actual Result

All allocation logs on the filesystem are immediately removed when the allocation is replaced, hindering the ability to capture critical logs for debugging.

Nomad Server logs (if appropriate)

Not gathered at debug level.

Nomad Client logs (if appropriate)

Not gathered at debug level.


Enabling debug logging requires a full redeploy of some of our infrastructure, but we can do it if required.

Please let us know the best way to continue investigating, or if anything obvious jumps out that we missed.

Thanks!

Kevin

thetoothpick commented 1 year ago

Just want to point out that I think this is distinct from #13632 (if I'm reading through it correctly), because the allocation history is still present in the UI (along with the allocation / task timelines, etc.), but we can't open the logs (because they're missing from the disk).

mr-karan commented 1 year ago

+1.

Chiming in with my use case: we have a few cron jobs that run at midnight. It'd be nice to have a configurable GC interval for these allocs so that if there's ever a need to debug the crons using their logs, it's still possible.

tgross commented 1 year ago

@thetoothpick As I noted in https://github.com/hashicorp/nomad/issues/13632#issuecomment-1177653516 the server GC and the client GC are separate operations with separate configuration. So you're right that #13632 is a different problem.

The client will emit info-level logs with text like "garbage collecting allocation", which would be useful to see here. Have you had a look at the client.gc_interval or the other gc_* configuration values on the client?
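For reference, those settings live in the client block of the agent configuration; the values below are the defaults (worth double-checking against the docs for your Nomad version):

client {
  gc_interval              = "1m" # how often the client checks for terminal allocations to collect
  gc_max_allocs            = 50   # collect terminal allocations once the client is tracking more than this
  gc_disk_usage_threshold  = 80   # percent of disk usage that triggers collection of terminal allocations
  gc_inode_usage_threshold = 70   # percent of inode usage that triggers collection
  gc_parallel_destroys     = 2    # how many allocation directories are destroyed concurrently
}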

thetoothpick commented 1 year ago

Hi @tgross - thanks for the pointers. We are hitting allocation garbage collection for certain allocations, usually due to full disks (which is a separate issue we're addressing).

We see a number of logs like this for these situations:

    2023-07-28T21:43:26.939Z [WARN]  client.gc: garbage collection skipped because no terminal allocations: reason="disk usage of 91 is over gc threshold of 80"
...
    2023-07-28T19:29:25.226Z [INFO]  client.gc: garbage collecting allocation: alloc_id=283a28be-b237-af3e-b22c-f20aa9b18100 reason="disk usage of 100 is over gc threshold of 80"

However, I don't think that's what's causing the log deletion we're seeing. The disks of the hosts where this happens are usually pretty empty, and while there is a "marking allocation for GC" line, there is no corresponding "garbage collecting allocation" line.

I've identified and gathered logs for an allocation where this behavior happened. I've included the tail end of it below, but can email the full log (contains potentially sensitive info).

Here, the allocation stopped and a new allocation replaced it on the same host. The allocation history remained, but the alloc dir was gone save for <id>/save-nomad-logs/secrets.

2023-07-28T17:24:19.441Z [INFO] client.gc: marking allocation for GC: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7
2023-07-28T17:24:19.441Z [INFO] client.alloc_runner: waiting for task to exit: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 task=c-media-driver
2023-07-28T17:24:19.443Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 task=save-nomad-logs type="Task Setup" msg="Building Task Directory" failed=false
2023-07-28T17:24:19.483Z [INFO] agent: (runner) rendered "(dynamic)" => "/nomad/data/alloc/667de301-622d-51ba-a54d-23bbc8d407b7/save-nomad-logs/local/config/application.conf"
2023-07-28T17:24:19.486Z [INFO] client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:tar Args:[-czvf /tmp/alloc-logs-667de301.tar.gz -C /nomad/data/alloc/667de301-622d-51ba-a54d-23bbc8d407b7/alloc logs]}"
2023-07-28T17:24:19.537Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 task=save-nomad-logs type=Started msg="Task started by client" failed=false
2023-07-28T17:24:19.586Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 task=save-nomad-logs type=Terminated msg="Exit Code: 1" failed=false
2023-07-28T17:24:19.593Z [INFO] client.alloc_runner.task_runner: not restarting task: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 task=save-nomad-logs reason="Policy allows no restarts"
2023-07-28T17:24:19.593Z [INFO] client.alloc_runner.task_runner: Task event: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 task=save-nomad-logs type="Not Restarting" msg="Policy allows no restarts" failed=true
2023-07-28T17:24:19.593Z [INFO] client.driver_mgr.raw_exec.executor: plugin process exited: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 driver=raw_exec task_name=save-nomad-logs path=/nomad/nomad_zip/nomad pid=3667330
2023-07-28T17:24:19.602Z [INFO] client.alloc_runner.task_runner.task_hook.logmon: plugin process exited: alloc_id=667de301-622d-51ba-a54d-23bbc8d407b7 task=save-nomad-logs path=/nomad/nomad_zip/nomad pid=3667316
2023-07-28T17:24:19.780Z [ERROR] client.alloc_migrator: error destroying alloc dir: alloc_id=7875fe18-3797-968c-2489-b54d011511ef previous_alloc=667de301-622d-51ba-a54d-23bbc8d407b7 previous_alloc_dir=/nomad/data/alloc/667de301-622d-51ba-a54d-23bbc8d407b7
| \t* failed to remove alloc dir "/nomad/data/alloc/667de301-622d-51ba-a54d-23bbc8d407b7": unlinkat /nomad/data/alloc/667de301-622d-51ba-a54d-23bbc8d407b7/save-nomad-logs/secrets: device or resource busy

This shows the "poststop" job (and failure) I described in the original issue.

The "client.alloc_migrator: error destroying alloc dir" line seems pretty relevant - is it standard procedure to delete the entire allocation directory when ephemeral_disk.sticky = true?

tgross commented 1 year ago

Thanks @thetoothpick, that context and those logs are very helpful.

2023-07-28T17:24:19.780Z [ERROR] client.alloc_migrator: error destroying alloc dir: alloc_id=7875fe18-3797-968c-2489-b54d011511ef previous_alloc=667de301-622d-51ba-a54d-23bbc8d407b7 previous_alloc_dir=/nomad/data/alloc/667de301-622d-51ba-a54d-23bbc8d407b7

This message "error destroying alloc dir" is coming from alloc_watcher.go#L281-L285. This file is where Nomad does the migration of the ephemeral data from one allocation to another when you have ephemeral_disk.migrate = true set.

We've got two different workflows here: a "remote" one for when the previous allocation is on a different client, and a "local" one for when the previous allocation is on the same node. In the remote case, we create a new "previous alloc" directory to unpack the data we've downloaded in migrateAllocDir, and then we delete that directory when we're done because there's no need for it to exist anymore.

But for the "local" case, which is what you're seeing, we move the old alloc/data and task directories via the Move method, and then when that returns successfully we destroy the entire previous allocation directory, which includes the logs!

As far as I can tell there's no reason to destroy the previous allocation directory at all there, as the usual allocation garbage collection should clean this up later. But it might have made sense back in 2017 when that code was written. :grinning:

The fix should be small, but I'll need to do some hands-on testing to make sure we're not breaking something I didn't think of. I've got a draft PR up here: https://github.com/hashicorp/nomad/pull/18108. It's getting towards EOD for me, so I'll try to get that tested out tomorrow.

tgross commented 1 year ago

https://github.com/hashicorp/nomad/pull/18108 has been merged and will ship in the next regular release of Nomad 1.6.x, with backports to 1.5.x and 1.4.x.