hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

ephemeral_disk documentation needs clarification #20355

Closed · GuyAtTheFront closed this issue 7 months ago

GuyAtTheFront commented 7 months ago

Proposal

The official docs for sticky and migrate use the language:

... specifies that Nomad should make a best-effort attempt to ...

Is it possible to specify what "best-effort attempt" means? And/or could the docs give examples of cases where Nomad will intentionally not uphold, or will fail to uphold, sticky or migrate?
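
For reference, a minimal sketch of the block in question (the job, group, and task names and the Docker image are illustrative, not from the docs):

```hcl
job "example" {
  group "app" {
    # sticky and migrate are the two parameters documented as "best-effort".
    ephemeral_disk {
      size    = 300   # MB reserved for the allocation's data directory
      sticky  = true  # prefer placing the replacement allocation on the same node
      migrate = true  # try to copy the ephemeral disk data to the replacement allocation
    }

    task "app" {
      driver = "docker"

      config {
        image   = "busybox:1.36"
        command = "sleep"
        args    = ["3600"]
      }
    }
  }
}
```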

Use-cases

It is common for services to write state data to disk to help recover from failures. State data might include checkpoints, persistent queues, or unflushed data chunks.

Knowing exactly why, when, how, and how often these services might not recover from failures is a critical consideration for projects, since those instances might result in permanent data loss. It also determines whether additional replication or an external state store is required to mitigate the risk.

Attempted Solutions

The current workaround is to avoid the ephemeral_disk block entirely, because the behavior is unknown and the possibility of data loss without a clear root cause is not acceptable for production. Instead, a single NFS share is mounted across all Nomad client nodes, then mounted through client > group > task for state data to be written to (sketched below).
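
For concreteness, a sketch of that workaround; the volume name shared-state and the paths are made up, and the NFS mount itself is managed outside Nomad:

```hcl
# Client agent configuration: expose the NFS mount point as a host volume.
client {
  host_volume "shared-state" {
    path      = "/mnt/nfs/state"
    read_only = false
  }
}
```

```hcl
# Jobspec: request the host volume at the group level, mount it into the task.
group "app" {
  volume "state" {
    type      = "host"
    source    = "shared-state"
    read_only = false
  }

  task "app" {
    # driver and config omitted for brevity
    volume_mount {
      volume      = "state"
      destination = "/var/lib/app"
    }
  }
}
```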

However, this solution adds a dependency that sits outside the Nomad ecosystem, which makes the job less reusable. The dependency also adds a layer of bureaucracy to deployments, since it is not in the control of the infra devs (it is handled by the sysadmins).

(Hi again, tgross!)

tgross commented 7 months ago

Hi @GuyAtTheFront! Agreed that those docs could use some improvement, and I've done so in #20357. But the short version is that you definitely don't want to use ephemeral disk migration for anything you can't recreate at the destination allocation. It's best used for things like on-disk cache.

If you have data you need for correctness, like non-idempotent checkpoints, I strongly recommend instead using something like the Task API in combination with Variable Locks. Or use a persistent disk via host volumes or CSI.
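
To make the persistent-disk option concrete, here is a minimal sketch of mounting a CSI volume in a jobspec; it assumes a CSI plugin is already running and a volume with the ID app-state has been created or registered out of band (both names are illustrative):

```hcl
group "app" {
  volume "state" {
    type            = "csi"
    source          = "app-state"   # created/registered beforehand with `nomad volume`
    attachment_mode = "file-system"
    access_mode     = "single-node-writer"
  }

  task "app" {
    # driver and config omitted for brevity
    volume_mount {
      volume      = "state"
      destination = "/var/lib/app"
    }
  }
}
```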

GuyAtTheFront commented 7 months ago

I've somehow missed the docs you linked; I'll have a read.

With regard to this ticket: it is generally understood that all code works on a "best effort" basis; there's always a chance something breaks. For the docs to explicitly point this out implies something out of the ordinary. That could be the result of a design choice or technical limitation, or of empirical testing / user feedback. Either way, it would be great to document why the doc author(s) consider these features less robust than one would normally expect.

Alternatively, nothing is wrong and I'm overthinking it. Perhaps there's really no gold under the "no gold buried here" sign. In that case, perhaps remove the "best effort" wording from the docs.

tgross commented 7 months ago

> For the docs to explicitly point this out implies something out of the ordinary.

Hopefully the PR I've pushed up explains a bit more why this is out of the ordinary. For example: "Successful migration requires that the clients can reach each other directly over the Nomad HTTP port." It's totally possible you as the cluster admin don't have a flat network topology (i.e. your clients can't communicate with each other because they're in far-flung edge environments). So in this case ephemeral disk migration will simply fail, but Nomad treats this as a normal condition and doesn't fail the whole deployment for it.

GuyAtTheFront commented 7 months ago

Tested on my end, works as described 👍

Just to confirm:

1. Does the Nomad HTTP port refer to this section of the agent configuration?
2. Does the ephemeral disk migration happen over HTTP, or over TCP via something like SFTP / SCP? If it's over HTTP, does that mean there exists some endpoint that I can play with?

tgross commented 7 months ago

> Does the Nomad HTTP port refer to this section of the agent configuration?

Yes
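
For reference, that's the http value in the agent's ports block; a minimal sketch with the default value:

```hcl
# Agent configuration on each client node.
ports {
  http = 4646  # default; migration traffic between clients goes over this port
}
```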

> Does the ephemeral disk migration happen over HTTP, or over TCP via something like SFTP / SCP? If it's over HTTP, does that mean there exists some endpoint that I can play with?

It's over HTTP, but you can't really play with it... 😀 This is sort of an odd-ball operation in Nomad. Normally we use a MessagePack RPC over yamux (TCP) for all agent-to-agent communication, but migration happens over HTTP. In order to make this safe to do with Nomad's ACL system, the Nomad server issues the clients a one-time "migration token" (a hash of the node secret) that needs to be sent in the X-Nomad-Token header. The destination node sends that in a request to the undocumented /v1/client/allocation/:alloc_id/snapshot API on the source node, and the source node checks that migration token for validity.

If we were designing this feature today, we'd almost certainly use a signed request via Workload Identity, but that wasn't available to us years ago. 😀

GuyAtTheFront commented 7 months ago

Alright, understood. Thanks Tim for the detailed explanation!