hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

when we use raw_exec driver we can get "text file busy" error at launch #6002

Closed tantra35 closed 4 years ago

tantra35 commented 5 years ago

Nomad version

0.9.3

When we use the raw_exec driver, we sometimes see failed allocations with the following error:

Recent Events:
Time                       Type                   Description
2019-07-24T13:58:07+03:00  Killing                Sent interrupt
2019-07-24T13:57:51+03:00  Not Restarting         Error was unrecoverable
2019-07-24T13:57:51+03:00  Driver Failure         failed to launch command with executor: rpc error: code = Unknown desc = failed to start command path="/var/lib/nomad/alloc/5820ad73-7818-7a8d-bacc-0a4f71825f5a/fluend/local/fluent-bit" --- args=["/var/lib/nomad/alloc/5820ad73-7818-7a8d-bacc-0a4f71825f5a/fluend/local/fluent-bit" "-c" "/var/lib/nomad/alloc/5820ad73-7818-7a8d-bacc-0a4f71825f5a/fluend/local/td-agent-bit.conf"]: fork/exec /var/lib/nomad/alloc/5820ad73-7818-7a8d-bacc-0a4f71825f5a/fluend/local/fluent-bit: text file busy
2019-07-24T13:57:51+03:00  Downloading Artifacts  Client is downloading artifacts
2019-07-24T13:57:51+03:00  Task Setup             Building Task Directory
2019-07-24T13:57:51+03:00  Received               Task received by client

We think this happens because of the following task definition:

task "fluend"
{
    driver = "raw_exec"

    artifact
    {
        source = "s3::https://s3.amazonaws.com/some-path on s3/td-agent/fluent-bit.0.14.9.tar.gz"
    }

    config
    {
        command = "fluent-bit"
        args = ["-c", "${NOMAD_TASK_DIR}/td-agent-bit.conf"]
    }
}

Nomad sometimes doesn't wait for the artifact to be fully extracted before it tries to launch it.

notnoop commented 5 years ago

That's an interesting bug - thanks for reporting it. We'll need to investigate the cause here. Does it happen reliably or only in a subset of the tasks? What about other drivers?

The "text file busy" error indicates that go-getter is still expanding the tarball and hasn't closed the file descriptor before fluent-bit is invoked. This matches your hypothesis. However, in my simple scenarios, I haven't been able to reproduce it yet with 0.9.3 or the latest master.
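For readers unfamiliar with the error: on Linux, trying to execute a file while some process still holds it open for writing fails with ETXTBSY ("text file busy"). The following is a minimal sketch of that condition in isolation, not of Nomad or go-getter internals. The function name `launch_errno` and the file name `demo-binary` are made up for illustration, and the script assumes a POSIX system with `/bin/sh` and a temp directory that allows execution.

```python
import errno
import os
import stat
import subprocess
import tempfile


def launch_errno(keep_writer_open: bool) -> int:
    """Write a tiny script and try to run it.

    Returns 0 if the launch worked, otherwise the errno of the failed launch.
    `keep_writer_open` mimics an extractor that has not yet closed the file.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo-binary")  # hypothetical file name
        writer = open(path, "w")
        try:
            writer.write("#!/bin/sh\nexit 0\n")
            writer.flush()
            os.chmod(path, stat.S_IRWXU)
            if not keep_writer_open:
                # The "extraction finished" case: no writable descriptor remains.
                writer.close()
            try:
                subprocess.run([path], check=True)
            except OSError as exc:
                # A failed exec surfaces as OSError; ETXTBSY is the case of interest.
                return exc.errno
            return 0
        finally:
            writer.close()  # closing twice is harmless


if __name__ == "__main__":
    for keep_open in (False, True):
        code = launch_errno(keep_open)
        print("writer open:", keep_open, "->", errno.errorcode.get(code, "ok"))
```

If the hypothesis in this thread is right, the failing allocations correspond to the `keep_writer_open=True` case: the launch races with a writer that still has the extracted binary open.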

tantra35 commented 5 years ago

@notnoop We first began observing this when launching "Spark on Nomad", where we launch a huge number of jobs and see this error from time to time (fluent-bit is just a log-collector satellite task). We then changed the driver to exec, where we don't observe this behavior. I explain that by the fact that exec requires creating a chroot environment, and the delay this adds is enough for the file to be completely unpacked, so the error doesn't happen.

But what I described here happened in our production cluster, in one of our autoscaled jobs. Before, on Nomad 0.8.6, we did not observe this.
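The observation that a small delay hides the problem suggests a stopgap that can be applied outside Nomad: launch the binary through a wrapper that retries only when the launch fails with ETXTBSY. The sketch below is an assumption-laden illustration, not something Nomad provides. The function name `run_with_etxtbsy_retry` and the `attempts` and `delay` defaults are hypothetical and would need tuning.

```python
import errno
import subprocess
import time


def run_with_etxtbsy_retry(argv, attempts=5, delay=0.1):
    """Run argv, retrying only when the launch fails with ETXTBSY.

    Returns the command's exit code. Any other launch error, or ETXTBSY on
    the last attempt, is re-raised unchanged.
    """
    for attempt in range(attempts):
        try:
            return subprocess.run(argv).returncode
        except OSError as exc:
            last_try = attempt == attempts - 1
            if exc.errno != errno.ETXTBSY or last_try:
                raise
            # Linear backoff gives a concurrent writer time to close the file.
            time.sleep(delay * (attempt + 1))
```

This only papers over the race. It does not explain why the file is still open for writing when the task starts.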

stale[bot] commented 4 years ago

Hey there

Since this issue hasn't had any activity in a while - we're going to automatically close it in 30 days. If you're still seeing this issue with the latest version of Nomad, please respond here and we'll keep this open and take another look at this.

Thanks!

stale[bot] commented 4 years ago

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem :+1:

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.