Document how `os/spawn` should be cleaned up or handle zombie processes...

amano-kenji commented 5 months ago

For now, I discovered a few tricks for os/spawn as below.

(try
  (with [proc (os/spawn ["prog" "arg1"] :p {:out :pipe})]
    (ev/read (proc :out) :all nil 5))
  ([_] nil))

(try
  (with [proc (os/spawn ["prog" "arg1"] :p {:out :pipe})]
    (ev/with-deadline 5
      (os/proc-wait proc)))
  ([_] nil))

If os/spawned proc is not closed with with, the spawned process ends up devouring a CPU core. This behavior is either intended or a bug. If this behavior is intended, (doc os/spawn) should mention that os/spawned processes should be closed with :close, with, or os/proc-close.

To prevent zombie processes from blocking ev/read or os/proc-wait, I must specify a timeout through ev/read or ev/with-deadline. This is not documented somewhere. Ideally, a way to deal with zombie processes should be documented in the janet core.

pepe commented 5 months ago

I remember there was a flag for not collecting processes, turning them into zombies. Also, a quick search showed that there is a configuration for that.

sogaiu commented 5 months ago

Perhaps the following is not what you were thinking of, but this bit from os/execute's docstring seems related:

    * :d - Don't try and terminate the process on garbage 
      collection (allow spawning zombies).

Since os/spawn's docstring mentions:

    Execute a program on the system and return a handle to the process. 
    Otherwise, takes the same arguments as `os/execute`.

it would seem the intention is for :d to be usable for os/spawn as well.

Further, from the docstrings, it doesn't seem like :d is in effect by default, so if zombies can be spawned with the code examples above, the current behavior and docstrings don't seem to be in agreement.

On a related note, these lines seemed to be about garbage collection and dealing with zombies:

static int janet_proc_gc(void *p, size_t s) {
    (void) s;
    JanetProc *proc = (JanetProc *) p;
#ifdef JANET_WINDOWS
    if (!(proc->flags & JANET_PROC_CLOSED)) {
        if (!(proc->flags & JANET_PROC_ALLOW_ZOMBIE)) {
            TerminateProcess(proc->pHandle, 1);
        }
        CloseHandle(proc->pHandle);
        CloseHandle(proc->tHandle);
    }
#else
    if (!(proc->flags & (JANET_PROC_WAITED | JANET_PROC_ALLOW_ZOMBIE))) {
        /* Kill and wait to prevent zombies */
        kill(proc->pid, SIGKILL);
        int status;
        if (!(proc->flags & JANET_PROC_WAITING)) {
            waitpid(proc->pid, &status, 0);
        }
    }
#endif
    return 0;
}

Also, there is this line:

    int pipe_owner_flags = (is_spawn && (flags & 0x8)) ? JANET_PROC_ALLOW_ZOMBIE : 0;

and this line:

    proc->flags = pipe_owner_flags;

in the underlying C implementation of os/execute and os/spawn.

amano-kenji commented 5 months ago

Perhaps, IO pipe is preventing os/spawned processes from being cleaned up, and janet keeps trying to clean up the processes? That's why janet consumes 100% of a CPU core?

amano-kenji commented 5 months ago

:close on a spawned process calls os/proc-close which calls os/proc-wait and also closes pipes....

I'm not seeing pipe closure.

sogaiu commented 5 months ago

I'm trying to reproduce based on modifying the code samples in the first post above and not having much luck.

I replaced with with let in the Janet code as well as eliminating the timeout but I'm not so sure what program I should use (ls didn't lead to a successful reproduction here).

I presume that "the spawned process ends up devouring a CPU core" means something like CPU utilization goes to 100%. That's not something I'm seeing here, so I'm guessing that I'm not using appropriate code. Though if it's something else, please let me know.

Would you mind posting code that demonstrates the issue - preferably with programs that are likely to be installed on a typical Linux machine?

Possibly if I can reproduce the issue here there's a chance I might be able to observe what actually happens using gdb or record what happens using rr.

amano-kenji commented 5 months ago

Those are minimal examples that reproduce 100% CPU core usage.

(defn dump
  []
  (let [proc (os/spawn ["ls"] :p {:out :pipe :err :pipe})]
    (print (ev/read (proc :out) :all nil 5))))

(dump)
(forever
  (ev/sleep 1))

(defn dump
  []
  (let [proc (os/spawn ["ls"] :p {:out :pipe :err :pipe})]
    (os/proc-wait proc)))

(dump)
(forever
  (ev/sleep 1))

(defn dump
  []
  (let [proc (os/spawn ["ls"] :p {:out :pipe :err :pipe})]
    (ev/with-deadline 5
      (os/proc-wait proc))))

(dump)
(forever
  (ev/sleep 1))

If you replace let with with, 100% CPU core usage goes away. If I replace ev/sleep with os/sleep, 100% CPU core usage goes away.

I think the combination of ev/sleep and unclosed os/spawn is driving CPU core usage. ev is trying to sleep, and os/spawn is respawning the event loop back to life constantly?

amano-kenji commented 5 months ago

(os/spawn ["ls"] :p {:out :pipe})

(forever
  (ev/sleep 1))

leads to 100% cpu core usage.

(os/spawn ["sleep" "100000000"] :p {:out :pipe})

(forever
  (ev/sleep 1))

does not.

(os/spawn ["ls"] :p)

(forever
  (ev/sleep 1))

does not, either.

amano-kenji commented 5 months ago

My conclusion is that ev cannot sleep if a spawned process with IO pipes exited, but was not closed.

Because ev keeps trying to sleep, it may lead to 100% cpu core usage. The 100% cpu core usage is probably ev's way of screaming that it cannot sleep.

bakpakin commented 5 months ago

Will take more of a look at this soon, but this looks to be more related to the event loop not handling and event that keeps trigger, resulting in a busy loop. Namely, ls writes data that you never read, while sleep does nothing so there is no data to read.

Closing stuff has no relevance here.

bakpakin commented 5 months ago

Using 'strace janet test.janet' should better illustrate what is happening to cause a busy loop

amano-kenji commented 5 months ago

Here it is. strace.txt

amano-kenji commented 5 months ago

(def proc (os/spawn ["ls"] :p {:out :pipe}))
(print (ev/read (proc :out) :all))

(forever
  (ev/sleep 1))

also leads to 100% cpu core usage. To remove a busy loop, replace ev/sleep with os/sleep, remove {:out :pipe}, or close proc.

bakpakin commented 5 months ago

I see that you are using poll instead of epoll, which is a nonstandard build on Linux. Using epoll should help here - how did you build Janet and what version are you using?

amano-kenji commented 5 months ago

I'm not using poll. I used meson to build janet-1.33.0. I did not change the default meson options.

amano-kenji commented 5 months ago

By the way, is garbage collector going to correctly deal with this?

(def proc (os/spawn ["ls"] :p {:out :pipe}))
(print (ev/read (proc :out) :all))

(forever
  (ev/sleep 1))

or

(os/spawn ["ls"] :p {:out :pipe})

(forever
  (ev/sleep 1))

I just want to use os/spawn and throw it away. (os/spawn ["ls"] :p {:out :pipe}) is just a convenient way to send standard output to /dev/null.

bakpakin commented 5 months ago

It seems like it, I can see in the strace log. It's a configure build option that for a while was accidentally a default with the meson build. It should show epoll_wait in a loop instead of poll,poll,poll,etc.

bakpakin commented 5 months ago

Your comment about using is/spawn to just run something and forget about it is just not how it works, for a number of reasons. And using :pipe as /dev/null is also a bad idea. After the pipe buffer fills up, your program will hang.

If you want to run and forget, use os/execute. If you want it to wait in the background, wrap it with ev/spawn.

amano-kenji commented 5 months ago

But, os/execute doesn't just send standard output and standard error to /dev/null.

Janet doesn't come with devnull file as a variable.

What if I want to run and forget about something that runs for a few milliseconds?
What if I want to collect standard output of a process asynchronously while discarding its standard error?
- Should I drain 1024 byte chunks of standard error pipe and discard them in ev/spawn?

bakpakin commented 5 months ago

You can open dev null and use that, or read and discard in a loop.

There are some utils here that illustrate it: https://github.com/janet-lang/spork/blob/master/spork/sh.janet

Currently having internet trouble so I'm on a phone, but will look into the busy cpu loops later. That is certainly a bug with the poll backend.

To use epoll, rebuild Janet with meson and -Depoll=true and see if that helps.

EDIT: I think the exec-slurp-all function might do what you want

amano-kenji commented 5 months ago

I think the following commands fixed 100% cpu core usage.

meson setup -Depoll=true
meson compile
./janet test.janet

Perhaps, epoll should not even be an option?

amano-kenji commented 5 months ago

Neither exec-slurp or exec-slurp-all closes IO pipe streams.

Perhaps, is it okay to not close proc or IO pipe streams?

And using :pipe as /dev/null is also a bad idea. After the pipe buffer fills up, your program will hang.

Then, exec-slurp-all can be used to discard both stdout and stderr?

amano-kenji commented 5 months ago

According to my test

(os/spawn ["ls"] :p {:out :pipe :err :pipe})

(forever
  (ev/sleep 1))

led to 100% cpu usage. However, this didn't.

(import spork/sh)

(os/spawn ["ls"] :p {:out (sh/devnull) :err (sh/devnull)})

(forever
  (ev/sleep 1))

(doc os/spawn) doesn't tell me I could pass a stream to :out and :err.

If I call (os/spawn ["ls"] :p {:out (sh/devnull) :err (sh/devnull)}) many times, won't it make janet hang?

amano-kenji commented 5 months ago

I also found a weird thing. I discovered that this is possible.

(import spork/sh)

(os/execute ["ls"] :p {:out (sh/devnull) :err (sh/devnull)})

(doc os/execute) didn't tell me that I could pass a stream to :out or :err.

bakpakin commented 5 months ago

os/execute doesn't work with files - note that devnull uses os/open rather than file/ open. These are in Janet what are called Streams and are just wrappers around Unix file descriptors. It corresponds to calling open() in C.

bakpakin commented 5 months ago

Os/execute takes all of the same arguments as os/spawn

amano-kenji commented 5 months ago

I discovered that os/spawn and os/execute can accept core/file and core/stream for :err and :out.

amano-kenji commented 5 months ago

Tell me which of the following items will lead to problems.

(forever (ev/sleep 1) (:close (os/spawn ["ls"] :p {:out :pipe :err :pipe})))
(forever (ev/sleep 1) (:close (os/spawn ["ls"] :p {:out (sh/devnull) :err (sh/devnull)})))
(forever (ev/sleep 1) (os/spawn ["ls"] :p {:out :pipe :err :pipe}))
(forever (ev/sleep 1) (os/spawn ["ls"] :p {:out (sh/devnull) :err (sh/devnull)}))
(forever (ev/sleep 1) (os/execute ["ls"] :p {:out (sh/devnull) :err (sh/devnull)}))

bakpakin commented 5 months ago

This is an issue tracker, not a chatroom. I think we are getting off topic and I feel like I'm fielding random questions in a thread. Are you trying to fix your problem or understand what is going on with each example program? It's hard to track things when you have posted like 20 different variations of a program just a short period. Please have a goal in mind.

The 100% cpu usage is certainly an issue but most likely mostly related to the event loop implementation and poll. In either case, your program is not ideal.

A few things:

you don't need ev/spawn-thread. This is just going to cause you headaches. Use ev/spawn to run a background task unless the background task makes a blocking call. Neither os/execute not os/spawn block. I write medium sized, complete programs without using threads at all if I can avoid it.
don't redirect output to pipes that are never read from. This causes things to hang in any language. It's how pipes work on Unix likes and most languages work this way.
look to the sh.janet examples. They are written that way for a reason, using ev/gather to avoid race conditions. It's surprisingly tricky to get this correct - this is why Python has a function subprocess.communicate to just "do the IO" after spawning a process.
So I misspoke when I said os/execute doesn't use files. It can use both files and streams for :in , :out, and :err.
if you use os/spawn, I would always be sure to use os/proc-wait .

For your example, I would really do something like (sh/exec-slurp-all "my-program" "arg1") or just use (os/execute). os/spawn is mean t for long running subprocess thag you want to monitor and interact with, for example piping data to the processes stdin.

amano-kenji commented 5 months ago

The issue is confusion around how os/spawn should be cleaned up. After confusion clears up, documentation needs to improve. Let me summarize my understanding.

It's okay to not close proc returned by os/spawn or pipe streams created by os/spawn, but I must drain pipe streams created by os/spawn through (ev/read stream :all) or repetition of (ev/read stream n) until ev/read returns nil. I suspect both os/spawn and janet objects representing pipe streams are digested by garbage collector later.
It's okay to just throw (sh/devnull) to :in, :out and :err of os/spawn and os/execute. Garbage collector will digest (sh/devnull) and os/spawn later.

Why would I want to always use os/proc-wait against os/spawn? To prevent garbage collector from destroying os/spawn before it finishes? Isn't it sufficient to wait for (ev/read (proc :out) :all)?

In my j3blocks cmd module, if ev/read on pipe stream of a long-running process returns nil, I call os/proc-wait for process return code. If os/proc-wait on an ephemeral command returns a return code, I call (ev/read out :all). Can this lead to a race condition?

bakpakin commented 5 months ago

I'm not putting all of the nuance here into the doctoring for os/spawn.

Why would I want to always use os/proc-wait against os/spawn? To prevent garbage collector from destroying os/spawn before it finishes? Isn't it sufficient to wait for (ev/read (proc :out) :all)?

Nothing to do with the garbage collector. I don't know why you are so focused on the garbage collector, frankly it's not relevant to any of this. It will make a best effort to clean up resources but you don't really know when it will run.

The reason you call os/proc-wait is to avoid zombies. Same as any scripting language - if you want more info on this, read the man pages for waitpid(2).

Also notice how in sh.janet, is/proc-wait and ev/read run in parallel.

As far as race conditions, I was mainly talking about the general case - depending on what program you run, some things will work, some won't. Programs like 'sed' that incrementally read from stdin and then output text in no particular manner can do this quite easily. There are a number of other bugs in the issue tracker where we figured this stuff out and made things work reliably with the patterns in sh.janet.

As for "why" it works like this, the answer is simply because it's how POSIX works. os/spawn corresponds to posix_spawn and os/proc-wait corresponds to waitpid.

So is that enough? I think there are a couple of solutions here:

use sh.janet exec-slurp or exec-slurp-all
modify the above functions to be similar to your current function.

amano-kenji commented 5 months ago

As I was writing my own code, I discovered that exec-slurp and exec-slurp-all often don't give me enough control.

So, I have a few questions.

Is it safe to pass (sh/devnull) to stdout and stderr of ox/execute and os/spawn?
Were there actually race conditions where I cannot execute ev/read after os/proc-waitor os/proc-wait after ev/read? For long-running processes, I cannot wait for both ev/read and ev/proc-wait because ev/proc-wait will stall ev/read for hours. For ephemeral processes, I can. Can you describe what actually causes race conditions?

If I have answers to those questions, then I think I or someone else can maybe submit a pull request to improve documentation. The current documentation for os/execute and os/spawn omits important details and is actually wrong.

bakpakin commented 5 months ago

Is it safe to pass (sh/devnull) to stdout and stderr of ox/execute and os/spawn?

Yes.

For long-running processes, I cannot wait for both ev/read and ev/proc-wait because ev/proc-wait will stall ev/read for hours

I'm not sure where you get this idea. Noting is being "stalled". And what is wrong with waiting until the process is complete? Ev/read doesn't block anything if you wrap it with ev/spawn, you can run it on its own fiber. That is what ev/gather does. The general issue is that a subprocess that is writing to a pipe will get stuck if the pipe is not emptied and fills up. So anytime you redirect :out or :err to a pipe and don't read it, you can get this hanging issue. Other than that, the race condition I was originally thinking about has more to do with when there is also something pipe to stdin of the subprocess. I don't think it applies here if you just want the output. So if you call proc-wait and only after try to read the output, that generally won't work.

Here is some more example code that sets up and handles a long running sub process. Different from your use case I think, but has similar structure: https://github.com/janet-lang/spork/blob/7a4eff4bfb9486a6c6079ee8bb12e6789cce4564/spork/tasker.janet#L98

As far as actually being race conditions in your code, I don't know. I'm just trying to caution you since you seem to be unsure.

amano-kenji commented 5 months ago

I'm not sure where you get this idea. Noting is being "stalled". And what is wrong with waiting until the process is complete?

My j3blocks module has to respond to each line from a wireplumber script which can run for hours.

If it had to wait for the wireplumber script to exit, my swaybar would not show me updates to pipewire nodes for hours.

I want pipewire updates to be shown to me immediately. Because I want to read lines from a wireplumber script, I am forced to call os/proc-close after ev/read returns nil. ev/read returns nil when the wireplumber script exits. The wireplumber script exits when I restart wireplumber and pipewire.

I was originally thinking about has more to do with when there is also something pipe to stdin of the subprocess. I don't think it applies here if you just want the output. So if you call proc-wait and only after try to read the output, that generally won't work.

First of all, exec-slurp and exec-slurp-all return strings. They don't return pipe streams that can be fed to stdin of another subprocess. If you give stdout pipe of a subprocess to stdin of another subprocess, then you are not going to call ev/read or ev/write yourself.

I think calling os/proc-wait before ev/read doesn't hurt. If ev/with-deadline wraps a with block that calls os/proc-wait before ev/read, then any error caused by ev/with-deadline will cause with block to close the pipes. If os/proc-wait is called after (ev/read (proc :out) :all), then a zombie process may block os/proc-close because os/proc-close calls os/proc-wait if it hasn't been called already.

You eliminate this calculation by calling them together in ev/gather, but if you are going to call an ephemeral command that quits right away, then you can either call ev/gather or call os/proc-wait before ev/read.

If you know what you are doing, then you are not required to use ev/gather, I guess.

amano-kenji commented 5 months ago

I'm think I'm ready to submit a pull request for this issue.

As you said, I didn't need threads in most cases. Now, my j3blocks program uses ev/spawn-thread only for reading lines from standard input which is still a file.

Further discussion will happen in pull request.

amano-kenji commented 5 months ago

After days of working with janet subprocess API, I finally understood. I'm now a subprocess master.

Because the pipe buffer is limited, if the subprocess output is huge, calling ev/read after os/proc-wait finishes can block the subprocess from printing more on stdout. However, if (ev/read stream 1024) is called in a loop or (ev/read stream :all) is called once, then os/proc-wait can be called after (ev/read stream 1024) returns nil or (ev/read stream :all) returns. This is what you meant by race condition.

I think this should be documented in os/spawn and os/proc-wait. Not every programmer understands subprocess well. I will prepare a pull request soon.

sogaiu commented 5 months ago

There seems to be a fair number of noteworthy tidbits in this issue. May be we can make homes for some of them.

As for specifics, some of what's in this comment (may be there is some overlap with what's mentioned immediately above the present comment) might be nice to have somewhere too.

amano-kenji commented 5 months ago

May be we can make homes for some of them.

The janet website can have a section about subprocess management, but I improved documentation on subprocess API.

sogaiu commented 5 months ago

Perhaps a page for subprocess management between The Event Loop page and the Multithreading page could work.

I'm ok to participate in creating one if that's a good path forward. I don't think I understand all of the relevant details though, so likely I'll need to get up to speed on various things. I'll make an issue at the janet-lang.org repository about creating such a page.

amano-kenji commented 5 months ago

The pull request above explains details you need to know.

sogaiu commented 5 months ago

Thanks for pointing that out.

I've included mention of it in the newly created issue.

janet-lang / janet

Document how `os/spawn` should be cleaned up or handle zombie processes... #1386