saleyn / erlexec

Execute and control OS processes from Erlang/OTP
https://hexdocs.pm/erlexec/readme.html
Other
525 stars 139 forks source link

When native stdin process buffer is full, erlexec does not retry #157

Closed jpjodoin closed 2 years ago

jpjodoin commented 2 years ago

Symptom

I have an erlang application that will sometime write burst of data through stdout to a worker process using erlexec and it will then stop sending data for a while. In this scenario, it will sometime overload the stdin buffer of the worker process which will return EAGAIN. There doesn't seem to be a mecanism to retry in this scenario, so the remaining data will stay stuck in erlexec until the next time we write new data from erlang. At this point, the old data will be sent to the worker process (until we hit that eagain or empty the buffer). This can lead to scenario where after a long burst, we will be unable to process data for a while even when no longer under load since our worker will be spending his time processing old data that's no longer relevant.

Expected output

I would expect erlexec to retry writing in the stdin buffer until stdin_queue is empty (either immediately, after a delay or after checking if it received other command from the erlang app).

Environement

This issue was observed using the latest (master) version on Centos.

Thanks !

saleyn commented 2 years ago

Were you running with debug option set? If so, I committed a small fix that might have affected the issue. Please check.

Which reactor is used in your case (poll, select)? When the call to poll/select is made, the OS kernel should set the writable state in the file descriptor that is feeding data to your working process through its stdin. This is checked here, which is called repeatedly in the main loop.

So, if this is not happening, then either your worker process is not reading the data from its stdin, or there's some other reason/bug. I would need a small reproducible test case to check this, but based on my availability I wouldn't be able to debug this issue sooner than in a couple of months. If you could look into this and send me a PR, that would be best.

jpjodoin commented 2 years ago

Yes, I was running with the debug option. I will take a look at your last commit, thanks !

In my case, I'm using poll, but before that I was using select (I upgraded to newer version of erlexec with poll while looking at this issue). The reason I think my worker process is ready to read more data but not receiving it is that as soon as my erlang process write something, suddenly erlexec start writing all the previous data until the stdin buffer returns EAGAIN. This is what I'm seeing with strace at least. Thanks for pointing out where this is checked in the main loop, I will add more trace to see what's going on there.

Thanks for your help and I will see if I can provide either a small test case or a PR.

jpjodoin commented 2 years ago

Hello @saleyn, I've found some time to investigate this more on my side. Like you mention, the process_stream_data is indeed called at every loop. The issue comes from here. STDIN_FILENO is not included as part of the stream even though the stdin_queue has thousands of items in it. This is because of this . stdin_wr_pos is 0 so we never look at stdin in the main loop. Not sure why stdin_wr_pos > 0 is necessary here, but I've added a verification for the stdin queue if (stream_fd[STDIN_FILENO] >= 0 && (stdin_wr_pos > 0 || !stdin_queue.empty())) and that resolve the issue for me.

If that make sense to you, I can look into contributing a PR.

saleyn commented 2 years ago

Yes, please submit a PR.

jpjodoin commented 2 years ago

Perfect, will submit the PR when it clears our internal process. Thanks for your help.

saleyn commented 2 years ago

Thanks for troubleshooting this bug! It hasn't come up before.

saleyn commented 2 years ago

I updated the repository to include this fix. Please check.

jpjodoin commented 2 years ago

Hey, Sorry for the delay with our process. Your fix match what I did. I also did the same for select on my side (line 34), but I have only tested with poll. Considering I was using select initially when I found the issue, I'm pretty sure the same fix applies there. Cheers!

saleyn commented 2 years ago

Do you mind running your test with the latest commit? I slightly refactored that code.

jpjodoin commented 2 years ago

Hey @saleyn it seems like there's a mistake in that commit.

In the refactored version you are using append_write_fd for STDOUT and STDERR while you were using append_read_fd before. Once I changed that, it works with my test.

Cheers!

saleyn commented 2 years ago

Thanks. Committed the fix.