Closed tchaikov closed 1 year ago
looks like an issue related to kernel 5.11 [0]. i am not able to reproduce it with exactly the same docker image on linux 6.0.0-5-amd64 .
testing with kernel 5.11, i've repeated the test for 613 times. still not able to reproduce it.
[0] https://discuss.circleci.com/t/linux-machine-executor-images-2022-january-q1-update/42831
ERROR 2022-12-09 16:21:09,318 [shard 0] testlog - failed to write to stdin: std::system_error (error system:32, Broken pipe)
ERROR 2022-12-09 16:21:09,318 [shard 0] testlog - failed to read from stdout: std::system_error (error system:32, Broken pipe)
unknown location(0): fatal error: in "test_spawn_input": std::system_error: Broken pipe
now that a4e1508ba5fb8799ff530d1a22f02e34a84a1e6f has been merged. i am closing this issue. will reopen it if this failure surfaces again.
Hi, kefu. I'm also curious about the 'Broken pipe' issue and have made some investigations last weekend.
Since the error is caused by cat
closing the pipe fd prematurely, which shouldn't have happened in our simple test, but was not the case indeed. So I was assured that something went wrong with cat
, and I guessed that cat
itself may leave some clues to us and then I added some code to read the stderr
of the subprocess after stdin.flush()
:
return stdin.write(text).then([&stdin] {
return stdin.flush();
}).finally([&stderr]() {
return stderr.read_up_to(1024).then([](temporary_buffer<char> err) {
seastar_logger.error("read error from /bin/cat: {}", std::string(err.begin(), err.size()));
return make_ready_future<>();
});
});
And luckily I reproduced the issue on my machine(thank God!):
% for i in `seq 65535`; do ./build/debug/tests/unit/spawn_test | grep -i 'Broken pipe' && break; done;
WARNING: debug mode. Not for benchmarking or production
INFO 2023-03-13 02:56:52,063 seastar - Reactor backend: linux-aio
WARN 2023-03-13 02:56:52,084 [shard 6] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN 2023-03-13 02:56:52,084 [shard 0] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN 2023-03-13 02:56:52,084 [shard 5] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN 2023-03-13 02:56:52,084 [shard 4] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN 2023-03-13 02:56:52,084 [shard 2] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN 2023-03-13 02:56:52,084 [shard 9] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN 2023-03-13 02:56:52,084 [shard 11] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN 2023-03-13 02:56:52,084 [shard 10] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN 2023-03-13 02:56:52,084 [shard 13] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN 2023-03-13 02:56:52,084 [shard 7] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN 2023-03-13 02:56:52,084 [shard 1] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN 2023-03-13 02:56:52,084 [shard 8] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN 2023-03-13 02:56:52,084 [shard 12] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN 2023-03-13 02:56:52,084 [shard 3] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN 2023-03-13 02:56:52,084 [shard 15] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN 2023-03-13 02:56:52,084 [shard 14] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
INFO 2023-03-13 02:56:52,118 [shard 0] seastar - Created fair group io-queue-0, capacity rate 2147483:2147483, limit 12582912, rate 16777216 (factor 1), threshold 2000
INFO 2023-03-13 02:56:52,119 [shard 0] seastar - IO queue uses 0.75ms latency goal for device 0
INFO 2023-03-13 02:56:52,120 [shard 0] seastar - Created io group dev(0), length limit 4194304:4194304, rate 2147483647:2147483647
INFO 2023-03-13 02:56:52,120 [shard 0] seastar - Created io queue dev(0) capacities: 512:2000:2000 1024:3000:3000 2048:5000:5000 4096:9000:9000 8192:17000:17000 16384:33000:33000 32768:65000:65000 65536:129000:129000 131072:257000:257000
ERROR 2023-03-13 02:56:52,264 [shard 0] seastar - read error from /bin/cat: /bin/cat: -: Resource temporarily unavailable
/data/Workspace/iSoft/seastar/tests/unit/spawn_test.cc(111): error: in "test_spawn_input": failed to write to stdin: std::system_error (error system:32, Broken pipe)
*** No errors detected
ERROR 2023-03-13 02:56:52,264 [shard 0] seastar - read error from /bin/cat: /bin/cat: -: Resource temporarily unavailable
And after searching google, I found several similar issue reports and this one gave me a hint:
Node.js sets its stdio pipe ends into non-blocking mode to work with async I/O, but cat expects blocking pipes.
We did make pipe fds non-blocking!
Maybe we should set fds used in subprocess blocking by default and leave the right to enable non-blocking to the child itself?
int nonblocking = 0;
std::get<pipefd_read_end>(cin_pipe).ioctl(FIONBIO, &nonblocking);
std::get<pipefd_write_end>(cout_pipe).ioctl(FIONBIO, &nonblocking);
std::get<pipefd_write_end>(cerr_pipe).ioctl(FIONBIO, &nonblocking);
@balusch hi Jianyong, thank you very much for looking into this issue! both your analysis and fix make sense to me. so i took the liberty of creating a PR based on your proposal -- the only contribution from me is to verify the fix and adapt your reasoning to a commit message. could you help review it?
...
WARN 2023-03-15 11:16:33,958 [shard 1] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARNING: unable to mbind shard memory; performance may suffer: Operation not permitted
WARN 2023-03-15 11:16:34,002 [shard 15] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
random-seed=2719424098
ERROR 2023-03-15 11:16:34,408 [shard 0] seastar - read error from /bin/cat: /bin/cat: -: Resource temporarily unavailable
/data/Workspace/iSoft/seastar/tests/unit/spawn_test.cc(108): error: in "test_spawn_input": failed to write to stdin: std::system_error (error system:32, Broken pipe)
/data/Workspace/iSoft/seastar/tests/unit/spawn_test.cc(117): error: in "test_spawn_input": check sstring(echo.get(), echo.size()) == text has failed [ != hello world
]
*** 2 failures are detected in the test module "Master Test Suite"
also reproduced on my machine after an era, but fotunatelly we have some clues this time -- yes, the same error message as before, so I think the problem is still caused by the nonblockingness of pipe fds used by the child, and we just didn't fix it correctly.
I start to guess maybe the fds used by child are still nonblocking, which I cannot check since /bin/cat is a utility without source code, so I write my own version:
// mycat.c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <assert.h>
static void test_nonblocking();
int
main(int argc, char **argv)
{
test_nonblocking();
return 0;
}
static void
test_nonblocking()
{
#define TEST(desc, fd) \
do { \
int status = fcntl(fd, F_GETFL, 0); \
assert(status >= 0); \
if (status & O_NONBLOCK) { \
printf(#desc " is nonblocking\n"); \
} else { \
printf(#desc " is blocking\n"); \
} \
} while (0)
TEST(stdin, STDIN_FILENO);
TEST(stdout, STDOUT_FILENO);
TEST(stderr, STDERR_FILENO);
#undef TEST
}
it simply prints the blockingness of stdin/stdout/stderr through stdout, without echoing data back.
And replaced /bin/cat in test_spawn_input
, re-ran it and got the result:
/data/Workspace/iSoft/seastar/tests/unit/spawn_test.cc(117): error: in "test_spawn_input": check sstring(echo.get(), echo.size()) == text has failed [stdin is non != hello world
although the message read from stdout of the child is incomplete, we still know it's nonblocking!
Finally I found the reason: we misused the file_desc::ioctl
method.
There are lots of versions of ioctl
in file_desc
:
int ioctl(int request) {
return ioctl(request, 0);
}
int ioctl(int request, int value) {
int r = ::ioctl(_fd, request, value);
throw_system_error_on(r == -1, "ioctl");
return r;
}
int ioctl(int request, unsigned int value) {
int r = ::ioctl(_fd, request, value);
throw_system_error_on(r == -1, "ioctl");
return r;
}
template <class X>
int ioctl(int request, X& data) {
int r = ::ioctl(_fd, request, &data);
throw_system_error_on(r == -1, "ioctl");
return r;
}
template <class X>
int ioctl(int request, X&& data) {
int r = ::ioctl(_fd, request, &data);
throw_system_error_on(r == -1, "ioctl");
return r;
}
what we need is the last two, and std::get<pipefd_read_end>(cin_pipe).ioctl(FIONBIO, &nonblocking);
have chosen the last version actually, but with data = &nonblocking != 0
, which is not what we intend to do.
So we could replace previous fix with
std::get<pipefd_read_end>(cin_pipe).template ioctl<int>(FIONBIO, 0);
std::get<pipefd_write_end>(cout_pipe).template ioctl<int>(FIONBIO, 0);
std::get<pipefd_write_end>(cerr_pipe).template ioctl<int>(FIONBIO, 0);
it should work.
actually, i was reading cat's source code couple months ago. but ended up being clueless and gave up: https://github.com/coreutils/coreutils/blob/master/src/cat.c
@balusch hi Jianyolng, could you prepare a patch? i think it's you who did all the heavy lift.
could you prepare a patch? i think it's you who did all the heavy lift.
Sure, my pleasure.
thank you!
actually, i was reading cat's source code couple months ago. but ended up being clueless and gave up: coreutils/coreutils@
master
/src/cat.c
I intended to read it last weekend, but suddenly thought that cat may leave us some information through stderr. 😃
not able to reproduce this failure locally with