scylladb / seastar

High performance server-side application framework
http://seastar.io
Apache License 2.0
8.38k stars 1.55k forks source link

test_spawn_input fails randomly #1320

Closed tchaikov closed 1 year ago

tchaikov commented 1 year ago
./run ./configure.py --compiler g++-12 --c++-standard 20 ./run ninja -C build/debug ./run ./test.py --mode=debug 
...
72/74 Test #72: Seastar.unit.spawn ............................***Failed    1.10 sec
[0/1] cd /home/circleci/project/build/debug/tests/unit && /home/circleci/project/build/debug/tests/unit/spawn_test -- -c 2
Running 5 test cases...
WARNING: debug mode. Not for benchmarking or production
INFO  2022-11-30 06:52:59,248 seastar - Reactor backend: linux-aio
WARN  2022-11-30 06:52:59,259 [shard 0] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:13, perf_event_open() failed: Permission denied)
WARN  2022-11-30 06:52:59,259 [shard 1] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:13, perf_event_open() failed: Permission denied)
INFO  2022-11-30 06:52:59,273 [shard 1] seastar - Created fair group io-queue-0, capacity rate 2147483:2147483, limit 12582912, rate 16777216 (factor 1), threshold 2000
INFO  2022-11-30 06:52:59,275 [shard 1] seastar - IO queue uses 0.75ms latency goal for device 0
INFO  2022-11-30 06:52:59,277 [shard 1] seastar - Created io group dev(0), length limit 4194304:4194304, rate 2147483647:2147483647
INFO  2022-11-30 06:52:59,278 [shard 0] seastar - Created io queue dev(0) capacities: 512:2000:2000 1024:3000:3000 2048:5000:5000 4096:9000:9000 8192:17000:17000 16384:33000:33000 32768:65000:65000 65536:129000:129000 131072:257000:257000
random-seed=4260918692
unknown location(0): fatal error: in "test_spawn_input": std::system_error: Broken pipe
/home/circleci/project/src/testing/seastar_test.cc(43): last checkpoint

not able to reproduce this failure locally with

$ for i in `seq 1000`; do
  tests/unit/spawn_test -- -c 2 || break
done
tchaikov commented 1 year ago

looks like an issue related to kernel 5.11 [0]. i am not able to reproduce it with exactly the same docker image on linux 6.0.0-5-amd64 .

testing with kernel 5.11, i've repeated the test for 613 times. still not able to reproduce it.


[0] https://discuss.circleci.com/t/linux-machine-executor-images-2022-january-q1-update/42831

tchaikov commented 1 year ago
ERROR 2022-12-09 16:21:09,318 [shard 0] testlog - failed to write to stdin: std::system_error (error system:32, Broken pipe)
ERROR 2022-12-09 16:21:09,318 [shard 0] testlog - failed to read from stdout: std::system_error (error system:32, Broken pipe)
unknown location(0): fatal error: in "test_spawn_input": std::system_error: Broken pipe

see https://app.circleci.com/pipelines/github/tchaikov/seastar/289/workflows/b88e8c00-45f5-49b2-99ce-8036b372b728/jobs/3438

tchaikov commented 1 year ago

now that a4e1508ba5fb8799ff530d1a22f02e34a84a1e6f has been merged. i am closing this issue. will reopen it if this failure surfaces again.

tchaikov commented 1 year ago

https://app.circleci.com/pipelines/github/scylladb/seastar/1656/workflows/6ad4b28f-1573-4a68-a75b-532f5cdc933d/jobs/6629/parallel-runs/0/steps/0-105

balusch commented 1 year ago

Hi, kefu. I'm also curious about the 'Broken pipe' issue and have made some investigations last weekend.

Since the error is caused by cat closing the pipe fd prematurely, which shouldn't have happened in our simple test, but was not the case indeed. So I was assured that something went wrong with cat, and I guessed that cat itself may leave some clues to us and then I added some code to read the stderr of the subprocess after stdin.flush():

return stdin.write(text).then([&stdin] {
    return stdin.flush();
}).finally([&stderr]() {
    return stderr.read_up_to(1024).then([](temporary_buffer<char> err) {
        seastar_logger.error("read error from /bin/cat: {}", std::string(err.begin(), err.size()));
        return make_ready_future<>();
    });
});

And luckily I reproduced the issue on my machine(thank God!):

% for i in `seq 65535`; do ./build/debug/tests/unit/spawn_test | grep -i 'Broken pipe' && break; done;
WARNING: debug mode. Not for benchmarking or production
INFO  2023-03-13 02:56:52,063 seastar - Reactor backend: linux-aio
WARN  2023-03-13 02:56:52,084 [shard  6] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard  0] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard  5] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard  4] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard  2] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard  9] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard 11] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard 10] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard 13] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard  7] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard  1] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard  8] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard 12] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard  3] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard 15] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARN  2023-03-13 02:56:52,084 [shard 14] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
INFO  2023-03-13 02:56:52,118 [shard  0] seastar - Created fair group io-queue-0, capacity rate 2147483:2147483, limit 12582912, rate 16777216 (factor 1), threshold 2000
INFO  2023-03-13 02:56:52,119 [shard  0] seastar - IO queue uses 0.75ms latency goal for device 0
INFO  2023-03-13 02:56:52,120 [shard  0] seastar - Created io group dev(0), length limit 4194304:4194304, rate 2147483647:2147483647
INFO  2023-03-13 02:56:52,120 [shard  0] seastar - Created io queue dev(0) capacities: 512:2000:2000 1024:3000:3000 2048:5000:5000 4096:9000:9000 8192:17000:17000 16384:33000:33000 32768:65000:65000 65536:129000:129000 131072:257000:257000
ERROR 2023-03-13 02:56:52,264 [shard  0] seastar - read error from /bin/cat: /bin/cat: -: Resource temporarily unavailable

/data/Workspace/iSoft/seastar/tests/unit/spawn_test.cc(111): error: in "test_spawn_input": failed to write to stdin: std::system_error (error system:32, Broken pipe)

*** No errors detected

ERROR 2023-03-13 02:56:52,264 [shard 0] seastar - read error from /bin/cat: /bin/cat: -: Resource temporarily unavailable

And after searching google, I found several similar issue reports and this one gave me a hint:

Node.js sets its stdio pipe ends into non-blocking mode to work with async I/O, but cat expects blocking pipes.

We did make pipe fds non-blocking!

Maybe we should set fds used in subprocess blocking by default and leave the right to enable non-blocking to the child itself?

int nonblocking = 0;
std::get<pipefd_read_end>(cin_pipe).ioctl(FIONBIO, &nonblocking);
std::get<pipefd_write_end>(cout_pipe).ioctl(FIONBIO, &nonblocking);
std::get<pipefd_write_end>(cerr_pipe).ioctl(FIONBIO, &nonblocking);
tchaikov commented 1 year ago

@balusch hi Jianyong, thank you very much for looking into this issue! both your analysis and fix make sense to me. so i took the liberty of creating a PR based on your proposal -- the only contribution from me is to verify the fix and adapt your reasoning to a commit message. could you help review it?

balusch commented 1 year ago
...
WARN  2023-03-15 11:16:33,958 [shard  1] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
WARNING: unable to mbind shard memory; performance may suffer: Operation not permitted
WARN  2023-03-15 11:16:34,002 [shard 15] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:1, perf_event_open() failed: Operation not permitted)
random-seed=2719424098
ERROR 2023-03-15 11:16:34,408 [shard  0] seastar - read error from /bin/cat: /bin/cat: -: Resource temporarily unavailable

/data/Workspace/iSoft/seastar/tests/unit/spawn_test.cc(108): error: in "test_spawn_input": failed to write to stdin: std::system_error (error system:32, Broken pipe)
/data/Workspace/iSoft/seastar/tests/unit/spawn_test.cc(117): error: in "test_spawn_input": check sstring(echo.get(), echo.size()) == text has failed [ != hello world
]

*** 2 failures are detected in the test module "Master Test Suite"

also reproduced on my machine after an era, but fotunatelly we have some clues this time -- yes, the same error message as before, so I think the problem is still caused by the nonblockingness of pipe fds used by the child, and we just didn't fix it correctly.

I start to guess maybe the fds used by child are still nonblocking, which I cannot check since /bin/cat is a utility without source code, so I write my own version:

// mycat.c

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <assert.h>

static void test_nonblocking();

int
main(int argc, char **argv)
{
    test_nonblocking();
    return 0;
}

static void
test_nonblocking()
{

#define TEST(desc, fd)                              \
do {                                                \
    int status = fcntl(fd, F_GETFL, 0);             \
    assert(status >= 0);                            \
    if (status & O_NONBLOCK) {                      \
        printf(#desc " is nonblocking\n");          \
    } else {                                        \
        printf(#desc " is blocking\n");             \
    }                                               \
} while (0)

    TEST(stdin, STDIN_FILENO);
    TEST(stdout, STDOUT_FILENO);
    TEST(stderr, STDERR_FILENO);

#undef TEST
}

it simply prints the blockingness of stdin/stdout/stderr through stdout, without echoing data back.

And replaced /bin/cat in test_spawn_input, re-ran it and got the result:

/data/Workspace/iSoft/seastar/tests/unit/spawn_test.cc(117): error: in "test_spawn_input": check sstring(echo.get(), echo.size()) == text has failed [stdin is non != hello world

although the message read from stdout of the child is incomplete, we still know it's nonblocking!

Finally I found the reason: we misused the file_desc::ioctl method.

There are lots of versions of ioctl in file_desc:

    int ioctl(int request) {
        return ioctl(request, 0);
    }
    int ioctl(int request, int value) {
        int r = ::ioctl(_fd, request, value);
        throw_system_error_on(r == -1, "ioctl");
        return r;
    }
    int ioctl(int request, unsigned int value) {
        int r = ::ioctl(_fd, request, value);
        throw_system_error_on(r == -1, "ioctl");
        return r;
    }
    template <class X>
    int ioctl(int request, X& data) {
        int r = ::ioctl(_fd, request, &data);
        throw_system_error_on(r == -1, "ioctl");
        return r;
    }
    template <class X>
    int ioctl(int request, X&& data) {
        int r = ::ioctl(_fd, request, &data);
        throw_system_error_on(r == -1, "ioctl");
        return r;
    }

what we need is the last two, and std::get<pipefd_read_end>(cin_pipe).ioctl(FIONBIO, &nonblocking); have chosen the last version actually, but with data = &nonblocking != 0, which is not what we intend to do.

So we could replace previous fix with

std::get<pipefd_read_end>(cin_pipe).template ioctl<int>(FIONBIO, 0);
std::get<pipefd_write_end>(cout_pipe).template ioctl<int>(FIONBIO, 0);
std::get<pipefd_write_end>(cerr_pipe).template ioctl<int>(FIONBIO, 0);

it should work.

tchaikov commented 1 year ago

actually, i was reading cat's source code couple months ago. but ended up being clueless and gave up: https://github.com/coreutils/coreutils/blob/master/src/cat.c

tchaikov commented 1 year ago

@balusch hi Jianyolng, could you prepare a patch? i think it's you who did all the heavy lift.

balusch commented 1 year ago

could you prepare a patch? i think it's you who did all the heavy lift.

Sure, my pleasure.

tchaikov commented 1 year ago

thank you!

balusch commented 1 year ago

actually, i was reading cat's source code couple months ago. but ended up being clueless and gave up: coreutils/coreutils@master/src/cat.c

I intended to read it last weekend, but suddenly thought that cat may leave us some information through stderr. 😃