seastar: Unable to set SCHED_FIFO

erszcz commented 2 years ago

Version & Environment

Redpanda version: (use rpk version): official vectorized/redpanda:v21.10.2 container image

What went wrong?

We're running Redpanda as a container in OpenShift. We're getting log messages from Seastar about its inability to set SCHED_FIFO:

Unable to set SCHED_FIFO scheduling policy for timer thread; latency impact possible. Try adding CAP_SYS_NICE

What should have happened instead?

We expect Redpanda NOT to log the above message.

We've tried granting the container the CAP_SYS_NICE capability, as suggested in the log message itself and in the issues https://github.com/scylladb/scylla-operator/issues/107 and https://github.com/scylladb/seastar/issues/382, but the problem persists.

How to reproduce the issue?

I tried creating a test program to sanity check cloud / container based envs for the required capabilities, but I cannot get the warning to go away even in a local container, even though the capability is correctly enabled as evidenced by capsh --print. The steps to reproduce are here - https://gist.github.com/erszcz/5ceca0866df5748f9a3dda7654467f2d.

Questions

Is it possible to run Redpanda in a container and NOT get the Unable to set SCHED_FIFO message logged? If so, could you suggest steps to fix the sanity-check program provided in the gist above or to enable the capability correctly?
Is it advised to always run Redpanda outside a container, on a dedicated VM or a dedicated machine?

JIRA Link: CORE-819

travisdowns commented 2 years ago

@erszcz - your reproduction case has an uninitialized variable in it, which is what causes it to fail on my machine (a hint is that it fails with "invalid argument" not "permission denied"):

 ::sched_param sp;  // sp is uninit here
 auto sched_ok = pthread_setschedparam(thread, SCHED_FIFO, &sp); // and passed to schedparam here

When I fix that (changing it to ::sched_param sp{1};) I am able to successfully run sched inside docker (as long as --cap-add=sys_nice is set, as you've done). Can you try it again with this fix?
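To make the distinction between the two failure modes concrete, here is a minimal sketch, assuming Linux and glibc. The helper names (fifo_priority_in_range, explain_sched_error, try_set_fifo) are made up for illustration and do not come from Seastar or from the gist. It value-initializes sched_param, checks the priority against the range the OS reports for SCHED_FIFO, and maps the return code of pthread_setschedparam to a hint: EINVAL points at a bad argument such as an uninitialized priority, while EPERM points at missing privileges.

```cpp
#include <cerrno>
#include <pthread.h>
#include <sched.h>
#include <string>

// Hypothetical helper: true if `priority` lies inside the range the OS
// accepts for SCHED_FIFO.
bool fifo_priority_in_range(int priority) {
    return priority >= sched_get_priority_min(SCHED_FIFO)
        && priority <= sched_get_priority_max(SCHED_FIFO);
}

// Hypothetical helper: turn the return code of pthread_setschedparam into a
// hint. pthread functions return the error code directly rather than
// promising to set errno, so the code is passed in explicitly.
std::string explain_sched_error(int rc) {
    switch (rc) {
    case 0:
        return "ok";
    case EINVAL:
        return "invalid argument: check that sched_param is initialized "
               "and the priority is in range";
    case EPERM:
        return "operation not permitted: the caller lacks the required "
               "privilege (for example CAP_SYS_NICE)";
    default:
        return "unexpected error code " + std::to_string(rc);
    }
}

// Try to switch `thread` to SCHED_FIFO at `priority`.
// Returns the pthread error code (0 on success).
int try_set_fifo(pthread_t thread, int priority) {
    ::sched_param sp{};            // value-initialized: all fields start at zero
    sp.sched_priority = priority;  // set by name instead of relying on member order
    return pthread_setschedparam(thread, SCHED_FIFO, &sp);
}
```

Calling explain_sched_error(try_set_fifo(thread, 1)) from the reproduction program would then say directly whether the failure is an argument problem or a permission problem.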

erszcz commented 2 years ago

@travisdowns Thanks for spotting that, my bad. With this fix I managed to get the policy set correctly in a container. I'll follow up with more details on our OpenShift Redpanda setup as I gather them.

travisdowns commented 2 years ago

Thanks Radek, looking forward to what you find!

emaxerrno commented 2 years ago

@erszcz can you close this ticket if it solves your problem? would be great to get your findings w/ OpenShift noted here too :)

dotnwat commented 2 years ago

your reproduction case has an uninitialized variable in it,

nice @travisdowns I had reproduced this locally and was totally baffled!

travisdowns commented 2 years ago

Heh, me too: when it still failed outside of docker on bare metal, I reflexively ran it under valgrind, which I still find useful even in the face of ASAN since it can catch uninitialized reads, which ASAN can't.

dotnwat commented 2 years ago

@VadimPlh you might consider running your failing test with alien thread under valgrind. I think that I had at one point suggested valgrind but wasn't sure if it was necessary given our use of clang sanitizers. It sounds like it might be.

erszcz commented 2 years ago

I spent more time experimenting with this and still cannot get SCHED_FIFO set in a pod in either GKE or OpenShift. I imagine the same would apply to Redpanda, given that my test program is extracted from the Redpanda source code.

The test program:

// See Seastar - a Redpanda component - code at
// https://github.com/vectorizedio/seastar/blob/f8ec733c36f0829d56a17103c916154a946128be/src/core/reactor_backend.cc#L705
//
//void reactor_backend_epoll::start_tick() {
//    _task_quota_timer_thread = std::thread(&reactor_backend_epoll::task_quota_timer_thread_fn, this);

//    ::sched_param sp;
//    sp.sched_priority = 1;
//    auto sched_ok = pthread_setschedparam(_task_quota_timer_thread.native_handle(), SCHED_FIFO, &sp);
//    if (sched_ok != 0 && _r._id == 0) {
//        seastar_logger.warn("Unable to set SCHED_FIFO scheduling policy for timer thread; latency impact possible. Try adding CAP_SYS_NICE");
//    }
//}

#include <cstdio>
#include <cstdlib>
#include <err.h>
#include <pthread.h>
#include <unistd.h>

void* run(void*) {
        // sleep is a good enough sync mechanism for a simple test
        usleep(200);
        printf("I'm the child thread\n");
        return NULL;
}

int main() {
        pthread_t thread;
        auto create_ok = pthread_create(&thread, NULL, run, NULL);
        if (create_ok != 0) {
                err(create_ok, "Cannot create thread");
        }
        printf("Created child thread: %d\n", thread);

        ::sched_param sp{1};
        auto sched_ok = pthread_setschedparam(thread, SCHED_FIFO, &sp);
        if (sched_ok != 0) {
                err(sched_ok, "Unable to set SCHED_FIFO");
        }
        printf("Successfully set scheduling policy SCHED_FIFO\n");

        pthread_join(thread, NULL);
}

Dockerfile:

FROM vectorized/redpanda:v21.10.2

USER root

RUN apt-get update \
 && apt-get install -y build-essential gcc g++ libcap2-bin \
 && apt-get clean \
 && rm -rf /var/cache/apt/archives

COPY sched.cc sched.cc

RUN CXXFLAGS="-std=c++11 -pthread" make sched

ENTRYPOINT /sched

K8s manifest:

apiVersion: v1
kind: Pod
metadata:
  name: sched-test
spec:
  containers:
  - name: sched-test
    image: ghcr.io/erszcz/sched-test:latest
    command: ["bash", "-c", "sleep 3600s"]
    securityContext:
      runAsUser: 0
      runAsGroup: 0
      capabilities:
        add:
        - SYS_NICE

ghcr.io/erszcz/sched-test:latest is available publicly.

The GKE cluster I'm testing in is running Kubernetes 1.21.6-gke.1500 with worker nodes' image type "Container-Optimized OS with Docker (cos)". The test:

$ k -n sched-test apply -f sched-test.yml
$ k -n sched-test exec -it sched-test bash
root@sched-test:/# uname -a
Linux sched-test 5.4.144+ #1 SMP Wed Nov 3 09:56:10 PDT 2021 x86_64 GNU/Linux
root@sched-test:/# ./sched
Created child thread: 801761024
sched: Unable to set SCHED_FIFO: Operation not permitted
root@sched-test:/# ./sched
Created child thread: 431077120
sched: Unable to set SCHED_FIFO: Operation not permitted
root@sched-test:/# capsh --print | grep Current: | grep _nice
Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap=eip

The Google Cloud Security Overview points to the official Kubernetes guides "Set the security context for a Pod" and "Set capabilities for a Container", which I'm following above. According to capsh --print, the capability is correctly set.
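The same check can be made from inside a program instead of through capsh. The following is a rough sketch, assuming Linux, where /proc/self/status exposes the effective capability set as a hexadecimal CapEff mask and CAP_SYS_NICE is bit 23. The function names are hypothetical. The capability list printed by capsh above corresponds to the mask 00000000a88425fb.

```cpp
#include <cstdint>
#include <fstream>
#include <sstream>
#include <string>

// CAP_SYS_NICE as numbered in <linux/capability.h>.
constexpr int kCapSysNice = 23;

// Hypothetical helper: scan the text of /proc/<pid>/status for the
// "CapEff:" line and test whether capability bit `cap` is set.
// Returns false if the line is missing; std::stoull throws if the value
// after "CapEff:" is not hexadecimal.
bool status_has_effective_cap(const std::string& status_text, int cap) {
    std::istringstream in(status_text);
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, 7, "CapEff:") == 0) {
            // stoull skips the tab that follows the colon
            std::uint64_t mask = std::stoull(line.substr(7), nullptr, 16);
            return ((mask >> cap) & 1u) != 0;
        }
    }
    return false;
}

// Hypothetical helper: check the calling process itself.
bool current_process_has_sys_nice() {
    std::ifstream f("/proc/self/status");
    std::stringstream buf;
    buf << f.rdbuf();
    return status_has_effective_cap(buf.str(), kCapSysNice);
}
```

If current_process_has_sys_nice() returns true and pthread_setschedparam still fails with "Operation not permitted", the capability itself is unlikely to be the missing piece.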

For comparison, running the test program in a GCP VM (not a Kubernetes worker node) succeeds:

$ gcloud beta compute ssh ...elided...
erszcz@sched-test-1:~$ uname -a
Linux sched-test-1 4.19.0-18-cloud-amd64 #1 SMP Debian 4.19.208-1 (2021-09-29) x86_64 GNU/Linux
erszcz@sched-test-1:~$ sudo su
root@sched-test-1:/home/erszcz# docker run --cap-add=sys_nice --rm --user root ghcr.io/erszcz/sched-test:latest
Created child thread: -2063386880
Successfully set scheduling policy SCHED_FIFO
I'm the child thread
root@sched-test-1:/home/erszcz#

Do you have any suggestions on what might be wrong?

githubexplorer38237213271 commented 8 months ago

@erszcz I took a look; this seems to be caused by a Kubernetes permissions issue.
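One candidate worth ruling out, offered as a hypothesis that nothing in this thread confirms: on kernels built with CONFIG_RT_GROUP_SCHED, a task whose cgroup has a real-time budget of 0 is refused a real-time policy with EPERM even when CAP_SYS_NICE is present. The sketch below reads that budget from inside the pod. The path is an assumption (cgroup v1 with the cpu controller mounted at /sys/fs/cgroup/cpu; the file may live elsewhere or not exist at all), and the helper names are hypothetical.

```cpp
#include <cstddef>
#include <exception>
#include <fstream>
#include <string>

// Hypothetical helper: parse the contents of a cpu.rt_runtime_us file.
// Returns false if the text does not start with an integer.
bool parse_rt_runtime_us(const std::string& text, long& out) {
    try {
        std::size_t used = 0;
        out = std::stol(text, &used);
        return used > 0;
    } catch (const std::exception&) {
        return false;
    }
}

// A budget of 0 leaves the group no real-time runtime; a negative value
// means unlimited, and a positive value is the budget in microseconds.
bool rt_budget_allows_realtime(long rt_runtime_us) {
    return rt_runtime_us != 0;
}

// Hypothetical helper: read the budget of the current cgroup.
// ASSUMPTION: the default path is the usual cgroup v1 location; pass a
// different path if the hierarchy is mounted elsewhere.
// Returns false if the file is missing or unreadable.
bool container_rt_budget(
        long& out,
        const std::string& path = "/sys/fs/cgroup/cpu/cpu.rt_runtime_us") {
    std::ifstream f(path);
    std::string text;
    if (!f || !std::getline(f, text)) {
        return false;
    }
    return parse_rt_runtime_us(text, out);
}
```

If container_rt_budget succeeds and reports 0 inside the pod but a non-zero value in the plain docker run on the VM, that difference would explain the behavior seen above. If the file is absent, this hypothesis likely does not apply and the cause lies elsewhere.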
