Open erszcz opened 2 years ago
@erszcz - your reproduction case has an uninitialized variable in it, which is what causes it to fail on my machine (a hint is that it fails with "invalid argument" not "permission denied"):
::sched_param sp; // sp is uninit here
auto sched_ok = pthread_setschedparam(thread, SCHED_FIFO, &sp); // and passed to schedparam here
When I fix that (changing it to ::sched_param sp{1};) I am able to successfully run sched inside docker (as long as --cap-add=sys_nice is set, as you've done). Can you try it again with this fix?
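For reference, a minimal sketch of the same fix in a slightly more defensive form. The helper names (make_sched_param, explain_setschedparam, try_set_fifo) are hypothetical and do not come from Seastar or Redpanda; the point is only that the struct is value-initialized before use and that the return code of pthread_setschedparam distinguishes "invalid argument" from "operation not permitted".

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>
#include <sched.h>
#include <string>

// Hypothetical helper: build a fully initialized sched_param for a
// SCHED_FIFO priority, so no field is left indeterminate.
inline ::sched_param make_sched_param(int priority) {
    assert(priority >= sched_get_priority_min(SCHED_FIFO) &&
           priority <= sched_get_priority_max(SCHED_FIFO));
    ::sched_param sp{};            // value-initialize every field
    sp.sched_priority = priority;
    return sp;
}

// Hypothetical helper: pthread functions report failure through their
// return value, so interpret that value directly.
inline std::string explain_setschedparam(int rc) {
    switch (rc) {
    case 0:      return "ok";
    case EINVAL: return "invalid argument: check the policy and sched_priority";
    case EPERM:  return "operation not permitted: check CAP_SYS_NICE and RLIMIT_RTPRIO";
    case ESRCH:  return "no such thread";
    default:     return "unexpected error " + std::to_string(rc);
    }
}

// Try to switch `thread` to SCHED_FIFO at priority 1 and describe the result.
inline std::string try_set_fifo(pthread_t thread) {
    ::sched_param sp = make_sched_param(1);
    return explain_setschedparam(pthread_setschedparam(thread, SCHED_FIFO, &sp));
}
```

With an uninitialized sp the call would typically land in the EINVAL branch, which matches the "invalid argument" hint above; a missing capability would land in the EPERM branch instead.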
@travisdowns Thanks for spotting that, my bad. With this fix I managed to get the policy set correctly in a container. I'll followup with more details on our OpenShift Redpanda setup as I gather them.
Thanks Radek, looking forward to what you find!
@erszcz can you close this ticket if it solves your problem? It would be great to get your findings with OpenShift noted here too :)
- your reproduction case has an uninitialized variable in it,
nice @travisdowns I had reproduced this locally and was totally baffled!
Heh, me too: when it still failed outside of docker on bare metal, I reflexively ran it under valgrind, which I still find useful even in the face of ASAN since it can catch uninitialized reads, which ASAN can't.
@VadimPlh you might consider running your failing test with alien thread under valgrind. I think that I had at one point suggested valgrind but wasn't sure if it was necessary given our use of clang sanitizers. It sounds like it might be.
I've spent more time experimenting with this and cannot get SCHED_FIFO set in a pod in either GKE or OpenShift. I imagine the same would apply to Redpanda, given that my test program is actually extracted from Redpanda source code.
The test program:
// See Seastar - a Redpanda component - code at
// https://github.com/vectorizedio/seastar/blob/f8ec733c36f0829d56a17103c916154a946128be/src/core/reactor_backend.cc#L705
//
//void reactor_backend_epoll::start_tick() {
//    _task_quota_timer_thread = std::thread(&reactor_backend_epoll::task_quota_timer_thread_fn, this);
//    ::sched_param sp;
//    sp.sched_priority = 1;
//    auto sched_ok = pthread_setschedparam(_task_quota_timer_thread.native_handle(), SCHED_FIFO, &sp);
//    if (sched_ok != 0 && _r._id == 0) {
//        seastar_logger.warn("Unable to set SCHED_FIFO scheduling policy for timer thread; latency impact possible. Try adding CAP_SYS_NICE");
//    }
//}
#include <cstdio>
#include <cstdlib>
#include <err.h>
#include <pthread.h>
#include <unistd.h>
void* run(void*) {
    // sleep is a good enough sync mechanism for a simple test
    usleep(200);
    printf("I'm the child thread\n");
    return NULL;
}
int main() {
    pthread_t thread;
    auto create_ok = pthread_create(&thread, NULL, run, NULL);
    if (create_ok != 0) {
        err(create_ok, "Cannot create thread");
    }
    printf("Created child thread: %d\n", thread);
    ::sched_param sp{1};
    auto sched_ok = pthread_setschedparam(thread, SCHED_FIFO, &sp);
    if (sched_ok != 0) {
        err(sched_ok, "Unable to set SCHED_FIFO");
    }
    printf("Successfully set scheduling policy SCHED_FIFO\n");
    pthread_join(thread, NULL);
}
Dockerfile:
FROM vectorized/redpanda:v21.10.2
USER root
RUN apt-get update \
&& apt-get install -y build-essential gcc g++ libcap2-bin \
&& apt-get clean \
&& rm -rf /var/cache/apt/archives
COPY sched.cc sched.cc
RUN CXXFLAGS="-std=c++11 -pthread" make sched
ENTRYPOINT /sched
K8s manifest:
apiVersion: v1
kind: Pod
metadata:
  name: sched-test
spec:
  containers:
  - name: sched-test
    image: ghcr.io/erszcz/sched-test:latest
    command: ["bash", "-c", "sleep 3600s"]
    securityContext:
      runAsUser: 0
      runAsGroup: 0
      capabilities:
        add:
        - SYS_NICE
ghcr.io/erszcz/sched-test:latest is available publicly.
The GKE cluster I'm testing in is running Kubernetes 1.21.6-gke.1500 with worker nodes' image type "Container-Optimized OS with Docker (cos)". The test:
$ k -n sched-test apply -f sched-test.yml
$ k -n sched-test exec -it sched-test bash
root@sched-test:/# uname -a
Linux sched-test 5.4.144+ #1 SMP Wed Nov 3 09:56:10 PDT 2021 x86_64 GNU/Linux
root@sched-test:/# ./sched
Created child thread: 801761024
sched: Unable to set SCHED_FIFO: Operation not permitted
root@sched-test:/# ./sched
Created child thread: 431077120
sched: Unable to set SCHED_FIFO: Operation not permitted
root@sched-test:/# capsh --print | grep Current: | grep _nice
Current: cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_sys_nice,cap_mknod,cap_audit_write,cap_setfcap=eip
Google Cloud Security Overview points to the official K8s guides "Set the security context for a Pod" and "Set capabilities for a Container", which I'm trying to follow above. According to capsh --print, the capability is correctly set.
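One way to narrow this down, offered as a sketch and not as a confirmed diagnosis: capsh --print only shows the capability sets, while the kernel's permission check for SCHED_FIFO can also depend on RLIMIT_RTPRIO and, on kernels built with CONFIG_RT_GROUP_SCHED, on the real-time runtime budget of the container's cgroup. The helper names below (parse_cap_eff, has_capability, print_rt_diagnostics) are hypothetical, and the cgroup path assumes a cgroup v1 layout, which may not match a given node.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <sys/resource.h>

// CAP_SYS_NICE is capability number 23 in <linux/capability.h>.
constexpr int kCapSysNice = 23;

// Extract the effective capability mask from the text of /proc/<pid>/status.
// Returns false if no "CapEff:" line is present.
inline bool parse_cap_eff(const std::string& status_text, std::uint64_t& mask) {
    std::istringstream in(status_text);
    std::string line;
    while (std::getline(in, line)) {
        if (line.rfind("CapEff:", 0) == 0) {  // line starts with "CapEff:"
            mask = std::stoull(line.substr(7), nullptr, 16);
            return true;
        }
    }
    return false;
}

// Test a single capability bit in a capability mask.
inline bool has_capability(std::uint64_t mask, int cap) {
    assert(cap >= 0 && cap < 64);
    return ((mask >> cap) & 1u) != 0;
}

// Print the inputs to the real-time permission check that capsh does not show.
inline void print_rt_diagnostics(std::ostream& out) {
    std::ifstream status("/proc/self/status");
    std::stringstream buf;
    buf << status.rdbuf();
    std::uint64_t mask = 0;
    if (parse_cap_eff(buf.str(), mask)) {
        out << "CAP_SYS_NICE effective: " << has_capability(mask, kCapSysNice) << "\n";
    }
    struct rlimit lim{};
    if (getrlimit(RLIMIT_RTPRIO, &lim) == 0) {
        out << "RLIMIT_RTPRIO soft limit: " << lim.rlim_cur << "\n";
    }
    // Assumption: cgroup v1 mount point; the file is absent on kernels
    // without CONFIG_RT_GROUP_SCHED and on cgroup v2 hosts.
    std::ifstream rt("/sys/fs/cgroup/cpu/cpu.rt_runtime_us");
    std::string value;
    if (rt >> value) {
        out << "cpu.rt_runtime_us: " << value << "\n";
    }
}
```

Calling print_rt_diagnostics(std::cout) both inside the pod and in the plain Docker container on the VM would show which of these inputs differs between the two environments.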
For comparison, running the test program in a GCP VM (not a Kubernetes worker node) succeeds:
$ gcloud beta compute ssh ...elided...
erszcz@sched-test-1:~$ uname -a
Linux sched-test-1 4.19.0-18-cloud-amd64 #1 SMP Debian 4.19.208-1 (2021-09-29) x86_64 GNU/Linux
erszcz@sched-test-1:~$ sudo su
root@sched-test-1:/home/erszcz# docker run --cap-add=sys_nice --rm --user root ghcr.io/erszcz/sched-test:latest
Created child thread: -2063386880
Successfully set scheduling policy SCHED_FIFO
I'm the child thread
root@sched-test-1:/home/erszcz#
Do you have any suggestions on what might be wrong?
@erszcz I took a look; this seems to be caused by a Kubernetes permissions issue.
Version & Environment
Redpanda version: (use rpk version): official vectorized/redpanda:v21.10.2 container image
What went wrong?
We're running Redpanda as a container in OpenShift. We're getting logs from seastar about inability to set SCHED_FIFO.
What should have happened instead?
We expect Redpanda NOT to log the above message.
We've tried enabling the container to use CAP_SYS_NICE gated capabilities as suggested in the log message itself, or issues https://github.com/scylladb/scylla-operator/issues/107 and https://github.com/scylladb/seastar/issues/382, but the problem persists.
How to reproduce the issue?
I tried creating a test program to sanity check cloud / container based envs for the required capabilities, but I cannot get the warning to go away even in a local container, even though the capability is correctly enabled as evidenced by capsh --print. The steps to reproduce are here - https://gist.github.com/erszcz/5ceca0866df5748f9a3dda7654467f2d.
Questions
Unable to set SCHED_FIFO message logged? If so, could you suggest steps to fix the sanity-check program provided in the gist above or to enable the capability correctly?
JIRA Link: CORE-819