trampgeek / jobeinabox

The dockerfile and doc for building the Docker image JobeInABox
MIT License

cannot start `sh': Resource temporarily unavailable #15

Open jtuttas opened 1 year ago

jtuttas commented 1 year ago

Hello, we have been using JOBE for some years now, and we are now switching to a new (virtual) server. I start the JOBE Docker container the same way as on the old server!

But when launching a CodeRunner question I get the following error message:

/var/www/html/jobe/application/libraries/../../runguard/runguard: cannot start `sh': Resource temporarily unavailable
Try `/var/www/html/jobe/application/libraries/../../runguard/runguard --help' for more information.

Do you have any idea what could be going wrong?

best regards

jtuttas

trampgeek commented 1 year ago

I have just this week become aware of a significant issue with jobeinabox when it's running on a host that's providing other services. Although I haven't yet confirmed it, I think that can give rise to the error message you're quoting.

I've also seen that error message arising on CentOS and RHEL servers. Those OSs don't play well with Docker.

So, can you please advise what OS the host is running, what its configuration is (memory size, number of CPUs), and whether the host is providing other services as well as Jobe.

jtuttas commented 1 year ago

Hello, the Docker host is an Ubuntu 20.04 VM with 6 virtual CPUs and 10 GB RAM. The hypervisor is, I think, Proxmox.

Besides the Docker setup, the host also has a natively installed JOBE (which runs correctly, on a different port of course) and gitlab-ce! But we want to switch from the natively installed JOBE server to the Docker container for security reasons (otherwise, for example, a Python program can read the host file system!).

trampgeek commented 1 year ago

OK, thanks for the info. Would you mind opening a shell inside the container, navigating to /var/www/html/jobe, and typing the following command, please:

python3 testsubmit.py

Please paste the entire output into your response or attach it as a file.

Also, could you let me know the output from the following two commands when executed on the host, please:

grep ":99" /etc/passwd

ps l -u "996,997,998,999"
trampgeek commented 1 year ago

As an aside, with regard to security issues, Jobe is pretty secure, and has had several security audits over the years. Is it really a problem that the jobs can read the file system? I'd hope that no sensitive information was in world-readable files.

A jobe task has significantly fewer rights than a logged-in user on the system (e.g. limited processes, memory, and job time).

jtuttas commented 1 year ago

Hello, when I run python3 testsubmit.py I get a huge list of failed tests, like

***************** FAILED TEST ******************

{'run_id': None, 'outcome': 11, 'cmpinfo': "/var/www/html/jobe/application/libraries/../../runguard/runguard: cannot start `sh': Resource temporarily unavailable\nTry `/var/www/html/jobe/application/libraries/../../runguard/runguard --help' for more information.\n", 'stdout': '', 'stderr': ''}
Valid Python3
Jobe result: Compile error

Compiler output:
/var/www/html/jobe/application/libraries/../../runguard/runguard: cannot start `sh': Resource temporarily unavailable
Try `/var/www/html/jobe/application/libraries/../../runguard/runguard --help' for more information.

************************************************

jobe.txt

Attached is the complete report!

And this is the output of the other two commands:

root@e38922e4d00a:/var/www/html/jobe# grep ":99" /etc/passwd
jobe:x:999:999:Jobe user. Provides home for runs and files.:/home/jobe:/bin/false
jobe00:x:998:999:Jobe server task runner:/home/jobe00:/bin/false
jobe01:x:997:999:Jobe server task runner:/home/jobe01:/bin/false
jobe02:x:996:999:Jobe server task runner:/home/jobe02:/bin/false
jobe03:x:995:999:Jobe server task runner:/home/jobe03:/bin/false
jobe04:x:994:999:Jobe server task runner:/home/jobe04:/bin/false
jobe05:x:993:999:Jobe server task runner:/home/jobe05:/bin/false
jobe06:x:992:999:Jobe server task runner:/home/jobe06:/bin/false
jobe07:x:991:999:Jobe server task runner:/home/jobe07:/bin/false
root@e38922e4d00a:/var/www/html/jobe# ps l -u "996,997,998,999"
F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND

Well, as for the security issue, yes, I believe that Jobe is secure, but it's hard to say whether there aren't any files with world-read permissions. That's why I prefer to have everything inside a Docker container!

trampgeek commented 1 year ago

This is very interesting. You're the second person reporting this problem and the symptoms are almost identical. Did you notice that the Java jobs ran fine? And that 3 of the 10 identical C jobs thrown at the server in quick succession passed, whereas the same job had failed multiple times earlier in the test run?

In discussion with the other guy reporting this, I've identified the cause as the process limit. Java uses a large process limit (ulimit NPROC) of several hundred, whereas the default for most jobs is 30. It turns out that the C jobs (and probably all the others too) run fine if you raise the process limit. It also seems that some of the higher-numbered jobe users aren't affected, but because Jobe users are allocated in order jobe00, jobe01 etc, the higher-numbered ones never get to run jobs unless you toss a lot of jobs at Jobe in a short space of time (which is why 3 of the 10 ran, or so I conjecture).

I had been theorising that, because the user namespaces in the container are shared with the host (unless you're running the docker daemon with isolated user namespaces - do you know?), there must be processes running on the host with the same UIDs as those of the jobe users in the container. You do indeed have such users - the ones created by jobe on the host - but there's no way they can be clocking up sufficient processes to block the ones in the container (if that's even possible - I'm not sure how ulimits are enforced within the container).
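A quick way to check whether any UIDs are heavily used by host processes is something like the following (assuming a standard procps ps), which counts the processes owned by each numeric UID:

ps -eo uid= | sort -n | uniq -c | sort -rn | head

As far as I understand it, every process a UID owns anywhere on the shared kernel counts towards that UID's RLIMIT_NPROC, so a big count against one of the jobe UIDs (991-999 in your container listing above) would explain the fork failures.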

I'm now frankly baffled and not sure what to suggest. The other guy is also running a host system of Ubuntu 20.04, but that should be fine. It could be significant that I only recently changed the base OS for jobeinabox from Ubuntu 20.04 to Ubuntu 22.04, but I wouldn't have thought that would matter. And I just now fired up jobeinabox on a 20.04 host with no problems.

Clutching at straws here, but ... would you mind building your own jobeinabox image from scratch (see here) and confirming nothing changes, please? And then, to be quite sure, editing the Dockerfile to change the base OS back to Ubuntu 20.04 and building with that? But you'll also need to edit the line

openjdk-18-jdk \

to

openjdk-16-jdk \

I'd really like to get to the heart of this - two such reports in the space of a week suggests something "interesting" has happened.

jtuttas commented 1 year ago

OK, I rebuilt the image from the Dockerfile, which I had downgraded to Ubuntu 20.04 and JDK 16. Unfortunately with no effect; see the output of python3 testsubmit.py!

jobe2.txt

trampgeek commented 1 year ago

Many thanks for that - it's very helpful to at least eliminate that possibility, but it doesn't leave us any closer to an explanation.

The dialogue with the other guy is here if you want to see where the issue is at. You'll see that they just switched to running jobeinabox on an AWS server.

I'm frankly baffled, with little to suggest. I can't debug if I can't replicate the problem.

You could perhaps check that the problem goes away if you change line 44 of jobe/application/libraries/LanguageTask.php from

'numprocs'      => 30,

to

'numprocs'      => 200,

But even if it does (as I expect), I'd be unhappy running a server with such an inexplicable behaviour. The tests should nearly all run with a value of 2.

The only thing I can see in common between your system and the other guy's is that you're both running additional services on the same host - they're running Moodle, you're running gitlab. Are you perhaps able to stop gitlab (and any other docker processes) and check Jobe again? This isn't quite as silly as it sounds - they do all share the same UID space.

Do you have any other suggestions yourself?

trampgeek commented 1 year ago

I've thought of one other test you could do, if you wouldn't mind, please?

In the container:

apt update
apt install nano
nano /etc/login.defs

Uncomment the line

#SYS_UID_MAX               999

and set the value to 800 instead.

Similarly uncomment and set SYS_GID_MAX to 800.
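So, after editing, the relevant lines in /etc/login.defs should end up looking roughly like this:

SYS_UID_MAX               800
SYS_GID_MAX               800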

Then:

cd /var/www/html/jobe
./install --purge
cat /etc/passwd  # Confirm that UIDs are all <= 800.
python3 testsubmit.py

Does it work OK now?

jtuttas commented 1 year ago

I've tried the first approach and set numprocs to 200, and this looks really good. I only got 3 errors! I will continue with your second approach!

jobe3.txt

jtuttas commented 1 year ago

Now with your second approach...

GREAT!! No errors were reported (but I still have numprocs set to 200; shall I reduce it?)

jobe4.txt

trampgeek commented 1 year ago

Yes please - I'd like to be reassured that you still get no errors with it set back to 30.

Many thanks for your debugging support. This is definitely progress, of a sort. If you still get no errors, as I would hope, I'd be fairly confident that you have a working Jobe server. But I really would like to know why those top few system-level user IDs are causing problems. The implication is that something else in your system is using them too. And running lots of processes with them. But what? They're not being used by the host (we checked the password file and did a ps to be sure) so I can only assume that another container is using them.

Are you able to list all running containers (docker ps), then exec the following two commands in all of them to see if we can find the culprit?

grep ":99" /etc/passwd
ps l -u "995,996,997,998"

If that last command throws up a list of processes within any of the containers, we've found the problem!
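If it's easier, a loop over all running containers along the following lines should do the job (assuming each container has grep and a procps-style ps available inside it):

for c in $(docker ps -q); do echo "== $c =="; docker exec "$c" grep ":99" /etc/passwd; docker exec "$c" ps l -u "995,996,997,998"; done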

jtuttas commented 1 year ago

OK, I reduced numprocs to 30 again, and I have NO errors when I run python3 testsubmit.py.

I have 3 running Docker containers: a GitLab runner, a self-developed application, and Jobe. Here are the results of the commands:

gitlab runner:

root@1e7a7e4bf873:/# grep ":99" /etc/passwd
gitlab-runner:x:999:999:GitLab Runner:/home/gitlab-runner:/bin/bash
root@1e7a7e4bf873:/# ps l -u "995,996,997,998"
F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND

Self developed Application

root@07b52853eb4b:/usr/src/app# grep ":99" /etc/passwd
root@07b52853eb4b:/usr/src/app# ps l -u "995,996,997,998"
F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND

JOBE

root@abfa600fcac8:/# grep ":99" /etc/passwd
root@abfa600fcac8:/# ps l -u "995,996,997,998"
F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND

jtuttas commented 1 year ago

By the way, when I run these commands on the host system, I get the output below. On the host system, gitlab and jobe are installed! Maybe this could help too!

But I'm happy to have jobeinabox running fine. It would be nice to know what I have to change in the Dockerfile so that I can build a working image :-)!

root@mmbbs-dev:~/jobe# grep ":99" /etc/passwd
systemd-coredump:x:999:999:systemd Core Dumper:/:/usr/sbin/nologin
gitlab-www:x:998:998::/var/opt/gitlab/nginx:/bin/false
git:x:997:997::/var/opt/gitlab:/bin/sh
gitlab-redis:x:996:996::/var/opt/gitlab/redis:/bin/false
gitlab-psql:x:995:995::/var/opt/gitlab/postgresql:/bin/sh
registry:x:994:994::/var/opt/gitlab/registry:/bin/sh
gitlab-prometheus:x:993:993::/var/opt/gitlab/prometheus:/bin/sh
gitlab-runner:x:992:992:GitLab Runner:/home/gitlab-runner:/bin/bash
jobe:x:991:991:Jobe user. Provides home for runs and files.:/home/jobe:/bin/false
jobe00:x:990:991:Jobe server task runner:/home/jobe00:/bin/false
jobe01:x:989:991:Jobe server task runner:/home/jobe01:/bin/false
jobe02:x:988:991:Jobe server task runner:/home/jobe02:/bin/false
jobe03:x:987:991:Jobe server task runner:/home/jobe03:/bin/false
jobe04:x:986:991:Jobe server task runner:/home/jobe04:/bin/false
jobe05:x:985:991:Jobe server task runner:/home/jobe05:/bin/false
jobe06:x:984:991:Jobe server task runner:/home/jobe06:/bin/false
jobe07:x:983:991:Jobe server task runner:/home/jobe07:/bin/false
jobe08:x:982:991:Jobe server task runner:/home/jobe08:/bin/false
jobe09:x:981:991:Jobe server task runner:/home/jobe09:/bin/false
root@mmbbs-dev:~/jobe#  ps l -u "995,996,997,998"
F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
1   995   508 20389  20   0 2739324 42332 ep_pol Ss  ?          0:00 postgres: gitlab gitlabhq_production [local] id
1   995   664 20389  20   0 2744472 55356 ep_pol Ss  ?          0:00 postgres: gitlab gitlabhq_production [local] id
5   998  2026 20353  20   0  36484  8408 ep_pol S    ?          0:00 nginx: worker process
5   998  2027 20353  20   0  36484  8404 ep_pol S    ?          0:00 nginx: worker process
5   998  2028 20353  20   0  36484 19076 ep_pol S    ?          0:13 nginx: worker process
5   998  2029 20353  20   0  36484  8404 ep_pol S    ?          0:00 nginx: worker process
5   998  2030 20353  20   0  36484  8408 ep_pol S    ?          0:00 nginx: worker process
5   998  2031 20353  20   0  36484  8408 ep_pol S    ?          0:00 nginx: worker process
5   998  2032 20353  20   0  32232  4408 ep_pol S    ?          0:00 nginx: cache manager process
1   995  2148 20389  20   0 2743676 47008 ep_pol Ss  ?          0:00 postgres: gitlab gitlabhq_production [local] id
1   995  2651 20389  20   0 2744236 52716 ep_pol Ss  ?          0:00 postgres: gitlab gitlabhq_production [local] id
1   995  2731 20389  20   0 2739288 38620 ep_pol Ss  ?          0:00 postgres: gitlab gitlabhq_production [local] id
1   995  3352 20389  20   0 2739428 33504 ep_pol Ss  ?          0:00 postgres: gitlab gitlabhq_production [local] id
1   995  3489 20389  20   0 2736628 16100 ep_pol Ss  ?          0:00 postgres: gitlab gitlabhq_production [local] id
1   997  9321 22745  20   0 1422864 875864 poll_s Sl ?          0:33 puma: cluster worker 0: 22745 [gitlab-puma-work
1   997  9481 22745  20   0 1461264 831804 poll_s Sl ?          0:38 puma: cluster worker 5: 22745 [gitlab-puma-work
1   997  9594 22745  20   0 1435152 865784 poll_s Sl ?          0:33 puma: cluster worker 3: 22745 [gitlab-puma-work
1   997  9714 22745  20   0 1461144 883912 poll_s Sl ?          0:35 puma: cluster worker 1: 22745 [gitlab-puma-work
1   997  9826 22745  20   0 1391512 869404 poll_s Sl ?          0:28 puma: cluster worker 4: 22745 [gitlab-puma-work
1   997  9940 22745  20   0 1408920 866676 poll_s Sl ?          0:32 puma: cluster worker 2: 22745 [gitlab-puma-work
4   997 20287   341  20   0 1388668  236 futex_ Ssl  ?         29:52 /opt/gitlab/embedded/bin/gitaly-wrapper /opt/gi
0   997 20293 20287  20   0 2017308 113304 futex_ Sl ?        1319:47 /opt/gitlab/embedded/bin/gitaly /var/opt/gitla
4   997 20308   350  20   0 205716 39804 poll_s Ssl  ?        2769:30 /opt/gitlab/embedded/bin/ruby /opt/gitlab/embe
4   997 20320   346  20   0 2030516 37140 futex_ Ssl ?        700:10 /opt/gitlab/embedded/bin/gitlab-workhorse -list
4   995 20369   349  20   0 1551748 14416 ep_pol Ssl ?        567:58 /opt/gitlab/embedded/bin/postgres_exporter --we
0   997 20382 20293  20   0 2744368 116648 poll_s Sl ?        302:57 ruby /opt/gitlab/embedded/service/gitaly-ruby/b
0   997 20383 20293  20   0 3090392 155384 poll_s Sl ?        313:13 ruby /opt/gitlab/embedded/service/gitaly-ruby/b
4   995 20389   338  20   0 2735448 89420 poll_s Ss  ?          5:09 /opt/gitlab/embedded/bin/postgres -D /var/opt/g
1   995 20391 20389  20   0 2735652 115620 ep_pol Ss ?          2:17 postgres: checkpointer
1   995 20392 20389  20   0 2735448 21824 ep_pol Ss  ?          2:24 postgres: background writer
1   995 20393 20389  20   0 2735448 19772 ep_pol Ss  ?          3:02 postgres: walwriter
1   995 20394 20389  20   0 2736148 6080 ep_pol Ss   ?          3:00 postgres: autovacuum launcher
1   995 20395 20389  20   0  18084  3444 ep_pol Ss   ?         54:34 postgres: stats collector
1   995 20396 20389  20   0 2735988 4684 ep_pol Ss   ?          0:08 postgres: logical replication launcher
4   996 20500   339  20   0 113904 25856 ep_pol Ssl  ?        4096:30 /opt/gitlab/embedded/bin/redis-server 127.0.0.
4   996 20506   352  20   0 1550040 14684 futex_ Ssl ?        193:59 /opt/gitlab/embedded/bin/redis_exporter --web.l
1   995 20587 20389  20   0 2764272 62068 ep_pol Ss  ?        1128:58 postgres: gitlab-psql gitlabhq_production [loc
1   995 20603 20389  20   0 2742096 66980 ep_pol Ss  ?        3244:08 postgres: gitlab gitlabhq_production [local] i
4   997 22717   337  20   0 111276 30640 poll_s Ssl  ?          5:16 ruby /opt/gitlab/embedded/service/gitlab-rails/
0   997 22723 22717  20   0 2226812 1055140 poll_s Sl ?       17328:24 sidekiq 6.2.2 queues:authorized_project_updat
4   997 22745   348  20   0 980972 704336 poll_s Ssl ?        126:19 puma 5.3.2 (unix:///var/opt/gitlab/gitlab-rails
1   995 22947 20389  20   0 2739376 24132 ep_pol Ss  ?          9:39 postgres: gitlab gitlabhq_production [local] id
1   995 22948 20389  20   0 2740792 32860 ep_pol Ss  ?          0:17 postgres: gitlab gitlabhq_production [local] id
1   995 29228 20389  20   0 2744796 58448 ep_pol Ss  ?          0:00 postgres: gitlab gitlabhq_production [local] id
1   995 31922 20389  20   0 2739676 43200 ep_pol Ss  ?          0:00 postgres: gitlab gitlabhq_production [local] id
trampgeek commented 1 year ago

Good news that you still get no errors after reducing numproc to 30 again. I think you could comfortably use that jobeinabox container if you wanted, but it's really no solution to the problem as you'd have to repeat all the UID fiddling every time you ran up a new container. I need to understand exactly what's happening and fix it.

None of your containers seem to be using any of the default Jobe UIDs. However, I do note that gitlab-runner is using 999 which is the same UID that jobe uses. I don't see how this could cause the problem, but I will pore again over the runguard code. No more time today, though - I have some "real work" to do :-)

Are you easily able to fire up a new jobeinabox container and check if it runs OK while the gitlab container is stopped? No problem if not - you've given me something to ponder, regardless.

Many thanks again for the debugging help. Stay tuned - I hope to come back within a day or two.

jtuttas commented 1 year ago

OK, thanks a lot for your help!

trampgeek commented 1 year ago

Aha. That's it! All suddenly is clear. nginx is using UID 998 - same as jobe00. It creates lots of worker processes, so jobe00 doesn't get a chance.
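To see it in the numbers: running something like

ps -eo uid= | awk '$1 == 998' | wc -l

on the host counts the processes owned by UID 998, and in a shared user namespace every one of those nginx workers counts against the same per-UID process limit that runguard sets when it runs a job as jobe00.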

We had a misunderstanding earlier when I asked you to run that command on the host. You gave me the output

root@e38922e4d00a:/var/www/html/jobe# ps l -u "996,997,998,999"
F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND

I failed to notice that you ran the commands in the container, not on the host!

Many thanks - I now know exactly what the problem is. All I need to do now is figure out how to fix it. That requires some thought.

Stay tuned.

trampgeek commented 1 year ago

I've pushed a change to Jobe to allow customised setting of the UIDs allocated to the jobe processes. I've also pushed a new version of the Dockerfile and updated the latest jobeinabox image on Docker Hub to make use of the new functionality.

Are you able to check with jobeinabox:latest to confirm that the problem has been solved, please?

Thanks again for the great help in reporting and debugging.