jtuttas opened this issue 1 year ago
I have just this week become aware of a significant issue with jobeinabox if running it on a host that's providing other services. Although I haven't yet confirmed it, I think that can give rise to the error message you're quoting.
I've also seen that error message arising on CentOS and RHEL servers. Those OSs don't play well with Docker.
So, can you please advise what OS the host is running, what its configuration is (memory size, number of CPUs), and whether the host is providing other services as well as Jobe.
Hello, the Docker host is an Ubuntu 20.04 VM with 6 virtual CPUs and 10 GB RAM. The hypervisor is, I think, Proxmox.
Besides the Docker setup there is a natively installed JOBE (which runs correctly, on a different port of course) and gitlab-ce! But we want to switch from the natively installed JOBE server to the Docker container for security reasons (otherwise, for example, a Python program can read the host file system!).
OK, thanks for the info. Would you mind opening a shell inside the container, navigating to /var/www/html/jobe, and typing the following command, please:
python3 testsubmit.py
Please paste the entire output into your response or attach it as a file.
Also, could you let me know the output from the following two commands when executed on the host, please:
grep ":99" /etc/passwd
ps l -u "996,997,998,999"
As an aside, with regard to security issues, Jobe is pretty secure, and has had several security audits over the years. Is it really a problem that the jobs can read the file system? I'd hope that no sensitive information was in world-readable files.
A jobe task has significantly less rights than a logged-in user on the system (e.g. limited processes, memory, job time).
Hello, when I run python3 testsubmit.py I get a huge list of failed tests, like:
***************** FAILED TEST ******************
{'run_id': None, 'outcome': 11, 'cmpinfo': "/var/www/html/jobe/application/libraries/../../runguard/runguard: cannot start `sh': Resource temporarily unavailable\nTry `/var/www/html/jobe/application/libraries/../../runguard/runguard --help' for more information.\n", 'stdout': '', 'stderr': ''}
Valid Python3
Jobe result: Compile error
Compiler output:
/var/www/html/jobe/application/libraries/../../runguard/runguard: cannot start `sh': Resource temporarily unavailable
Try `/var/www/html/jobe/application/libraries/../../runguard/runguard --help' for more information.
************************************************
Attached is the complete report!
And this is the output of the other two commands:
root@e38922e4d00a:/var/www/html/jobe# grep ":99" /etc/passwd
jobe:x:999:999:Jobe user. Provides home for runs and files.:/home/jobe:/bin/false
jobe00:x:998:999:Jobe server task runner:/home/jobe00:/bin/false
jobe01:x:997:999:Jobe server task runner:/home/jobe01:/bin/false
jobe02:x:996:999:Jobe server task runner:/home/jobe02:/bin/false
jobe03:x:995:999:Jobe server task runner:/home/jobe03:/bin/false
jobe04:x:994:999:Jobe server task runner:/home/jobe04:/bin/false
jobe05:x:993:999:Jobe server task runner:/home/jobe05:/bin/false
jobe06:x:992:999:Jobe server task runner:/home/jobe06:/bin/false
jobe07:x:991:999:Jobe server task runner:/home/jobe07:/bin/false
root@e38922e4d00a:/var/www/html/jobe# ps l -u "996,997,998,999"
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
Well, as for the security issue: yes, I believe that Jobe is secure, but it's hard to say whether there aren't any files with world read permissions. That's why I prefer to have everything inside a Docker container!
This is very interesting. You're the second person reporting this problem and the symptoms are almost identical. Did you notice that the Java jobs ran fine? And that 3 of the 10 identical C jobs thrown at the server in quick succession passed, whereas the same job had failed multiple times earlier in the test run?
In discussion with the other guy reporting this problem I've identified the problem as being the process limit. Java uses a large process limit (ulimit NPROC) of several hundred whereas the default for most jobs is 30. It turns out that the C jobs (and probably all the others too) run fine if you raise the process limit. It also seems that some of the higher-numbered jobe users aren't affected but because Jobe users are allocated in order jobe00, jobe01 etc, the higher-numbered ones never get to run jobs unless you toss a lot of jobs at Jobe in a short space of time (which is why 3 of the 10 ran, or so I conjecture).
I had been theorising that because the user namespaces in the container are shared with the host (unless you're running the docker daemon with isolated user namespaces - do you know?), there must be processes running on the host with the same UID as that of the jobe users in the container. You do indeed have such users - the ones created by jobe on the host - but there's no way they can be clocking up sufficient processes to block the ones in the container (if that's even possible - I'm not sure how ulimits are enforced within the container).
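If you're not sure whether the daemon is remapping user namespaces, a quick check on the host should tell us (just a sketch; the exact output format varies between Docker versions):
# If remapping is enabled, the security options include a userns entry
# and /etc/docker/daemon.json contains a "userns-remap" key.
docker info --format '{{.SecurityOptions}}'
grep -s userns-remap /etc/docker/daemon.json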
I'm now frankly baffled and not sure what to suggest. The other guy is also running a host system of Ubuntu 20.04, but that should be fine. It could be significant that I only recently changed the base OS for jobeinabox from Ubuntu 20.04 to Ubuntu 22.04, but I wouldn't have thought that would matter. And I just now fired up jobeinabox on a 20.04 host with no problems.
Clutching at straws here but ... would you mind building your own jobeinabox image from scratch (see here) and confirming that nothing changes, please? And then, to be quite sure, editing the Dockerfile to change the base OS back to Ubuntu 20.04 and building with that? But you'll also need to edit the line
openjdk-18-jdk \
to
openjdk-16-jdk \
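If it's easier, the two edits and the rebuild can be scripted (just a sketch; it assumes the Dockerfile's base-image line currently reads FROM ubuntu:22.04 and that you run this from the jobeinabox directory):
# Downgrade the base image and the JDK, then rebuild (the tag name is arbitrary).
sed -i 's/ubuntu:22.04/ubuntu:20.04/' Dockerfile
sed -i 's/openjdk-18-jdk/openjdk-16-jdk/' Dockerfile
docker build -t jobeinabox:2004test .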
I'd really like to get to the heart of this - two such reports in the space of a week suggests something "interesting" has happened.
OK, I rebuilt the image from the Dockerfile, which I downgraded to Ubuntu 20.04 and JDK 16. Unfortunately with no effect; see the output of python3 testsubmit.py!
Many thanks for that - it's very helpful to at least eliminate that possibility, but it doesn't leave us any closer to an explanation.
The dialogue with the other guy is here if you want to see where the issue is at. You'll see that they just switched to running jobeinabox on an AWS server.
I'm frankly baffled, with little to suggest. I can't debug if I can't replicate the problem.
You could perhaps check that the problem goes away if you change line 44 of jobe/application/libraries/LanguageTask.php from
'numprocs' => 30,
to
'numprocs' => 200,
But even if it does (as I expect), I'd be unhappy running a server with such an inexplicable behaviour. The tests should nearly all run with a value of 2.
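If you'd rather not edit the file by hand, a one-liner inside the container should do it (a sketch, assuming the line still reads exactly 'numprocs' => 30,):
# Raise the per-job process limit from 30 to 200.
sed -i "s/'numprocs' => 30,/'numprocs' => 200,/" /var/www/html/jobe/application/libraries/LanguageTask.php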
The only thing I can see in common between your system and the other guy's is that you're both running additional services in containers - they're running Moodle, you're running gitlab. Are you perhaps able to stop gitlab (and any other Docker processes) and check Jobe again? This isn't quite as silly as it sounds - they do all share the same UID space.
Do you have any other suggestions yourself?
I've thought of one other test you could do, if you wouldn't mind, please?
In the container:
apt update
apt install nano
nano /etc/login.defs
Uncomment the line
#SYS_UID_MAX 999
and set the value to 800 instead.
Similarly uncomment and set SYS_GID_MAX to 800.
Then:
cd /var/www/html/jobe
./install --purge
cat /etc/passwd # Confirm that UIDs are all <= 800.
python3 testsubmit.py
Does it work OK now?
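For reference, the same steps can be scripted without an editor (just a sketch; it assumes GNU sed and the stock Ubuntu layout of /etc/login.defs inside the container):
# Cap system UIDs/GIDs at 800, then recreate the jobe users and retest.
sed -i 's/^#\?SYS_UID_MAX.*/SYS_UID_MAX  800/' /etc/login.defs
sed -i 's/^#\?SYS_GID_MAX.*/SYS_GID_MAX  800/' /etc/login.defs
cd /var/www/html/jobe && ./install --purge
grep jobe /etc/passwd   # all jobe UIDs should now be <= 800
python3 testsubmit.py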
I've tried the first approach and set numprocs to 200, and this looks really good. I only got 3 errors! I will continue with your second approach!
Now with your second approach...
GREAT!! No errors were reported (but I still have numprocs set to 200; shall I reduce it?)
Yes please - I'd like to be reassured that you still get no errors with it set back to 30.
Many thanks for your debugging support. This is definitely progress, of a sort. If you still get no errors, as I would hope, I'd be fairly confident that you have a working Jobe server. But I really would like to know why those top few system-level user IDs are causing problems. The implication is that something else in your system is using them too. And running lots of processes with them. But what? They're not being used by the host (we checked the password file and did a ps to be sure) so I can only assume that another container is using them.
Are you able to list all running containers (docker ps), then exec the following two commands in all of them to see if we can find the culprit?
grep ":99" /etc/passwd
ps l -u "995,996,997,998"
If that last command throws up a list of processes within any of the containers, we've found the problem!
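A quick way to run both checks across every running container (a sketch; it assumes grep and ps are available inside each container, which isn't always true of minimal images):
# Run the two checks in every running container.
for c in $(docker ps -q); do
    echo "=== container $c ==="
    docker exec "$c" grep ":99" /etc/passwd
    docker exec "$c" ps l -u "995,996,997,998"
done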
OK, I reduced numprocs to 30 again, and have NO errors when I run python3 testsubmit.py.
I have 3 running Docker containers: a GitLab runner, a self-developed application, and Jobe. Here are the results of the commands:
gitlab runner:
root@1e7a7e4bf873:/# grep ":99" /etc/passwd
gitlab-runner:x:999:999:GitLab Runner:/home/gitlab-runner:/bin/bash
root@1e7a7e4bf873:/# ps l -u "995,996,997,998"
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
Self developed Application
root@07b52853eb4b:/usr/src/app# grep ":99" /etc/passwd
root@07b52853eb4b:/usr/src/app# ps l -u "995,996,997,998"
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
JOBE
root@abfa600fcac8:/# grep ":99" /etc/passwd
root@abfa600fcac8:/# ps l -u "995,996,997,998"
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
By the way, when I run these commands on the host system, I get the output below. On the host system, gitlab and jobe are installed natively! Maybe this could help too!
But I'm happy to have jobeinabox running fine. It would be nice to know what I have to change in the Dockerfile so that I can build a working image :-)!
root@mmbbs-dev:~/jobe# grep ":99" /etc/passwd
systemd-coredump:x:999:999:systemd Core Dumper:/:/usr/sbin/nologin
gitlab-www:x:998:998::/var/opt/gitlab/nginx:/bin/false
git:x:997:997::/var/opt/gitlab:/bin/sh
gitlab-redis:x:996:996::/var/opt/gitlab/redis:/bin/false
gitlab-psql:x:995:995::/var/opt/gitlab/postgresql:/bin/sh
registry:x:994:994::/var/opt/gitlab/registry:/bin/sh
gitlab-prometheus:x:993:993::/var/opt/gitlab/prometheus:/bin/sh
gitlab-runner:x:992:992:GitLab Runner:/home/gitlab-runner:/bin/bash
jobe:x:991:991:Jobe user. Provides home for runs and files.:/home/jobe:/bin/false
jobe00:x:990:991:Jobe server task runner:/home/jobe00:/bin/false
jobe01:x:989:991:Jobe server task runner:/home/jobe01:/bin/false
jobe02:x:988:991:Jobe server task runner:/home/jobe02:/bin/false
jobe03:x:987:991:Jobe server task runner:/home/jobe03:/bin/false
jobe04:x:986:991:Jobe server task runner:/home/jobe04:/bin/false
jobe05:x:985:991:Jobe server task runner:/home/jobe05:/bin/false
jobe06:x:984:991:Jobe server task runner:/home/jobe06:/bin/false
jobe07:x:983:991:Jobe server task runner:/home/jobe07:/bin/false
jobe08:x:982:991:Jobe server task runner:/home/jobe08:/bin/false
jobe09:x:981:991:Jobe server task runner:/home/jobe09:/bin/false
root@mmbbs-dev:~/jobe# ps l -u "995,996,997,998"
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
1 995 508 20389 20 0 2739324 42332 ep_pol Ss ? 0:00 postgres: gitlab gitlabhq_production [local] id
1 995 664 20389 20 0 2744472 55356 ep_pol Ss ? 0:00 postgres: gitlab gitlabhq_production [local] id
5 998 2026 20353 20 0 36484 8408 ep_pol S ? 0:00 nginx: worker process
5 998 2027 20353 20 0 36484 8404 ep_pol S ? 0:00 nginx: worker process
5 998 2028 20353 20 0 36484 19076 ep_pol S ? 0:13 nginx: worker process
5 998 2029 20353 20 0 36484 8404 ep_pol S ? 0:00 nginx: worker process
5 998 2030 20353 20 0 36484 8408 ep_pol S ? 0:00 nginx: worker process
5 998 2031 20353 20 0 36484 8408 ep_pol S ? 0:00 nginx: worker process
5 998 2032 20353 20 0 32232 4408 ep_pol S ? 0:00 nginx: cache manager process
1 995 2148 20389 20 0 2743676 47008 ep_pol Ss ? 0:00 postgres: gitlab gitlabhq_production [local] id
1 995 2651 20389 20 0 2744236 52716 ep_pol Ss ? 0:00 postgres: gitlab gitlabhq_production [local] id
1 995 2731 20389 20 0 2739288 38620 ep_pol Ss ? 0:00 postgres: gitlab gitlabhq_production [local] id
1 995 3352 20389 20 0 2739428 33504 ep_pol Ss ? 0:00 postgres: gitlab gitlabhq_production [local] id
1 995 3489 20389 20 0 2736628 16100 ep_pol Ss ? 0:00 postgres: gitlab gitlabhq_production [local] id
1 997 9321 22745 20 0 1422864 875864 poll_s Sl ? 0:33 puma: cluster worker 0: 22745 [gitlab-puma-work
1 997 9481 22745 20 0 1461264 831804 poll_s Sl ? 0:38 puma: cluster worker 5: 22745 [gitlab-puma-work
1 997 9594 22745 20 0 1435152 865784 poll_s Sl ? 0:33 puma: cluster worker 3: 22745 [gitlab-puma-work
1 997 9714 22745 20 0 1461144 883912 poll_s Sl ? 0:35 puma: cluster worker 1: 22745 [gitlab-puma-work
1 997 9826 22745 20 0 1391512 869404 poll_s Sl ? 0:28 puma: cluster worker 4: 22745 [gitlab-puma-work
1 997 9940 22745 20 0 1408920 866676 poll_s Sl ? 0:32 puma: cluster worker 2: 22745 [gitlab-puma-work
4 997 20287 341 20 0 1388668 236 futex_ Ssl ? 29:52 /opt/gitlab/embedded/bin/gitaly-wrapper /opt/gi
0 997 20293 20287 20 0 2017308 113304 futex_ Sl ? 1319:47 /opt/gitlab/embedded/bin/gitaly /var/opt/gitla
4 997 20308 350 20 0 205716 39804 poll_s Ssl ? 2769:30 /opt/gitlab/embedded/bin/ruby /opt/gitlab/embe
4 997 20320 346 20 0 2030516 37140 futex_ Ssl ? 700:10 /opt/gitlab/embedded/bin/gitlab-workhorse -list
4 995 20369 349 20 0 1551748 14416 ep_pol Ssl ? 567:58 /opt/gitlab/embedded/bin/postgres_exporter --we
0 997 20382 20293 20 0 2744368 116648 poll_s Sl ? 302:57 ruby /opt/gitlab/embedded/service/gitaly-ruby/b
0 997 20383 20293 20 0 3090392 155384 poll_s Sl ? 313:13 ruby /opt/gitlab/embedded/service/gitaly-ruby/b
4 995 20389 338 20 0 2735448 89420 poll_s Ss ? 5:09 /opt/gitlab/embedded/bin/postgres -D /var/opt/g
1 995 20391 20389 20 0 2735652 115620 ep_pol Ss ? 2:17 postgres: checkpointer
1 995 20392 20389 20 0 2735448 21824 ep_pol Ss ? 2:24 postgres: background writer
1 995 20393 20389 20 0 2735448 19772 ep_pol Ss ? 3:02 postgres: walwriter
1 995 20394 20389 20 0 2736148 6080 ep_pol Ss ? 3:00 postgres: autovacuum launcher
1 995 20395 20389 20 0 18084 3444 ep_pol Ss ? 54:34 postgres: stats collector
1 995 20396 20389 20 0 2735988 4684 ep_pol Ss ? 0:08 postgres: logical replication launcher
4 996 20500 339 20 0 113904 25856 ep_pol Ssl ? 4096:30 /opt/gitlab/embedded/bin/redis-server 127.0.0.
4 996 20506 352 20 0 1550040 14684 futex_ Ssl ? 193:59 /opt/gitlab/embedded/bin/redis_exporter --web.l
1 995 20587 20389 20 0 2764272 62068 ep_pol Ss ? 1128:58 postgres: gitlab-psql gitlabhq_production [loc
1 995 20603 20389 20 0 2742096 66980 ep_pol Ss ? 3244:08 postgres: gitlab gitlabhq_production [local] i
4 997 22717 337 20 0 111276 30640 poll_s Ssl ? 5:16 ruby /opt/gitlab/embedded/service/gitlab-rails/
0 997 22723 22717 20 0 2226812 1055140 poll_s Sl ? 17328:24 sidekiq 6.2.2 queues:authorized_project_updat
4 997 22745 348 20 0 980972 704336 poll_s Ssl ? 126:19 puma 5.3.2 (unix:///var/opt/gitlab/gitlab-rails
1 995 22947 20389 20 0 2739376 24132 ep_pol Ss ? 9:39 postgres: gitlab gitlabhq_production [local] id
1 995 22948 20389 20 0 2740792 32860 ep_pol Ss ? 0:17 postgres: gitlab gitlabhq_production [local] id
1 995 29228 20389 20 0 2744796 58448 ep_pol Ss ? 0:00 postgres: gitlab gitlabhq_production [local] id
1 995 31922 20389 20 0 2739676 43200 ep_pol Ss ? 0:00 postgres: gitlab gitlabhq_production [local] id
Good news that you still get no errors after reducing numprocs to 30 again. I think you could comfortably use that jobeinabox container if you wanted, but it's really no solution to the problem as you'd have to repeat all the UID fiddling every time you ran up a new container. I need to understand exactly what's happening and fix it.
None of your containers seem to be using any of the default Jobe UIDs. However, I do note that gitlab-runner is using 999 which is the same UID that jobe uses. I don't see how this could cause the problem, but I will pore again over the runguard code. No more time today, though - I have some "real work" to do :-)
Are you easily able to fire up a new jobeinabox container and check if it runs OK while the gitlab container is stopped? No problem if not - you've given me something to ponder, regardless.
Many thanks again for the debugging help. Stay tuned - I hope to come back within a day or two.
OK, thanks a lot for your help!
Aha. That's it! All suddenly is clear. nginx is using UID 998 - same as jobe00. It creates lots of worker processes, so jobe00 doesn't get a chance.
We had a misunderstanding earlier when I asked you to run that command on the host. You gave me the output
root@e38922e4d00a:/var/www/html/jobe# ps l -u "996,997,998,999"
F UID PID PPID PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND
I failed to notice that you ran the commands in the container, not on the host!
Many thanks - I now know exactly what the problem is. All I need to do now is figure out how to fix it. That requires some thought.
Stay tuned.
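For anyone else hitting this: because the container shares the host's UID space, any host processes owned by one of the clashing UIDs count against the numprocs limit that runguard imposes on the matching jobe user in the container. A quick sketch to confirm the collision on the host:
# Count processes per UID in the range Jobe allocates by default.
for uid in 995 996 997 998 999; do
    printf 'UID %s: ' "$uid"
    ps -u "$uid" --no-headers | wc -l
done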
I've pushed a change to Jobe to allow customised setting of the UIDs allocated to the jobe processes. I also pushed a new version of the Dockerfile and updated the latest jobeinabox image on Docker Hub to make use of the new functionality.
Are you able to check with jobeinabox:latest to confirm that the problem has been solved, please?
Thanks again for the great help in reporting and debugging.
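For the record, re-testing should just be a matter of pulling the updated image and re-running the test suite inside it (a sketch; I'm assuming the usual trampgeek/jobeinabox image name and port mapping):
# Pull the rebuilt image, start it, and run the test suite inside it.
docker pull trampgeek/jobeinabox:latest
docker run -d -p 4000:80 --name jobe trampgeek/jobeinabox:latest
docker exec jobe bash -c "cd /var/www/html/jobe && python3 testsubmit.py"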
Hello, we have been using JOBE for some years now, and we are now switching to a new (virtual) server. I start the JOBE Docker container just as on the old server!
But launching a CodeRunner question I get the following error message:
Do you have any idea what could go wrong?
best regards
jtuttas