seleniumhq-community / docker-seleniarm

Multi-Arch (arm64/armhf/amd64) Docker images for the Selenium Grid Server
https://hub.docker.com/u/seleniarm
Other
248 stars 24 forks source link

[🐛 Bug]: noVNC crashing - websockify is not launched properly due to lack of ps command #18

Closed xavierog closed 2 years ago

xavierog commented 2 years ago

What happened?

Hello,

I first tried selenium/standalone-chrome on x86 and encountered no issue. Later, I tried seleniarm/standalone-chromium and noticed establishing a WebSocket to reach the VNC service was failing. Initial tcpdump-based investigations showed that the websockify process in the container was accepting TCP connections but also closing them right after they were established, without reading any data. Further strace-based investigations showed that the Python process was likely struggling with its stdout and stderr file descriptors being unusable: specifically, writing to stderr results in EPIPE (Broken pipe):

# Accepting the incoming TCP connection:
accept4(3, {sa_family=AF_INET, sin_port=htons(42242), sin_addr=inet_addr("10.0.2.100")}, [16], SOCK_CLOEXEC) = 4
getsockname(4, {sa_family=AF_INET, sin_port=htons(7900), sin_addr=inet_addr("10.0.2.100")}, [128->16]) = 0

# Removing some boring stuff here: see "Relevant log output" if needed

# Relevant stuff: trying to write() to stderr leads to EPIPE, which likely triggers exceptions, until Python somehow gives up:
write(2, "WebSocket server settings:\n--- Logging error ---\n  - Listen on :7900\n--- Logging error ---\n  - Web server. Web root: /opt/bin/noVNC\n--- Logging error ---\n  -
 No SSL/TLS support (no cert file)\n--- Logging error ---\n  - proxying from :7900 to localhost:5900\n--- Logging error ---\n", 278) = -1 EPIPE (Broken pipe)
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=38, si_uid=1200} ---
getpid()                                = 38
write(2, "WebSocket server settings:\n--- Logging error ---\n  - Listen on :7900\n--- Logging error ---\n  - Web server. Web root: /opt/bin/noVNC\n--- Logging error ---\n  -
 No SSL/TLS support (no cert file)\n--- Logging error ---\n  - proxying from :7900 to localhost:5900\n--- Logging error ---\nhandler exception: [Errno 32] Broken pipe\n", 32
0) = -1 EPIPE (Broken pipe)
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=38, si_uid=1200} ---
write(2, "WebSocket server settings:\n--- Logging error ---\n  - Listen on :7900\n--- Logging error ---\n  - Web server. Web root: /opt/bin/noVNC\n--- Logging error ---\n  -
 No SSL/TLS support (no cert file)\n--- Logging error ---\n  - proxying from :7900 to localhost:5900\n--- Logging error ---\nhandler exception: [Errno 32] Broken pipe\n--- L
ogging error ---\n", 342) = -1 EPIPE (Broken pipe)
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=38, si_uid=1200} ---

# Closing the TCP connection:
close(4)                                = 0
pselect6(4, [3], [], [], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(4, [3], [], [], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(4, [3], [], [], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)

Workaround: entering the container, killing websockify and running it manually (thus providing it with proper stdout and stderr file descriptors) makes the whole thing work again.

But why is the Python/websockify process in such a state? According to /etc/supervisor/conf.d/selenium.conf, supervisord should run it with:

stdout_logfile=/var/log/supervisor/novnc-stdout.log
stderr_logfile=/var/log/supervisor/novnc-stderr.log

But alas...

$ cat /var/log/supervisor/novnc-stderr.log
/opt/bin/noVNC/utils/websockify/websockify/websocket.py:31: UserWarning: no 'numpy' module, HyBi protocol will be slower
  warnings.warn("no 'numpy' module, HyBi protocol will be slower")
/opt/bin/noVNC/utils/launch.sh: line 183: ps: command not found

Indeed, /opt/bin/noVNC/utils/launch.sh contains:

echo "Starting webserver and WebSockets proxy on port ${PORT}"
#${HERE}/websockify --web ${WEB} ${CERT:+--cert ${CERT}} ${PORT} ${VNC_DEST} &
${WEBSOCKIFY} ${SYSLOG_ARG} ${SSLONLY} --web ${WEB} ${CERT:+--cert ${CERT}} ${KEY:+--key ${KEY}} ${PORT} ${VNC_DEST} ${HEARTBEAT_ARG} ${IDLETIMEOUT_ARG} ${RECORD_ARG} ${TIMEOUT_ARG} &
proxy_pid="$!"
sleep 1
if ! ps -p ${proxy_pid} >/dev/null; then
    proxy_pid=
    echo "Failed to start WebSockets proxy"
    exit 1
fi

... and ps is indeed nowhere to be found in the image. So, what happens? ps is missing, ps -p ${proxy_pid} exits with a non-zero return code (presumably 127), the script exit 1, supervisord notices it and closes its file descriptors to /var/log/supervisor/novnc-stdout.log and /var/log/supervisor/novnc-stderr.log. The Python process is still running but its stdout and stderr file descriptors are now unusable.

Conclusion: shipping ps in the image should resolve everything. Also: feel free to ship a few other tools that would have made this investigation easier.

Command used to start Selenium Grid with Docker

podman run --rm -it -p 4444:4444 --shm-size 800m --cap-add=SYS_PTRACE seleniarm/standalone-chromium:latest

Relevant log output

accept4(3, {sa_family=AF_INET, sin_port=htons(42242), sin_addr=inet_addr("10.0.2.100")}, [16], SOCK_CLOEXEC) = 4
getsockname(4, {sa_family=AF_INET, sin_port=htons(7900), sin_addr=inet_addr("10.0.2.100")}, [128->16]) = 0
getpid()                                = 38
getpid()                                = 38
newfstatat(AT_FDCWD, "/usr/lib/python3.10/multiprocessing", {st_mode=S_IFDIR|0755, st_size=576, ...}, 0) = 0
newfstatat(AT_FDCWD, "/usr/lib/python3.10/multiprocessing/popen_fork.py", {st_mode=S_IFREG|0644, st_size=2377, ...}, 0) = 0
newfstatat(AT_FDCWD, "/usr/lib/python3.10/multiprocessing/popen_fork.py", {st_mode=S_IFREG|0644, st_size=2377, ...}, 0) = 0
openat(AT_FDCWD, "/usr/lib/python3.10/multiprocessing/__pycache__/popen_fork.cpython-310.pyc", O_RDONLY|O_CLOEXEC) = 5
newfstatat(5, "", {st_mode=S_IFREG|0644, st_size=2521, ...}, AT_EMPTY_PATH) = 0
ioctl(5, TCGETS, 0xffffc2bc5bd0)        = -1 ENOTTY (Inappropriate ioctl for device)
lseek(5, 0, SEEK_CUR)                   = 0
lseek(5, 0, SEEK_CUR)                   = 0
newfstatat(5, "", {st_mode=S_IFREG|0644, st_size=2521, ...}, AT_EMPTY_PATH) = 0
read(5, "o\r\r\n\0\0"..., 2522) = 2521
read(5, "", 1)                          = 0
close(5)                                = 0
newfstatat(AT_FDCWD, "/usr/lib/python3.10/multiprocessing", {st_mode=S_IFDIR|0755, st_size=576, ...}, 0) = 0
newfstatat(AT_FDCWD, "/usr/lib/python3.10/multiprocessing/util.py", {st_mode=S_IFREG|0644, st_size=14024, ...}, 0) = 0
newfstatat(AT_FDCWD, "/usr/lib/python3.10/multiprocessing/util.py", {st_mode=S_IFREG|0644, st_size=14024, ...}, 0) = 0
openat(AT_FDCWD, "/usr/lib/python3.10/multiprocessing/__pycache__/util.cpython-310.pyc", O_RDONLY|O_CLOEXEC) = 5
newfstatat(5, "", {st_mode=S_IFREG|0644, st_size=11497, ...}, AT_EMPTY_PATH) = 0                                                                                  
ioctl(5, TCGETS, 0xffffc2bc45f0)        = -1 ENOTTY (Inappropriate ioctl for device)
lseek(5, 0, SEEK_CUR)                   = 0
lseek(5, 0, SEEK_CUR)                   = 0
newfstatat(5, "", {st_mode=S_IFREG|0644, st_size=11497, ...}, AT_EMPTY_PATH) = 0
read(5, "o\r\r\n\0\0\0\0D\22"..., 11498) = 11497
read(5, "", 1)                          = 0
close(5)                                = 0
prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024*1024, rlim_max=1024*1024}) = 0
write(2, "WebSocket server settings:\n--- Logging error ---\n  - Listen on :7900\n--- Logging error ---\n  - Web server. Web root: /opt/bin/noVNC\n--- Logging error ---\n  -
 No SSL/TLS support (no cert file)\n--- Logging error ---\n  - proxying from :7900 to localhost:5900\n--- Logging error ---\n", 278) = -1 EPIPE (Broken pipe)
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=38, si_uid=1200} ---
getpid()                                = 38
write(2, "WebSocket server settings:\n--- Logging error ---\n  - Listen on :7900\n--- Logging error ---\n  - Web server. Web root: /opt/bin/noVNC\n--- Logging error ---\n  -
 No SSL/TLS support (no cert file)\n--- Logging error ---\n  - proxying from :7900 to localhost:5900\n--- Logging error ---\nhandler exception: [Errno 32] Broken pipe\n", 32
0) = -1 EPIPE (Broken pipe)
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=38, si_uid=1200} ---
write(2, "WebSocket server settings:\n--- Logging error ---\n  - Listen on :7900\n--- Logging error ---\n  - Web server. Web root: /opt/bin/noVNC\n--- Logging error ---\n  -
 No SSL/TLS support (no cert file)\n--- Logging error ---\n  - proxying from :7900 to localhost:5900\n--- Logging error ---\nhandler exception: [Errno 32] Broken pipe\n--- L
ogging error ---\n", 342) = -1 EPIPE (Broken pipe)
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=38, si_uid=1200} ---
close(4)                                = 0
pselect6(4, [3], [], [], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(4, [3], [], [], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)
pselect6(4, [3], [], [], {tv_sec=1, tv_nsec=0}, NULL) = 0 (Timeout)

Operating System

GNU/Linux Rapsbian 11.4

Docker Selenium version (tag)

4.4.0-20220903

jamesmortensen commented 2 years ago

Hi @xavierog thank you for such a detailed investigation. Do you want to open a pull request, add procps to the list of installed packages, and verify that the issue is resolved?

This would need to be added in the NodeBase/Dockerfile so the browser node and standalone images inherit it.

I did look through the upstream docker-selenium images. I don't see procps installed anywhere. I guess some other package we are installing there must be depending on the procps package, but I am not sure.

Also, just curious, do you know why noVNC works even though there is this issue? I am able to view the desktop.

xavierog commented 2 years ago

Do you want to open a pull request

Yes, I will probably do so, but dumping my conclusions into this bug report was the most I could do before heading to bed.

add procps to the list of installed packages and verify that the issue is resolved?

Actually, I went for a much simpler approach (involving laziness and some ELF-related luck). Consider these commands:

# mapping /usr/bin/ps as "sp" (typo intended):
podman run -v /usr/bin/ps:/usr/bin/sp -v /lib/aarch64-linux-gnu/libprocps.so.8:/lib/aarch64-linux-gnu/libprocps.so.8 --rm -it -p 4444:4444 --shm-size 800m --cap-add=SYS_PTRACE seleniarm/standalone-chromium:latest
# => I reproduce this issue (of course)

# mapping /usr/bin/ps as "ps", thus making it available:
podman run -v /usr/bin/ps:/usr/bin/ps -v /lib/aarch64-linux-gnu/libprocps.so.8:/lib/aarch64-linux-gnu/libprocps.so.8 --rm -it -p 4444:4444 --shm-size 800m --cap-add=SYS_PTRACE  seleniarm/standalone-chromium:latest
# => everything works fine

However, I am not sure adding ps is THE solution. ps -p ${proxy_pid} >/dev/null looks like a syscall-expensive way to do [ -d "/proc/${proxy_pid}" ]

This would need to be added in the NodeBase/Dockerfile so the browser node and standalone images inherit it.

Thanks, that should be useful when I have a look at it.

I did look through the docker-selenium images.

Indeed, ps is present in selenium/standalone-chrome.

I don't see procps installed anywhere. I guess some other package we are installing there must be depending on the procps package, but I am not sure.

That is a good question. apt-cache rdepends does not really help. Maybe it is there because it also provides kill among other tools?

Also, just curious, do you know why noVNC works even though there is this issue? I am able to view the desktop.

No idea. This entire bugreport stems from noVNC not working on my side.

jamesmortensen commented 2 years ago

I want to share some observations on my end. I ran the latest images, released yesterday (4.5.0), as well as 4.3.0 and 4.4.0. noVNC works in seleniarm/standalone-chromium for me in all 3 versions. However, what's odd is I don't have ps installed.

If adding in procps tools resolves it for you, and gets rid of the error in the logs to make the system run cleaner, then let's continue as planned. But I'm at a loss as to how it works for me, despite the missing ps command.

Just for the purposes of more clearly documenting the issue, here is the command and logs from my macOS M1. You'll see the novnc exit in the terminal logs, and in the next block of log output, in the novnc-stderr.log, you'll see the error on line 183 due to the missing ps:

podman run --rm -it -p 4444:4444 -p 7900:7900 --shm-size 3g seleniarm/standalone-chromium:4.4.0-20220903

2022-09-30 02:53:46,653 INFO Included extra file "/etc/supervisor/conf.d/selenium.conf" during parsing
2022-09-30 02:53:46,667 INFO RPC interface 'supervisor' initialized
2022-09-30 02:53:46,668 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2022-09-30 02:53:46,668 INFO supervisord started with pid 3
2022-09-30 02:53:47,675 INFO spawned: 'xvfb' with pid 4
2022-09-30 02:53:47,680 INFO spawned: 'vnc' with pid 5
2022-09-30 02:53:47,686 INFO spawned: 'novnc' with pid 6
2022-09-30 02:53:47,704 INFO spawned: 'selenium-standalone' with pid 8
Setting up SE_NODE_GRID_URL...
2022-09-30 02:53:47,728 INFO success: xvfb entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2022-09-30 02:53:47,729 INFO success: vnc entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2022-09-30 02:53:47,729 INFO success: novnc entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2022-09-30 02:53:47,730 INFO success: selenium-standalone entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
Selenium Grid Standalone configuration: 
[network]
relax-checks = true

[node]
session-timeout = "300"
override-max-sessions = false
detect-drivers = false
drain-after-session-count = 0
max-sessions = 1

[[node.driver-configuration]]
display-name = "chrome"
stereotype = '{"browserName": "chrome", "browserVersion": "105.0", "platformName": "Linux"}'
max-sessions = 1

Starting Selenium Grid Standalone...
02:53:49.074 INFO [LoggingOptions.configureLogEncoding] - Using the system default encoding
2022-09-30 02:53:49,077 INFO exited: novnc (exit status 1; not expected)
02:53:49.081 INFO [OpenTelemetryTracer.createTracer] - Using OpenTelemetry for tracing
02:53:50.161 INFO [NodeOptions.getSessionFactories] - Detected 1 available processors
02:53:50.248 INFO [NodeOptions.report] - Adding chrome for {"browserVersion": "105.0","se:noVncPort": 7900,"browserName": "chrome","platformName": "LINUX","se:vncEnabled": true} 1 times
02:53:50.278 INFO [Node.<init>] - Binding additional locator mechanisms: id, name, relative
02:53:50.321 INFO [GridModel.setAvailability] - Switching Node f5d8fa47-b4bf-4fbd-85d0-ab89310ed315 (uri: http://10.88.0.6:4444) from DOWN to UP
02:53:50.325 INFO [LocalDistributor.add] - Added node f5d8fa47-b4bf-4fbd-85d0-ab89310ed315 at http://10.88.0.6:4444. Health check every 120s
02:53:50.581 INFO [Standalone.execute] - Started Selenium Standalone 4.4.0 (revision e5c75ed026a): http://10.88.0.6:4444

And here is the logs from novnc-stderr.log:

seluser@7a2d27a09eed:/var/log/supervisor$ tail -fn300 novnc-stderr.log 
/opt/bin/noVNC/utils/websockify/websockify/websocket.py:31: UserWarning: no 'numpy' module, HyBi protocol will be slower
  warnings.warn("no 'numpy' module, HyBi protocol will be slower")
WebSocket server settings:
  - Listen on :7900
  - Web server. Web root: /opt/bin/noVNC
  - No SSL/TLS support (no cert file)
  - proxying from :7900 to localhost:5900
/opt/bin/noVNC/utils/launch.sh: line 183: ps: command not found

As we see here, ps is not found. Here's a screenshot of noVNC with Chromium running and a visible "ps not found" in the Debian terminal:

Screenshot 2022-09-30 at 8 49 10 AM
xavierog commented 2 years ago

And here is the logs from novnc-stderr.log:

seluser@7a2d27a09eed:/var/log/supervisor$ tail -fn300 novnc-stderr.log 
/opt/bin/noVNC/utils/websockify/websockify/websocket.py:31: UserWarning: no 'numpy' module, HyBi protocol will be slower
  warnings.warn("no 'numpy' module, HyBi protocol will be slower")
WebSocket server settings:
  - Listen on :7900
  - Web server. Web root: /opt/bin/noVNC
  - No SSL/TLS support (no cert file)
  - proxying from :7900 to localhost:5900
/opt/bin/noVNC/utils/launch.sh: line 183: ps: command not found

As we see here, ps is not found.

I think we have a race condition here: in my novnc-stderr.log, ps: command not found appears right after the numpy warning and the 5 WebSocket server settings lines are precisely the write() syscalls that go wrong on my side due to the invalid file descriptors. If your machine is fast enough to write these lines before the ps-induced invalidation of file descriptors (and an M1 CPU ought to be), then your python/websockify process will simply work like a charm. As to the computational power of my host... well, it is literally named "potato".

jamesmortensen commented 2 years ago

Race condition makes a lot of sense. Thanks again for the investigations.