Closed xavierog closed 2 years ago
Hi @xavierog thank you for such a detailed investigation. Do you want to open a pull request, add procps to the list of installed packages, and verify that the issue is resolved?
This would need to be added in the NodeBase/Dockerfile so the browser node and standalone images inherit it.
I did look through the upstream docker-selenium images. I don't see procps installed anywhere. I guess some other package we are installing there must be depending on the procps package, but I am not sure.
Also, just curious, do you know why noVNC works even though there is this issue? I am able to view the desktop.
Do you want to open a pull request
Yes, I will probably do so, but dumping my conclusions into this bug report was the most I could do before heading to bed.
add procps to the list of installed packages and verify that the issue is resolved?
Actually, I went for a much simpler approach (involving laziness and some ELF-related luck). Consider these commands:
# mapping /usr/bin/ps as "sp" (typo intended):
podman run -v /usr/bin/ps:/usr/bin/sp -v /lib/aarch64-linux-gnu/libprocps.so.8:/lib/aarch64-linux-gnu/libprocps.so.8 --rm -it -p 4444:4444 --shm-size 800m --cap-add=SYS_PTRACE seleniarm/standalone-chromium:latest
# => I reproduce this issue (of course)
# mapping /usr/bin/ps as "ps", thus making it available:
podman run -v /usr/bin/ps:/usr/bin/ps -v /lib/aarch64-linux-gnu/libprocps.so.8:/lib/aarch64-linux-gnu/libprocps.so.8 --rm -it -p 4444:4444 --shm-size 800m --cap-add=SYS_PTRACE seleniarm/standalone-chromium:latest
# => everything works fine
However, I am not sure adding ps
is THE solution. ps -p ${proxy_pid} >/dev/null
looks like a syscall-expensive way to do [ -d "/proc/${proxy_pid}" ]
This would need to be added in the NodeBase/Dockerfile so the browser node and standalone images inherit it.
Thanks, that should be useful when I have a look at it.
I did look through the docker-selenium images.
Indeed, ps
is present in selenium/standalone-chrome.
I don't see procps installed anywhere. I guess some other package we are installing there must be depending on the procps package, but I am not sure.
That is a good question. apt-cache rdepends
does not really help. Maybe it is there because it also provides kill
among other tools?
Also, just curious, do you know why noVNC works even though there is this issue? I am able to view the desktop.
No idea. This entire bugreport stems from noVNC not working on my side.
I want to share some observations on my end. I ran the latest images, released yesterday (4.5.0), as well as 4.3.0 and 4.4.0. noVNC works in seleniarm/standalone-chromium for me in all 3 versions. However, what's odd is I don't have ps installed.
If adding in procps tools resolves it for you, and gets rid of the error in the logs to make the system run cleaner, then let's continue as planned. But I'm at a loss as to how it works for me, despite the missing ps command.
Just for the purposes of more clearly documenting the issue, here is the command and logs from my macOS M1. You'll see the novnc exit in the terminal logs, and in the next block of log output, in the novnc-stderr.log, you'll see the error on line 183 due to the missing ps:
podman run --rm -it -p 4444:4444 -p 7900:7900 --shm-size 3g seleniarm/standalone-chromium:4.4.0-20220903
2022-09-30 02:53:46,653 INFO Included extra file "/etc/supervisor/conf.d/selenium.conf" during parsing
2022-09-30 02:53:46,667 INFO RPC interface 'supervisor' initialized
2022-09-30 02:53:46,668 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2022-09-30 02:53:46,668 INFO supervisord started with pid 3
2022-09-30 02:53:47,675 INFO spawned: 'xvfb' with pid 4
2022-09-30 02:53:47,680 INFO spawned: 'vnc' with pid 5
2022-09-30 02:53:47,686 INFO spawned: 'novnc' with pid 6
2022-09-30 02:53:47,704 INFO spawned: 'selenium-standalone' with pid 8
Setting up SE_NODE_GRID_URL...
2022-09-30 02:53:47,728 INFO success: xvfb entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2022-09-30 02:53:47,729 INFO success: vnc entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2022-09-30 02:53:47,729 INFO success: novnc entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2022-09-30 02:53:47,730 INFO success: selenium-standalone entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
Selenium Grid Standalone configuration:
[network]
relax-checks = true
[node]
session-timeout = "300"
override-max-sessions = false
detect-drivers = false
drain-after-session-count = 0
max-sessions = 1
[[node.driver-configuration]]
display-name = "chrome"
stereotype = '{"browserName": "chrome", "browserVersion": "105.0", "platformName": "Linux"}'
max-sessions = 1
Starting Selenium Grid Standalone...
02:53:49.074 INFO [LoggingOptions.configureLogEncoding] - Using the system default encoding
2022-09-30 02:53:49,077 INFO exited: novnc (exit status 1; not expected)
02:53:49.081 INFO [OpenTelemetryTracer.createTracer] - Using OpenTelemetry for tracing
02:53:50.161 INFO [NodeOptions.getSessionFactories] - Detected 1 available processors
02:53:50.248 INFO [NodeOptions.report] - Adding chrome for {"browserVersion": "105.0","se:noVncPort": 7900,"browserName": "chrome","platformName": "LINUX","se:vncEnabled": true} 1 times
02:53:50.278 INFO [Node.<init>] - Binding additional locator mechanisms: id, name, relative
02:53:50.321 INFO [GridModel.setAvailability] - Switching Node f5d8fa47-b4bf-4fbd-85d0-ab89310ed315 (uri: http://10.88.0.6:4444) from DOWN to UP
02:53:50.325 INFO [LocalDistributor.add] - Added node f5d8fa47-b4bf-4fbd-85d0-ab89310ed315 at http://10.88.0.6:4444. Health check every 120s
02:53:50.581 INFO [Standalone.execute] - Started Selenium Standalone 4.4.0 (revision e5c75ed026a): http://10.88.0.6:4444
And here is the logs from novnc-stderr.log:
seluser@7a2d27a09eed:/var/log/supervisor$ tail -fn300 novnc-stderr.log
/opt/bin/noVNC/utils/websockify/websockify/websocket.py:31: UserWarning: no 'numpy' module, HyBi protocol will be slower
warnings.warn("no 'numpy' module, HyBi protocol will be slower")
WebSocket server settings:
- Listen on :7900
- Web server. Web root: /opt/bin/noVNC
- No SSL/TLS support (no cert file)
- proxying from :7900 to localhost:5900
/opt/bin/noVNC/utils/launch.sh: line 183: ps: command not found
As we see here, ps is not found. Here's a screenshot of noVNC with Chromium running and a visible "ps not found" in the Debian terminal:
And here is the logs from novnc-stderr.log:
seluser@7a2d27a09eed:/var/log/supervisor$ tail -fn300 novnc-stderr.log /opt/bin/noVNC/utils/websockify/websockify/websocket.py:31: UserWarning: no 'numpy' module, HyBi protocol will be slower warnings.warn("no 'numpy' module, HyBi protocol will be slower") WebSocket server settings: - Listen on :7900 - Web server. Web root: /opt/bin/noVNC - No SSL/TLS support (no cert file) - proxying from :7900 to localhost:5900 /opt/bin/noVNC/utils/launch.sh: line 183: ps: command not found
As we see here, ps is not found.
I think we have a race condition here: in my novnc-stderr.log
, ps: command not found
appears right after the numpy warning and the 5 WebSocket server settings
lines are precisely the write() syscalls that go wrong on my side due to the invalid file descriptors. If your machine is fast enough to write these lines before the ps
-induced invalidation of file descriptors (and an M1 CPU ought to be), then your python/websockify process will simply work like a charm. As to the computational power of my host... well, it is literally named "potato".
Race condition makes a lot of sense. Thanks again for the investigations.
What happened?
Hello,
I first tried selenium/standalone-chrome on x86 and encountered no issue. Later, I tried seleniarm/standalone-chromium and noticed establishing a WebSocket to reach the VNC service was failing. Initial tcpdump-based investigations showed that the websockify process in the container was accepting TCP connections but also closing them right after they were established, without reading any data. Further strace-based investigations showed that the Python process was likely struggling with its stdout and stderr file descriptors being unusable: specifically, writing to stderr results in EPIPE (Broken pipe):
Workaround: entering the container, killing websockify and running it manually (thus providing it with proper stdout and stderr file descriptors) makes the whole thing work again.
But why is the Python/websockify process in such a state? According to
/etc/supervisor/conf.d/selenium.conf
, supervisord should run it with:But alas...
Indeed,
/opt/bin/noVNC/utils/launch.sh
contains:... and ps is indeed nowhere to be found in the image. So, what happens? ps is missing,
ps -p ${proxy_pid}
exits with a non-zero return code (presumably 127), the scriptexit 1
, supervisord notices it and closes its file descriptors to/var/log/supervisor/novnc-stdout.log
and/var/log/supervisor/novnc-stderr.log
. The Python process is still running but its stdout and stderr file descriptors are now unusable.Conclusion: shipping ps in the image should resolve everything. Also: feel free to ship a few other tools that would have made this investigation easier.
Command used to start Selenium Grid with Docker
Relevant log output
Operating System
GNU/Linux Rapsbian 11.4
Docker Selenium version (tag)
4.4.0-20220903