Closed benibr closed 3 months ago
Right, I'm looking at the rocky9 development instance in my local setup. It appears to be an issue with the outdated curl / libssh2 used in the test and the stronger security settings in the grid_sftp service. From state/log/sftp.log:
2024-04-08 13:57:41,529 INFO Starting SFTP server
2024-04-08 13:57:41,529 INFO Listening on address 'sftp.migrid.test' and port 2222
2024-04-08 13:57:41,529 INFO accept connections: window_size 16777216 / max_packet_size 524288
2024-04-08 13:58:15,619 INFO Handling new session from <socket.socket fd=7, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('172.18.0.7', 2222), raddr=('172.18.0.1', 51620)> ('172.18.0.1', 51620) (1 active sessions)
2024-04-08 13:58:15,652 INFO Using re-keying sizes 2147483648 bytes / 2147483648 packets
2024-04-08 13:58:15,655 INFO Connected (version 2.0, client libssh2_1.10.0)
2024-04-08 13:58:15,657 ERROR Exception (server): Incompatible ssh server (no acceptable macs)
2024-04-08 13:58:15,658 ERROR Traceback (most recent call last):
2024-04-08 13:58:15,658 ERROR File "/usr/lib/python3.9/site-packages/paramiko/transport.py", line 2187, in run
2024-04-08 13:58:15,658 ERROR self._handler_table[ptype](m)
2024-04-08 13:58:15,658 ERROR File "/usr/lib/python3.9/site-packages/paramiko/transport.py", line 2307, in _negotiate_keys
2024-04-08 13:58:15,658 ERROR self._parse_kex_init(m)
2024-04-08 13:58:15,658 ERROR File "/usr/lib/python3.9/site-packages/paramiko/transport.py", line 2620, in _parse_kex_init
2024-04-08 13:58:15,658 ERROR raise IncompatiblePeer(
2024-04-08 13:58:15,658 ERROR paramiko.ssh_exception.IncompatiblePeer: Incompatible ssh server (no acceptable macs)
2024-04-08 13:58:15,658 ERROR
2024-04-08 13:58:15,659 WARNING client negotiation error for ('172.18.0.1', 51620): Incompatible ssh server (no acceptable macs)
2024-04-08 13:58:15,659 INFO Login from ('172.18.0.1', 51620) failed - closing connection
Upgrading the test to the latest curlimages version seems to help here. I'll try pushing that and see.
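For anyone hitting the same "no acceptable macs" error: SSH picks the first MAC on the client's list that the server also accepts, and the handshake fails if the two lists don't overlap. Below is a minimal sketch of that rule. The algorithm lists are hypothetical, chosen only to illustrate the failure; they are not the actual lists offered by the curl test image or configured in grid_sftp.

```python
def negotiate(client_prefs, server_prefs, kind="macs"):
    """Return the first client-preferred algorithm the server also accepts.

    Mirrors the general SSH rule (first match on the client's list wins);
    raises ValueError when the two lists have nothing in common.
    """
    for name in client_prefs:
        if name in server_prefs:
            return name
    raise ValueError(f"no acceptable {kind}")


# Hypothetical lists, for illustration only.
older_client_macs = ["hmac-sha2-256", "hmac-sha2-512", "hmac-sha1"]
newer_client_macs = ["hmac-sha2-256-etm@openssh.com", "hmac-sha2-256"]
hardened_server_macs = [
    "hmac-sha2-512-etm@openssh.com",
    "hmac-sha2-256-etm@openssh.com",
]

for label, client_macs in (("older", older_client_macs), ("newer", newer_client_macs)):
    try:
        print(label, "client agrees on", negotiate(client_macs, hardened_server_macs))
    except ValueError as err:
        print(label, "client fails:", err)
```

If the client in the test only offers MACs that the hardened service has disabled, the negotiation fails just like in the log above, which would explain why a newer curl image helps.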
Is it possible to rebase the PR or similar to get the CI to re-run with the latest tests, which succeeded in the CI triggered by my latest push? I tried a simple re-run from the failed CI run linked in details here, but it fails again because it stays on the previous commit without the curl upgrade in tests.
Rebased to current master. rocky9 looks good, but rocky8 now gives a 503. Maybe the service started too slowly? I'll give it another try.
So far so good. I can reproduce a 502 in the https test for rocky8 locally here, however, so that is probably another issue :-/
The migrid container crashes with rocky8 and development here. Using make up it silently passes, but e.g. docker ps will show that the container is missing, and a manual docker compose up yields the explanation in the form of an Apache error:
migrid | Run services: httpd script monitor sshmux events cron transfers imnotify vmproxy notify crond rsyslogd
migrid | httpd: Syntax error on line 49 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_wsgi.so into server: /etc/httpd/modules/mod_wsgi.so: cannot open shared object file: No such file or directory
migrid | Failed to start httpd: 1
migrid exited with code 1
After adding build support for the python2 version of mod_wsgi in Dockerfile.rocky8 it works with and without PREFER_PYTHON3 set. We explicitly set it to True in our rocky8 test instance, so we didn't hit the issue there.
If you force push another PR update I guess it will merge cleanly now as CI from my last push passed without errors.
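Since make up passes silently even when a container has died, a small check against the docker ps output could catch this kind of failure earlier. This is only a sketch: the container name "migrid" is taken from the log prefix above and assumed to be the name docker ps reports, and the helper names are made up for illustration.

```python
import subprocess


def missing_containers(expected, ps_output):
    """Return the expected container names that are absent from ps_output.

    ps_output is the text printed by: docker ps --format "{{.Names}}"
    (one running container name per line).
    """
    running = {line.strip() for line in ps_output.splitlines() if line.strip()}
    return [name for name in expected if name not in running]


def running_container_names():
    """Ask the local docker CLI for the names of all running containers."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling missing_containers(["migrid"], running_container_names()) after make up returns ["migrid"] if that container exited during startup, at which point a manual docker compose up shows the actual error.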
Thanks for investigating. I was just about to start diffing our prod .envs, because I know I stumbled upon this multiple times during testing but couldn't remember the exact parameter that was causing it. Since this is an error that is not easy to find, do you think we should add this to the troubleshooting section in the docs?
Well, this particular cause should be eliminated now, and we are moving to a situation where python2 will finally disappear this summer. Yet, documenting where to look if the migrid container fails to start like this would be good.
We saw a similar case of httpd refusing to launch when the usual upgrade of the OpenID Connect auth module failed because upstream pulled the older RHEL/CentOS 7 rpm package upon new releases. Docker builds therefore ended up with the outdated distro version, which didn't support the required Apache conf options.
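As a possible starting point for such a troubleshooting note, here is a rough sketch that pulls the failing module out of an httpd startup error like the mod_wsgi one quoted earlier in this thread. The regular expression is written against that single log line only, and the function name is made up for illustration; other httpd errors (such as unsupported conf options) have a different shape and would not match.

```python
import re

# Matches the "Cannot load <module> into server: <path>: <reason>" part of an
# httpd syntax error, as seen in the mod_wsgi failure quoted in this thread.
MODULE_ERROR = re.compile(
    r"Cannot load (?P<module>\S+) into server: (?P<path>\S+): (?P<reason>.+)$"
)


def explain_httpd_failure(line):
    """Return module, path and reason for a module load error, else None."""
    match = MODULE_ERROR.search(line)
    if match is None:
        return None
    return match.groupdict()


sample = (
    "migrid | httpd: Syntax error on line 49 of /etc/httpd/conf/httpd.conf: "
    "Cannot load modules/mod_wsgi.so into server: "
    "/etc/httpd/modules/mod_wsgi.so: cannot open shared object file: "
    "No such file or directory"
)
print(explain_httpd_failure(sample))
```

The output of a manual docker compose up can be fed through this line by line to point at the missing shared object.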
Hmm there is one problem with SSH in rocky9: https://github.com/ucphhpc/docker-migrid/actions/runs/8598840807/job/23560490622?pr=53