ucphhpc / docker-migrid

Containerized MiG
GNU General Public License v2.0

CI: more container OS variants #53

Closed benibr closed 3 months ago

benibr commented 3 months ago

Hmm there is one problem with SSH in rocky9: https://github.com/ucphhpc/docker-migrid/actions/runs/8598840807/job/23560490622?pr=53

 Running test-30-migrid-sftp-read.sh: failed
WARNING: Localhost DNS setting (--dns=127.0.0.1) may fail in containers.
* processing: sftp://sftp.migrid.test:2222/welcome.txt
*   Trying 172.17.0.1:2222...
* Connected to sftp.migrid.test (172.17.0.1) port 2222
* Failure establishing ssh session: -5, Unable to exchange encryption keys
* Closing connection
make: *** [Makefile:161: test] Error 1

jonasbardino commented 3 months ago

Right, I'm looking at the rocky9 development instance in my local setup. It appears to be a mismatch between the outdated curl / libssh2 used in the test and the stronger security settings in the grid_sftp service. From state/log/sftp.log:

2024-04-08 13:57:41,529 INFO Starting SFTP server
2024-04-08 13:57:41,529 INFO Listening on address 'sftp.migrid.test' and port 2222
2024-04-08 13:57:41,529 INFO accept connections: window_size 16777216 / max_packet_size 524288
2024-04-08 13:58:15,619 INFO Handling new session from <socket.socket fd=7, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('172.18.0.7', 2222), raddr=('172.18.0.1', 51620)> ('172.18.0.1', 51620) (1 active sessions)
2024-04-08 13:58:15,652 INFO Using re-keying sizes 2147483648 bytes / 2147483648 packets
2024-04-08 13:58:15,655 INFO Connected (version 2.0, client libssh2_1.10.0)
2024-04-08 13:58:15,657 ERROR Exception (server): Incompatible ssh server (no acceptable macs)
2024-04-08 13:58:15,658 ERROR Traceback (most recent call last):
2024-04-08 13:58:15,658 ERROR   File "/usr/lib/python3.9/site-packages/paramiko/transport.py", line 2187, in run
2024-04-08 13:58:15,658 ERROR     self._handler_table[ptype](m)
2024-04-08 13:58:15,658 ERROR   File "/usr/lib/python3.9/site-packages/paramiko/transport.py", line 2307, in _negotiate_keys
2024-04-08 13:58:15,658 ERROR     self._parse_kex_init(m)
2024-04-08 13:58:15,658 ERROR   File "/usr/lib/python3.9/site-packages/paramiko/transport.py", line 2620, in _parse_kex_init
2024-04-08 13:58:15,658 ERROR     raise IncompatiblePeer(
2024-04-08 13:58:15,658 ERROR paramiko.ssh_exception.IncompatiblePeer: Incompatible ssh server (no acceptable macs)
2024-04-08 13:58:15,658 ERROR 
2024-04-08 13:58:15,659 WARNING client negotiation error for ('172.18.0.1', 51620): Incompatible ssh server (no acceptable macs)
2024-04-08 13:58:15,659 INFO Login from ('172.18.0.1', 51620) failed - closing connection

Upgrading the test to the latest curlimages version seems to help here. I'll try pushing that and see.
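
For reference, the "no acceptable macs" error in the log above comes down to the MAC selection rule in RFC 4253 section 7.1: the chosen MAC is the first entry in the client's list that the server also allows, and the handshake fails if there is none. A minimal sketch of that rule follows. The algorithm lists are hypothetical and only illustrate the shape of the problem; they are not the actual lists offered by libssh2 1.10.0 or configured in grid_sftp.

```python
from typing import Sequence


class NoAcceptableAlgorithm(Exception):
    """Raised when client and server share no algorithm of a given kind."""


def negotiate(kind: str, client_offer: Sequence[str],
              server_allowed: Sequence[str]) -> str:
    """Return the first algorithm in the client's list that the server allows."""
    for name in client_offer:
        if name in server_allowed:
            return name
    raise NoAcceptableAlgorithm(f"no acceptable {kind}")


# Hypothetical lists, for illustration only (assumptions, not real configs).
OLD_CLIENT_MACS = ["hmac-sha1", "hmac-md5"]
NEW_CLIENT_MACS = ["hmac-sha2-256-etm@openssh.com", "hmac-sha2-256", "hmac-sha1"]
HARDENED_SERVER_MACS = ["hmac-sha2-512-etm@openssh.com",
                        "hmac-sha2-256-etm@openssh.com"]
```

With these example lists, `negotiate("macs", NEW_CLIENT_MACS, HARDENED_SERVER_MACS)` picks a shared MAC, while the same call with `OLD_CLIENT_MACS` raises, which is the situation a client upgrade would resolve.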

jonasbardino commented 3 months ago

Is it possible to rebase the PR or similar to get the CI to re-run with the latest tests, which succeeded in the CI triggered by my latest push? I tried a simple re-run from the failed CI run linked in details here, but it fails again because it stays on the previous commit without the curl upgrade in tests.

benibr commented 3 months ago

Rebased to current master. rocky9 looks good, but rocky8 now gives a 503. Maybe the service started too slowly? I'll give it another try.
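
One way to rule out slow startup as the cause of a 5xx in a test is to poll the endpoint a few times before declaring failure. A rough sketch, assuming a caller-supplied `get_status` function that returns the HTTP status code; the function name, attempt count, and delay are hypothetical and not part of the existing test scripts:

```python
import time
from typing import Callable


def wait_until_ready(get_status: Callable[[], int],
                     attempts: int = 10,
                     delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll get_status until it returns a non-5xx code or attempts run out."""
    for attempt in range(attempts):
        if get_status() < 500:
            return True
        # Do not sleep after the final attempt.
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

If this still returns False after a generous number of attempts, the service is more likely down than slow.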

jonasbardino commented 3 months ago

So far so good. I can reproduce a 502 in the https test for rocky8 locally here, however, so that is probably another issue :-/

jonasbardino commented 3 months ago

The migrid container crashes with rocky8 and development here. With make up it silently passes, but e.g. docker ps will show that the container is missing, and a manual docker compose up yields the explanation in the form of an Apache error:

migrid              | Run services: httpd script monitor sshmux events cron transfers imnotify vmproxy notify crond rsyslogd
migrid              | httpd: Syntax error on line 49 of /etc/httpd/conf/httpd.conf: Cannot load modules/mod_wsgi.so into server: /etc/httpd/modules/mod_wsgi.so: cannot open shared object file: No such file or directory
migrid              | Failed to start httpd: 1
migrid exited with code 1
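
Since make up passes silently in this case, a check for missing containers could catch the crash earlier. A minimal sketch, assuming the container is named migrid as in the log prefix above and that `docker ps --format '{{.Names}}'` is available; the helper names are hypothetical:

```python
import subprocess
from typing import Iterable, List


def missing_containers(expected: Iterable[str], ps_names_output: str) -> List[str]:
    """Return the expected container names absent from `docker ps` name output."""
    running = {line.strip() for line in ps_names_output.splitlines() if line.strip()}
    return [name for name in expected if name not in running]


def running_container_names() -> str:
    """Return one running container name per line, as printed by docker ps."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Example use (requires a working docker CLI):
#   missing = missing_containers(["migrid"], running_container_names())
#   if missing: raise SystemExit(f"containers not running: {missing}")
```
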

jonasbardino commented 3 months ago

After adding build support for the python2 version of mod_wsgi in Dockerfile.rocky8 it works with and without PREFER_PYTHON3 set. We explicitly set it to True in our rocky8 test instance, so we didn't hit the issue there.

jonasbardino commented 3 months ago

If you force push another PR update I guess it will merge cleanly now as CI from my last push passed without errors.

benibr commented 3 months ago

> After adding build support for the python2 version of mod_wsgi in Dockerfile.rocky8 it works with and without PREFER_PYTHON3 set. We explicitly set it to True in our rocky8 test instance, so we didn't hit the issue there.

Thanks for investigating. I was just about to start diff'ing our prod .envs, because I know I stumbled upon this multiple times during testing but couldn't remember the exact parameter that was causing it. Since this is an error that is not easy to find, do you think we should add it to the troubleshooting section in the docs?

jonasbardino commented 3 months ago

Well, this particular cause should be eliminated now, and we are moving to a situation where python2 will finally disappear this summer. Still, documenting where to look if the migrid container fails to start like this would be good. We saw a similar case of httpd refusing to launch when the usual upgrade of the OpenID Connect auth module failed because upstream pulls the older RHEL/CentOS 7 rpm package upon new releases. The docker builds therefore ended up with the outdated distro version, which didn't support the required Apache conf options.
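
As a starting point for such a troubleshooting note, the module-load failure from the docker compose up output earlier in this thread can be picked out of the container log mechanically. A small sketch, assuming the log text has been captured to a string; the function name is hypothetical and the pattern only covers the "Cannot load ... into server" form seen here, not the OpenID Connect case:

```python
import re
from typing import List, Tuple

# Matches Apache's module-load failure as it appears in the log in this thread.
_CANNOT_LOAD = re.compile(r"Cannot load (\S+) into server: (.+)$", re.MULTILINE)


def find_unloadable_modules(log_text: str) -> List[Tuple[str, str]]:
    """Return (module, reason) pairs for each httpd module-load failure found."""
    return _CANNOT_LOAD.findall(log_text)
```

An empty result does not mean httpd started; it only means this particular failure pattern was not found in the log.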