Closed behnle closed 1 week ago
Hi @behnle!
This seems like a hard problem to track down. I cannot reproduce anything like this if I e.g. run a local Oasis and use the central authentication, or in our current deployments which are based on 1.3.3 and 1.3.4 and use the central authentication.
I think @blueraft or @markus1978 might know better if anything critical has changed in 1.3.3 that could cause this. Would you know which version of the nomad-fair docker image worked for you previously, or if you have updated your Keycloak version (or was it also 25.0.2 when you had a working NOMAD deployment)? This might help in tracking down the issue.
I don't believe anything changed with regard to Keycloak. @Sideboard was looking into updating the Keycloak version, but we are still using keycloak:16.1.1 in our examples.
Thanks for your replies @lauri-codes @blueraft. The last version that I remember working was a 1.2.2 (?) image (with SHA256 sum 279c097945fe553be09e8f50d0502f20210836eff3d8b5c6b2213f8297b32724)
[root@u-030-s007 nomad]# docker image inspect 279c097945fe
[
  {
    "Id": "sha256:279c097945fe553be09e8f50d0502f20210836eff3d8b5c6b2213f8297b32724",
    "RepoTags": [],
    "RepoDigests": [
      "gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair@sha256:be4b78aa30969cd88b6ca23841a07282496649e0a98b7645b607b072ddf235a2"
    ],
    "Parent": "",
    "Comment": "buildkit.dockerfile.v0",
    "Created": "2024-02-06T15:04:27.110797819+01:00",
    "DockerVersion": "",
    "Author": "",
    "Config": {
      "Hostname": "",
      "Domainname": "",
      "User": "nomad",
      "AttachStdin": false,
      "AttachStdout": false,
      "AttachStderr": false,
      "ExposedPorts": {
        "8000/tcp": {},
        "9000/tcp": {}
      },
      "Tty": false,
      "OpenStdin": false,
      "StdinOnce": false,
      "Env": [
        "PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
        "LANG=C.UTF-8",
        "GPG_KEY=E3FF2839C048B25C084DEBE9B26995E310250568",
        "PYTHON_VERSION=3.9.18",
        "PYTHON_PIP_VERSION=23.0.1",
        "PYTHON_SETUPTOOLS_VERSION=58.1.0",
        "PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/049c52c665e8c5fd1751f942316e0a5c777d304f/public/get-pip.py",
        "PYTHON_GET_PIP_SHA256=7cfd4bdc4d475ea971f1c0710a5953bcc704d171f83c797b9529d9974502fcc6",
        "PYTHONPATH=/app/plugins"
      ],
      "Cmd": [
        "python3"
      ],
      "ArgsEscaped": true,
      "Image": "",
      "Volumes": {
        "/app/.volumes/fs": {}
      },
      "WorkingDir": "/app",
      "Entrypoint": null,
      "OnBuild": null,
      "Labels": null
    },
    "Architecture": "amd64",
    "Os": "linux",
    "Size": 1883886525,
    "GraphDriver": {
      "Data": {
        "LowerDir": "/dockerdata/volumes/overlay2/115e904ebcd76c1bccfcda5549ab1681b895babfbe1968d16afa6677daa3bf26/diff:/dockerdata/volumes/overlay2/dc501ca5cfceb3e0334bf4caa3bccd2f2113a06cc9cb94f259362a9f5726b663/diff:/dockerdata/volumes/overlay2/d676ebd69562462e8b2be0084c697feab1a0426a9446cd5f5cdc402210842722/diff:/dockerdata/volumes/overlay2/ad6d73441a8db834f40746356502198b73ec3debfabd0185ce6ed9ede70f056c/diff:/dockerdata/volumes/overlay2/7294a6e95fc74c2a72da9f485fcb67095896fdb103969faba9971e9dd25e4582/diff:/dockerdata/volumes/overlay2/9d8ec0b0c1ca04bfe0293af096cde23979b7f707892ce372f14635718686645c/diff:/dockerdata/volumes/overlay2/af761b214b08f5ba00d1d01db95ff92213a4557a97a40b10ea5832d240f3151e/diff:/dockerdata/volumes/overlay2/433c9a4a3676ffe5e142deb6b01dff36581174cacf86293054770853c6f03ce9/diff:/dockerdata/volumes/overlay2/3efe44a4d06aa62463c0e0cdb7aa697d80891f174c2320c716c6f0fdf7d4e08d/diff:/dockerdata/volumes/overlay2/f5eb910c7070ddb5cab7071b6c841c80a2aecb28a1f0aeed9a51ad48f08b7c17/diff:/dockerdata/volumes/overlay2/a3e9dabd0f9f4cc114303a898c0bb565e9fab76b5ca4af66a91148912df4ff8b/diff:/dockerdata/volumes/overlay2/005be47edaf1c594bfaaff029c48eaf1f246688f4e81f5f2c54e3003168f0af0/diff:/dockerdata/volumes/overlay2/bd771adf438b9ff7270519229801783985b783cd689d88ea11318296fa360deb/diff",
        "MergedDir": "/dockerdata/volumes/overlay2/88f956e0ebbe0d92b456f097129f166b0deef067b767b2dd0ff65cef9d847b77/merged",
        "UpperDir": "/dockerdata/volumes/overlay2/88f956e0ebbe0d92b456f097129f166b0deef067b767b2dd0ff65cef9d847b77/diff",
        "WorkDir": "/dockerdata/volumes/overlay2/88f956e0ebbe0d92b456f097129f166b0deef067b767b2dd0ff65cef9d847b77/work"
      },
      "Name": "overlay2"
    },
    "RootFS": {
      "Type": "layers",
      "Layers": [
        "sha256:fb1bd2fc52827db4ce719cc1aafd4a035d68bc71183b3bc39014f23e9e5fa256",
        "sha256:da5d55102092b80b04fcb9e6cce42b12f7c53ed72cb1568811576763c9d40786",
        "sha256:c4e334227ccac6bda44f5768a5459ad5f8def8e9bb3df0e5323feffd89b9480b",
        "sha256:087aa9f40b611f4de7ee0079dfd3600cc038b8be247f82c6abf3b99df7a5624d",
        "sha256:18a1e69d7a2d521683b54e7deadc70dbd2b498b68ea2e05115e14c147a5497ff",
        "sha256:a0254b855be6bf5ecad7b09b7b97f28be0b2676e9fbb99b04610318d02cbe279",
        "sha256:e31f05acf7dd708f0f905abfc969cd41eb63afe3a61cc614a61d3be17aec75df",
        "sha256:666aa383b6a2a9c2be8d2f5fa200c20e909ef78b0a9d5f1ce35cb20c9100b100",
        "sha256:fd9f50d931645dab893e2da7c6e77d480117012892612785ffad21f3e57c9b04",
        "sha256:de8b1961466c64fc0577d8a297b3a38a0ee9da8f4ca4dc416f3e2f7acc9ff7c0",
        "sha256:2d93d7cba9761cdc66514bc06946c1ec8724d63ca1632fe15a579d2e679ca7bb",
        "sha256:32867b0930e582286f6909b92d3cfca0acea0cdfa91b5c2aeb2eb66a80b59c1d",
        "sha256:f2772f2d7f778831050d8d449ee9f0a8930cc9fccfcf3bf23766aa78922f6062",
        "sha256:2fb27a2af30200bbeee57f32f116200433ffc2333254894109c642b44739a3ea"
      ]
    },
    "Metadata": {
      "LastTagTime": "0001-01-01T00:00:00Z"
    }
  }
]
I also just downloaded the sample config folder and was able to deploy an 1.3.3 Oasis with Keycloak 16. But this version is out of support for decades (still fighting with an attempt to replace it by Keycloak 25).
The last actions I took were, first, to update Keycloak from 24.0.5 to 25.0.2 (after which all clients were still working). Afterwards I updated NOMAD from 1.2.2 (?)/the previous "latest" to 1.3.3, which then caused said issue. All other clients still work. Unfortunately, the migration steps section does not mention any mandatory steps for going from 1.2.2 to 1.3.3, so I assumed that any mandatory action would be executed in the background on first start. Before doing the update, I did not delete any container volume. Maybe some old leftover data is now causing the issue. Can you tell me which container volumes I can safely delete without losing scientific data? Is there any way to increase the log level and inspect the authentication process (specifically, why the token that is obviously sent never makes it to the web storage as a bearer token)?
It should be possible to increase the Python log level in nomad.yaml (services.console_log_level) to something like DEBUG, but I would assume that the default level of WARNING would already catch any possible problem. I don't think anything that is persisted in the Docker volumes can affect the authentication process, and in general, I would avoid removing any of the volumes if not absolutely necessary.
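To illustrate the difference between the two levels, here is a minimal sketch using only Python's standard logging module. It does not use NOMAD's own logging setup; the logger name and the helper captured_messages are made up for this example. It only shows that a handler at WARNING drops DEBUG records that a handler at DEBUG would keep.

```python
import io
import logging


def captured_messages(level: int) -> list:
    """Log one DEBUG and one WARNING record; return what a handler at `level` emits."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setLevel(level)

    # Hypothetical logger name, unrelated to NOMAD's real loggers.
    logger = logging.getLogger(f"auth-demo-{level}")
    logger.setLevel(logging.DEBUG)  # let the handler decide what is shown
    logger.propagate = False
    logger.addHandler(handler)

    logger.debug("token exchange started")
    logger.warning("token exchange failed")

    logger.removeHandler(handler)
    return stream.getvalue().splitlines()


print(captured_messages(logging.WARNING))
print(captured_messages(logging.DEBUG))
```

If the authentication problem never produces a warning or error on the Python side, only the DEBUG setting has a chance of showing anything related to it.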
My first thought would be that there is some incompatibility between Keycloak 25.0.2 and the docker image for NOMAD 1.3.3 (but maybe even older versions of NOMAD). To try to reproduce the problem locally, we could spin up a Keycloak service with version 25.0.2 alongside the other services in the default docker-compose Oasis setup and see if things break. The other likely option could be some JS authentication lib incompatibility.
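When setting up such a local test, a quick sanity check is to fetch the realm's OIDC discovery document before involving NOMAD at all. The sketch below is an assumption-laden helper, not part of NOMAD: the base URL http://localhost/keycloak and the realm name my-realm are placeholders. It relies on the fact that the legacy WildFly-based Keycloak (e.g. 16.x) serves realms under /auth by default, while the Quarkus-based distribution (17 and later) drops that prefix by default.

```python
import json
import urllib.request


def discovery_url(base_url: str, realm: str, legacy_auth_path: bool = False) -> str:
    """Build the OIDC discovery URL for a Keycloak realm.

    Set legacy_auth_path=True for servers that still use the /auth prefix.
    """
    base = base_url.rstrip("/")
    if legacy_auth_path:
        base += "/auth"
    return f"{base}/realms/{realm}/.well-known/openid-configuration"


def fetch_discovery(url: str, timeout: float = 5.0) -> dict:
    """Download and parse the discovery document (needs a running Keycloak)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Placeholder values; adjust to the deployment under test.
print(discovery_url("http://localhost/keycloak", "my-realm"))
print(discovery_url("http://localhost/keycloak", "my-realm", legacy_auth_path=True))

# With a running server one could then inspect, for example:
#   config = fetch_discovery(discovery_url("http://localhost/keycloak", "my-realm"))
#   print(config["issuer"], config["token_endpoint"])
```

Comparing the issuer and token_endpoint values between a working and a failing Keycloak version can show whether the URLs seen by the browser changed with the upgrade.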
deploy an 1.3.3 Oasis with Keycloak 16. But this version is out of support for decades
I wasn't able to find release cycle information for Keycloak. 16.1.1 was released in January 2022, so I'd be surprised if it were already end-of-life.
Unfortunately, only the latest version of Keycloak receives security fixes (https://github.com/keycloak/keycloak/security/policy#supported-versions), and even if you buy LTS from Red Hat, the oldest version they provide backports for is now 22.x (https://access.redhat.com/articles/7033107). Keycloak has a terribly rapid release cycle; I wish they would spend more time on QA and less time on agile feature development.
That's unfortunate, I'll check with Sascha about updating to v25 and let you know if we're able to fix the compatibility issue.
While I am still unable to explain or solve the issue, I can at least provide you with a set of config files for an MWE that reproduces it. The example is adapted from here. Strip the .txt extension when replacing the files from the original example. The realm import mechanism of Keycloak has been overhauled since v16, hence I placed the realm to be imported in a subdirectory below "configs". Credentials are still "admin" / "password". The MWE behaves as in the original report: you are successfully redirected to Keycloak for SSO authorization and then redirected back, but NOMAD still treats you as if you were not logged in.
docker-compose.yaml.txt nginx.conf.txt nomad.yaml.txt nomad-realm.json
Thank you for the MWE. I am able to reproduce this.
One thing I've noticed is that the keycloak response to authorization requests does not include an Authorization cookie. Can you confirm if it's the same for you?
With v25:
With central NOMAD:
Indeed, I can confirm that the web storage does not contain an authorization cookie after authorization:
Pre-login:
After login:
But there are clearly authorization and code-to-token events logged in KC (you first have to turn on user event logging):
The question now is where these get lost, because further down in the browser console there is the following POST event:
with the cookie tab being,
the request
and the reply
The call stack of this POST is
I personally would interpret this to mean that KC does indeed send a bearer token, but for some obscure reason there is no cookie afterwards. I have to admit, though, that this is way outside my comfort zone.
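In the standard OIDC authorization code flow, the tokens come back in the JSON body of the response to the token request, so one way to check whether Keycloak really returned a usable token is to copy the access_token value from that response in the browser's network tab and look at its claims. The following is a minimal sketch for doing that offline; it decodes the payload without verifying the signature and is meant for inspection only. The sample token, realm URL, and client name are made up, and how NOMAD's frontend stores the token is not covered here.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Return the unverified claims of a JWT (inspection only, never for auth decisions)."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated parts")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore the padding that JWTs strip
    return json.loads(base64.urlsafe_b64decode(payload))


def _b64url(data: dict) -> str:
    """Encode a dict the way JWT segments are encoded (base64url, no padding)."""
    raw = json.dumps(data).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Made-up token for demonstration; the signature part is a dummy string.
sample = ".".join([
    _b64url({"alg": "none"}),
    _b64url({"iss": "http://localhost/keycloak/realms/my-realm", "azp": "my-client"}),
    "signature",
])
claims = decode_jwt_payload(sample)
print(claims["iss"], claims["azp"])
```

If the iss claim of a real token does not match the server URL the client was configured with, a client library may reject the token even though Keycloak logged a successful code-to-token event.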
Thank you for confirming, I'll take a look tomorrow with v1.2 to see if we are doing something differently there.
I used the same docker compose file with NOMAD v1.2.1, and it doesn't work there either: the same issue, with no Authorization cookie being sent back in the response headers by Keycloak. I'd imagine something changed on the Keycloak side, then. Does v1.2.1 work for you with the same docker compose file?
If I understood correctly, @behnle already tested that with Keycloak 24.0.5 everything worked fine in combination with NOMAD 1.3.3. So I would assume that something happened in the transition from 24.0.5 to 25.0.2.
It might be worthwhile to check if v24.0.2 works, and then check the keycloak changelogs. Maybe 25.0.2 needs us to update our JS keycloak version (keycloak-js in package.json)?
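To see which keycloak-js version range a frontend declares, one can read it straight out of package.json. The helper below is a small sketch, not part of NOMAD; the manifest text and its version numbers are made up for demonstration and do not reflect NOMAD's actual dependencies.

```python
import json


def dependency_version(package_json_text: str, name: str):
    """Return the declared version range of `name` from package.json text, or None."""
    manifest = json.loads(package_json_text)
    for section in ("dependencies", "devDependencies"):
        version = manifest.get(section, {}).get(name)
        if version is not None:
            return version
    return None


# Made-up manifest; in practice, read the text from the frontend's package.json.
example = '{"dependencies": {"keycloak-js": "^1.0.0", "react": "^17.0.0"}}'
print(dependency_version(example, "keycloak-js"))
```

Note that this reports the declared range only; the version actually installed is recorded in the lock file and may be newer within that range.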
@blueraft I didn't try NOMAD 1.2.1 yet, that's for sure worth a try. Just have to figure how to pull it from your registry.
@lauri-codes Yes and no (NOMAD continued to work in my production environment after updating KC to 24.0.5), but I just managed to get the MWE up and running. With the following Keycloak settings, and after dropping the KC healthcheck in the other containers, it seems to work:
  # keycloak user management
  keycloak:
    restart: unless-stopped
    #image: quay.io/keycloak/keycloak:16.1.1
    image: quay.io/keycloak/keycloak:24.0.5
    container_name: nomad_oasis_keycloak
    environment:
      - TZ=Europe/Berlin
      - PROXY_ADDRESS_FORWARDING=true
      - KEYCLOAK_ADMIN=admin
      - KEYCLOAK_ADMIN_PASSWORD=password
      - KEYCLOAK_USER=admin
      - KEYCLOAK_PASSWORD=password
      # - KEYCLOAK_FRONTEND_URL=http://localhost/keycloak/auth
      - KC_HOSTNAME_STRICT=false
      - KC_HTTP_ENABLED=true
      - KC_HTTP_PORT=8080
      - KC_PROXY=edge
      #- KC_LOG_LEVEL=DEBUG
      # - KC_HOSTNAME=http://localhost/keycloak/
      - KC_HOSTNAME_URL=http://localhost/keycloak/
      #- KEYCLOAK_IMPORT=/opt/keycloak/data/import/nomad-realm.json -Dkeycloak.profile.feature.upload_scripts=enabled"
      - KEYCLOAK_EXTRA_ARGS_PREPENDED="--proxy-headers xforwarded --hostname-debug=true --http-enabled true --health-enabled=true --verbose"
      #- KEYCLOAK_EXTRA_ARGS="--import-realm --verbose"
    command: start-dev --import-realm
    #- "-Dkeycloak.import=/opt/keycloak/data/import -Dkeycloak.migration.strategy=IGNORE_EXISTING"
    # start-dev --import-realm
    volumes:
      - keycloak:/opt/keycloak/data
      - ./configs/keycloak-import/:/opt/keycloak/data/import:ro
    # healthcheck:
    #   #test:
    #   test: ["CMD-SHELL", "exec 3<>/dev/tcp/127.0.0.1/9000;echo -e 'GET /health/ready HTTP/1.1\r\nhost: http://localhost\r\nConnection: close\r\n\r\n' >&3;if [ $? -eq 0 ]; then echo 'Healthcheck Successful';exit 0;else echo 'Healthcheck Failed';exit 1;fi;"]
    #   # - "CMD"
    #   # - "curl"
    #   # - "--fail"
    #   # - "--silent"
    #   # - "http://127.0.0.1:9990/health/live"
    #   # - "http://keycloak:9000/health/live"
    #interval: 10s
    #timeout: 10s
    #retries: 30
    #start_period: 30s
i.e. i am able to perform SSO login. I did not dive that deep into the NOMAD code, but if you use a js keycloak package and not just a generic OIDC library, it might well require an update. Unfortunately, KC developers tend to break compatibility on a daily basis.
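With the container healthcheck dropped, readiness can still be checked by hand from the host. The sketch below polls a health URL until it answers with HTTP 200. It is an illustration only: the URL is an assumption modelled on the /health/ready path in the commented-out healthcheck, and which port serves the health endpoints depends on the Keycloak version and configuration.

```python
import time
import urllib.error
import urllib.request


def is_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and non-2xx answers.
        return False


def wait_until_ready(url: str, attempts: int = 30, delay: float = 1.0) -> bool:
    """Poll `url` up to `attempts` times, sleeping `delay` seconds in between."""
    for _ in range(attempts):
        if is_ready(url):
            return True
        time.sleep(delay)
    return False


# Example call (assumed URL; adjust host, port, and path to your setup):
#   wait_until_ready("http://127.0.0.1:9000/health/ready")
```

Running this before starting the app and worker containers gives roughly the ordering guarantee that the compose healthcheck dependency would have provided.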
Just have to figure how to pull it from your registry.
In the docker compose file, this would be for the app and the worker:
image: gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:v1.2.1
i.e. i am able to perform SSO login.
Good to know this works, probably just some breaking change from 24 to 25 then.
Just checked the 1.3.4 image mentioned in https://github.com/nomad-coe/nomad/issues/107#issuecomment-2258064124, unfortunately, the problem still seems to persist with exactly the same symptoms.
Further diagnostics: The version of the python-keycloak package that is bundled with the nomad 1.3.4 image (4.2.0) should be compatible with Keycloak 25.x, see:
nomad@7c6efa176f81:/app$ pip3 show python-keycloak
WARNING: The directory '/home/nomad/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.
Name: python-keycloak
Version: 4.2.0
Summary: python-keycloak is a Python package providing access to the Keycloak API.
Home-page:
Author: Marcos Pereira
Author-email: marcospereira.mpj@gmail.com
License: MIT
Location: /usr/local/lib/python3.9/site-packages
Requires: async-property, deprecation, httpx, jwcrypto, requests, requests-toolbelt
Required-by: nomad-lab
(it's rather impressive that it also works with Keycloak 16...) Unfortunately, I was unable to identify the JavaScript source that is responsible for performing the OIDC SSO login. If you could give me a hint, I might be able to further track down the root cause of the reported behaviour.
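The same version check can be done from inside Python, which avoids the pip cache warning shown above. This is a minimal sketch using the standard library's importlib.metadata; the helper name installed_version is made up for this example.

```python
from importlib import metadata


def installed_version(distribution: str):
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Inside the NOMAD container this should report the bundled python-keycloak version;
# elsewhere it prints None if the package is absent.
print(installed_version("python-keycloak"))
```

Keep in mind that python-keycloak only covers the backend side; the browser login goes through the JavaScript client, which is versioned separately.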
Can you try the following image? I have updated the Keycloak JS library, and it seems to work for me locally.
gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:keycloak-update
Hi @blueraft,
excellent, this essentially seems to fix the issue with not being able to log in with KC 25.0.2. Big thanks!
Would you mind sharing what exactly you changed (which JS file, as I was unable to locate it in the container or the git repo) and which NOMAD version the image is based on ("about" says version: 1.3.5.dev76+g8151ab3e8)?
In other words, is this image suitable for being deployed in my production environment without making a mess of my users' data and breaking the ability to apply further updates?
This is not suitable for a production environment yet. The merge request has yet to be reviewed and merged, but I can provide an update tomorrow.
Here's the diff if you're interested but I'd recommend waiting till it's merged before trying it in production.
Thanks @blueraft for looking into this :+1: I will review it and then we test it on one of our deployments to see if it breaks anything with our keycloak version. If everything goes smoothly, this will be a part of 1.3.5.
Thanks for the heads-up! Better take care not to break nomad-lab.eu ... :grin:
The fix for keycloak 25 compatibility (we needed to update the JS library) is now part of version 1.3.5.
Thanks for fixing. I can confirm that it also works with the 1.3.6 release containers that I recently installed.
Dear NOMAD developers, I operate a NOMAD Oasis with decentralized user management. After updating NOMAD to version 1.3.3 (the latest docker image), I am unable to log into my Oasis. The setup is as follows:
Observations:
The setup used to work flawlessly until I updated NOMAD to v1.3.3
NOMAD stack:
images:
The (redacted) keycloak part of nomad.yaml:
There are no obvious errors in the docker-compose logs of Keycloak or NOMAD, there are no errors in the Keycloak GUI, and there are no errors in the browser console; it just looks as if NOMAD does not set the session cookie. Have there been any changes in NOMAD from 1.2 to 1.3 which would require a reconfiguration of the client settings in Keycloak? What can I do to further track down the root cause of the issue? The only possibly relevant warning is the following:
If it helps, I can also provide you with the client settings in Keycloak.