nomad-coe / nomad

NOMAD lets you manage and share your materials science data in a way that makes it truly useful to you, your group, and the community.
https://nomad-lab.eu
Apache License 2.0

Oasis operation broken after update to 1.3.3 #108

Closed behnle closed 1 week ago

behnle commented 1 month ago

Dear NOMAD developers, I operate a NOMAD Oasis with decentralized user management. After updating NOMAD to version 1.3.3 / the latest Docker image, I am unable to log into my Oasis. The setup is as follows:

Observations:

If it helps, I can also provide you with the client settings in Keycloak.

lauri-codes commented 1 month ago

Hi @behnle!

This seems like a hard problem to track down. I cannot reproduce anything like this if I e.g. run a local Oasis and use the central authentication, or in our current deployments which are based on 1.3.3 and 1.3.4 and use the central authentication.

I think @blueraft or @markus1978 might know better if anything critical has changed in 1.3.3 that could cause this. Would you know which version of the nomad-fair docker image worked for you previously, or if you have updated your Keycloak version (or was it also 25.0.2 when you had a working NOMAD deployment)? This might help in tracking down the issue.

blueraft commented 1 month ago

I don't believe anything changed with regard to Keycloak. @Sideboard was looking into updating the Keycloak version, but we are still using keycloak:16.1.1 in our examples.

behnle commented 1 month ago

Thanks for your replies @lauri-codes @blueraft. The last version that I remember working was a 1.2.2 (?) image (with SHA256 sum 279c097945fe553be09e8f50d0502f20210836eff3d8b5c6b2213f8297b32724).

docker image inspect
[root@u-030-s007 nomad]# docker image inspect 279c097945fe
[
    {
        "Id": "sha256:279c097945fe553be09e8f50d0502f20210836eff3d8b5c6b2213f8297b32724",
        "RepoTags": [],
        "RepoDigests": [
            "gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair@sha256:be4b78aa30969cd88b6ca23841a07282496649e0a98b7645b607b072ddf235a2"
        ],
        "Parent": "",
        "Comment": "buildkit.dockerfile.v0",
        "Created": "2024-02-06T15:04:27.110797819+01:00",
        "DockerVersion": "",
        "Author": "",
        "Config": {
            "Hostname": "",
            "Domainname": "",
            "User": "nomad",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "ExposedPorts": {
                "8000/tcp": {},
                "9000/tcp": {}
            },
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
                "LANG=C.UTF-8",
                "GPG_KEY=E3FF2839C048B25C084DEBE9B26995E310250568",
                "PYTHON_VERSION=3.9.18",
                "PYTHON_PIP_VERSION=23.0.1",
                "PYTHON_SETUPTOOLS_VERSION=58.1.0",
                "PYTHON_GET_PIP_URL=https://github.com/pypa/get-pip/raw/049c52c665e8c5fd1751f942316e0a5c777d304f/public/get-pip.py",
                "PYTHON_GET_PIP_SHA256=7cfd4bdc4d475ea971f1c0710a5953bcc704d171f83c797b9529d9974502fcc6",
                "PYTHONPATH=/app/plugins"
            ],
            "Cmd": [
                "python3"
            ],
            "ArgsEscaped": true,
            "Image": "",
            "Volumes": {
                "/app/.volumes/fs": {}
            },
            "WorkingDir": "/app",
            "Entrypoint": null,
            "OnBuild": null,
            "Labels": null
        },
        "Architecture": "amd64",
        "Os": "linux",
        "Size": 1883886525,
        "GraphDriver": {
            "Data": {
                "LowerDir": "/dockerdata/volumes/overlay2/115e904ebcd76c1bccfcda5549ab1681b895babfbe1968d16afa6677daa3bf26/diff:/dockerdata/volumes/overlay2/dc501ca5cfceb3e0334bf4caa3bccd2f2113a06cc9cb94f259362a9f5726b663/diff:/dockerdata/volumes/overlay2/d676ebd69562462e8b2be0084c697feab1a0426a9446cd5f5cdc402210842722/diff:/dockerdata/volumes/overlay2/ad6d73441a8db834f40746356502198b73ec3debfabd0185ce6ed9ede70f056c/diff:/dockerdata/volumes/overlay2/7294a6e95fc74c2a72da9f485fcb67095896fdb103969faba9971e9dd25e4582/diff:/dockerdata/volumes/overlay2/9d8ec0b0c1ca04bfe0293af096cde23979b7f707892ce372f14635718686645c/diff:/dockerdata/volumes/overlay2/af761b214b08f5ba00d1d01db95ff92213a4557a97a40b10ea5832d240f3151e/diff:/dockerdata/volumes/overlay2/433c9a4a3676ffe5e142deb6b01dff36581174cacf86293054770853c6f03ce9/diff:/dockerdata/volumes/overlay2/3efe44a4d06aa62463c0e0cdb7aa697d80891f174c2320c716c6f0fdf7d4e08d/diff:/dockerdata/volumes/overlay2/f5eb910c7070ddb5cab7071b6c841c80a2aecb28a1f0aeed9a51ad48f08b7c17/diff:/dockerdata/volumes/overlay2/a3e9dabd0f9f4cc114303a898c0bb565e9fab76b5ca4af66a91148912df4ff8b/diff:/dockerdata/volumes/overlay2/005be47edaf1c594bfaaff029c48eaf1f246688f4e81f5f2c54e3003168f0af0/diff:/dockerdata/volumes/overlay2/bd771adf438b9ff7270519229801783985b783cd689d88ea11318296fa360deb/diff",
                "MergedDir": "/dockerdata/volumes/overlay2/88f956e0ebbe0d92b456f097129f166b0deef067b767b2dd0ff65cef9d847b77/merged",
                "UpperDir": "/dockerdata/volumes/overlay2/88f956e0ebbe0d92b456f097129f166b0deef067b767b2dd0ff65cef9d847b77/diff",
                "WorkDir": "/dockerdata/volumes/overlay2/88f956e0ebbe0d92b456f097129f166b0deef067b767b2dd0ff65cef9d847b77/work"
            },
            "Name": "overlay2"
        },
        "RootFS": {
            "Type": "layers",
            "Layers": [
                "sha256:fb1bd2fc52827db4ce719cc1aafd4a035d68bc71183b3bc39014f23e9e5fa256",
                "sha256:da5d55102092b80b04fcb9e6cce42b12f7c53ed72cb1568811576763c9d40786",
                "sha256:c4e334227ccac6bda44f5768a5459ad5f8def8e9bb3df0e5323feffd89b9480b",
                "sha256:087aa9f40b611f4de7ee0079dfd3600cc038b8be247f82c6abf3b99df7a5624d",
                "sha256:18a1e69d7a2d521683b54e7deadc70dbd2b498b68ea2e05115e14c147a5497ff",
                "sha256:a0254b855be6bf5ecad7b09b7b97f28be0b2676e9fbb99b04610318d02cbe279",
                "sha256:e31f05acf7dd708f0f905abfc969cd41eb63afe3a61cc614a61d3be17aec75df",
                "sha256:666aa383b6a2a9c2be8d2f5fa200c20e909ef78b0a9d5f1ce35cb20c9100b100",
                "sha256:fd9f50d931645dab893e2da7c6e77d480117012892612785ffad21f3e57c9b04",
                "sha256:de8b1961466c64fc0577d8a297b3a38a0ee9da8f4ca4dc416f3e2f7acc9ff7c0",
                "sha256:2d93d7cba9761cdc66514bc06946c1ec8724d63ca1632fe15a579d2e679ca7bb",
                "sha256:32867b0930e582286f6909b92d3cfca0acea0cdfa91b5c2aeb2eb66a80b59c1d",
                "sha256:f2772f2d7f778831050d8d449ee9f0a8930cc9fccfcf3bf23766aa78922f6062",
                "sha256:2fb27a2af30200bbeee57f32f116200433ffc2333254894109c642b44739a3ea"
            ]
        },
        "Metadata": {
            "LastTagTime": "0001-01-01T00:00:00Z"
        }
    }
]

I also just downloaded the sample config folder and was able to deploy an 1.3.3 Oasis with Keycloak 16. But this version is out of support for decades (still fighting with an attempt to replace it by Keycloak 25).

The last actions I took were first to update Keycloak from 24.0.5 to 25.0.2 (after which all clients were still working). Afterwards I updated NOMAD from 1.2.2 (?)/the previous "latest" to 1.3.3, which then caused said issue. All other clients still work.

Unfortunately, the migration steps section does not mention any mandatory steps for going from 1.2.2 to 1.3.3, so I assumed that any mandatory action would be executed in the background on first start. Before doing the update, I did not delete any container volume. Maybe there is now some old data left over which causes the issue.

Can you tell me which container volume I can safely delete without losing scientific data? Is there any possibility to increase the log level and inspect the authentication process (specifically why the token that is obviously sent never makes it to the web storage as a bearer token)?

lauri-codes commented 1 month ago

It should be possible to increase the Python log level in nomad.yaml (services.console_log_level) to something like DEBUG, but I would assume that the default level of WARNING would already catch any possible problem. I don't think anything that is persisted in the Docker volumes can affect the authentication process, and in general, I would avoid removing any of the volumes if not absolutely necessary.
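
As a minimal sketch of the setting mentioned above, the relevant fragment of nomad.yaml might look like this. Only the key `services.console_log_level` and the value `DEBUG` come from this thread; where the file lives in a given Oasis setup, and that the services have to be restarted for the change to take effect, are assumptions.

```yaml
# nomad.yaml (fragment): raise the Python console log level.
# The default is WARNING; DEBUG is considerably more verbose.
services:
  console_log_level: DEBUG
```

The extra output should then show up in the container logs of the NOMAD services.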

My first thought would be that there is some incompatibility between Keycloak 25.0.2 and the Docker image for NOMAD 1.3.3 (and maybe even older versions of NOMAD). To try and reproduce the problem locally, we could spin up a Keycloak service with version 25.0.2 alongside the other services in the default docker-compose Oasis setup and see if things break. The other likely option could be some JS authentication lib incompatibility.

blueraft commented 1 month ago

> deploy an 1.3.3 Oasis with Keycloak 16. But this version is out of support for decades

I wasn't able to find release cycle date info for Keycloak. 16.1.1 was released in Jan 2022, so I'd be surprised if it were already end-of-life.

behnle commented 1 month ago

Unfortunately, only the latest version of Keycloak receives security fixes (https://github.com/keycloak/keycloak/security/policy#supported-versions), and even if you buy LTS from Red Hat, the oldest version they provide backports for is now 22.x (https://access.redhat.com/articles/7033107). Keycloak has a terribly rapid release cycle; I wish they would spend more time on QA and less time on agile feature development.

blueraft commented 1 month ago

That's unfortunate, I'll check with Sascha about updating to v25 and let you know if we're able to fix the compatibility issue.

behnle commented 1 month ago

While I am still unable to explain and solve the issue, I can at least provide you with a set of config files for an MWE that reproduces the issue. The example is adapted from here. Strip the .txt extension for usage when replacing the files from the original example. The realm import mechanism of Keycloak has been overhauled since v16, hence I placed the realm to be imported in a subdirectory below "configs". Credentials are still "admin" "password".

The MWE behaves as in the original report, namely that you are successfully redirected to Keycloak for SSO authorization and then redirected back, but NOMAD still treats you as if you were not logged in.

Attachments: docker-compose.yaml.txt, nginx.conf.txt, nomad.yaml.txt, nomad-realm.json

blueraft commented 1 month ago

Thank you for the MWE. I am able to reproduce this.

One thing I've noticed is that the keycloak response to authorization requests does not include an Authorization cookie. Can you confirm if it's the same for you?

With v25:

[Screenshot 2024-07-25 at 13 04 17]

With central NOMAD:

[Screenshot 2024-07-25 at 13 04 38]

behnle commented 1 month ago

Indeed, I can confirm that the web storage does not contain an authorization cookie after authorization:

Pre-login: [screenshot: pre-login]
After login: [screenshot: post-login]

But there are clearly authorization and code-to-token events logged in KC: [screenshot: events] (you have to first turn on user event logging).

The question now is where these get lost, because further down in the browser console there is the following POST event: [screenshot: header], with the cookie tab being [screenshot: cookie], the request [screenshot: request], and the reply [screenshot: reply]. The call stack of this POST is [screenshot: callstack].

Personally, I would interpret it such that KC does indeed send a bearer token, but for some obscure reason there is no cookie afterwards. I have to admit, though, that this is way outside my comfort zone.

blueraft commented 1 month ago

Thank you for confirming, I'll take a look tomorrow with v1.2 to see if we are doing something differently there.

blueraft commented 1 month ago

I've used the same docker compose file with NOMAD v1.2.1 and it doesn't work there either. Same issue: no Authorization cookie is sent back in the response headers by Keycloak. So I'd imagine something changed on the Keycloak side. Does v1.2.1 work for you with the same docker compose file?

lauri-codes commented 1 month ago

If I understood correctly, @behnle already tested that with Keycloak 24.0.5 everything worked fine in combination with NOMAD 1.3.3. So I would assume that something happened in the transition from 24.0.5 to 25.0.2.

It might be worthwhile to check if v24.0.2 works, and then check the keycloak changelogs. Maybe 25.0.2 needs us to update our JS keycloak version (keycloak-js in package.json)?

behnle commented 1 month ago

@blueraft I didn't try NOMAD 1.2.1 yet, that's for sure worth a try. Just have to figure how to pull it from your registry.

@lauri-codes Yes and no (NOMAD continued to work in my production environment after updating KC to 24.0.5), but I just managed to get the MWE up and running. With the following Keycloak settings, and after dropping the KC healthcheck in the other containers, it seems to work:

  # keycloak user management
  keycloak:
    restart: unless-stopped
    #image: quay.io/keycloak/keycloak:16.1.1
    image: quay.io/keycloak/keycloak:24.0.5
    container_name: nomad_oasis_keycloak
    environment:
      - TZ=Europe/Berlin
      - PROXY_ADDRESS_FORWARDING=true
      - KEYCLOAK_ADMIN=admin
      - KEYCLOAK_ADMIN_PASSWORD=password
      - KEYCLOAK_USER=admin
      - KEYCLOAK_PASSWORD=password
      #      - KEYCLOAK_FRONTEND_URL=http://localhost/keycloak/auth
      - KC_HOSTNAME_STRICT=false
      - KC_HTTP_ENABLED=true
      - KC_HTTP_PORT=8080
      - KC_PROXY=edge
      #- KC_LOG_LEVEL=DEBUG
      #      - KC_HOSTNAME=http://localhost/keycloak/
      - KC_HOSTNAME_URL=http://localhost/keycloak/
      #- KEYCLOAK_IMPORT=/opt/keycloak/data/import/nomad-realm.json -Dkeycloak.profile.feature.upload_scripts=enabled"
      - KEYCLOAK_EXTRA_ARGS_PREPENDED="--proxy-headers xforwarded --hostname-debug=true --http-enabled true --health-enabled=true --verbose"
      #- KEYCLOAK_EXTRA_ARGS="--import-realm --verbose"
    command: start-dev --import-realm
      #- "-Dkeycloak.import=/opt/keycloak/data/import -Dkeycloak.migration.strategy=IGNORE_EXISTING"
      #      start-dev --import-realm
    volumes:
      - keycloak:/opt/keycloak/data
      - ./configs/keycloak-import/:/opt/keycloak/data/import:ro
      # healthcheck:
      #   #test:
      #   test: ["CMD-SHELL", "exec 3<>/dev/tcp/127.0.0.1/9000;echo -e 'GET /health/ready HTTP/1.1\r\nhost: http://localhost\r\nConnection: close\r\n\r\n' >&3;if [ $? -eq 0 ]; then echo 'Healthcheck Successful';exit 0;else echo 'Healthcheck Failed';exit 1;fi;"]
      #   #  - "CMD"
      #   #  - "curl"
      #   #  - "--fail"
      #   #  - "--silent"
      #   #  - "http://127.0.0.1:9990/health/live"
      #   #  - "http://keycloak:9000/health/live"
      #interval: 10s
      #timeout: 10s
      #retries: 30
      #start_period: 30s

i.e. i am able to perform SSO login. I did not dive that deep into the NOMAD code, but if you use a js keycloak package and not just a generic OIDC library, it might well require an update. Unfortunately, KC developers tend to break compatibility on a daily basis.

blueraft commented 1 month ago

> Just have to figure how to pull it from your registry.

In the docker compose file, this would be for the app and the worker:

    image: gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:v1.2.1
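
In context, a sketch of how this could look in the compose file is shown below. The service names `app` and `worker` are assumed to match the example Oasis docker-compose.yaml, and all other keys of those services are left out here.

```yaml
# docker-compose.yaml (fragment): pin both NOMAD services to the same tag.
services:
  app:
    image: gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:v1.2.1
  worker:
    image: gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:v1.2.1
```

Keeping both services on the same tag avoids running the API and the worker on mismatched NOMAD versions.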

> i.e. i am able to perform SSO login.

Good to know this works, probably just some breaking change from 24 to 25 then.

behnle commented 1 month ago

Just checked the 1.3.4 image mentioned in https://github.com/nomad-coe/nomad/issues/107#issuecomment-2258064124, unfortunately, the problem still seems to persist with exactly the same symptoms.

behnle commented 1 month ago

Further diagnostics: The version of the python-keycloak package that is bundled with the NOMAD 1.3.4 image (4.2.0) should be compatible with Keycloak 25.x, see:

nomad@7c6efa176f81:/app$ pip3 show python-keycloak
WARNING: The directory '/home/nomad/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you should use sudo's -H flag.
Name: python-keycloak
Version: 4.2.0
Summary: python-keycloak is a Python package providing access to the Keycloak API.
Home-page: 
Author: Marcos Pereira
Author-email: marcospereira.mpj@gmail.com
License: MIT
Location: /usr/local/lib/python3.9/site-packages
Requires: async-property, deprecation, httpx, jwcrypto, requests, requests-toolbelt
Required-by: nomad-lab

(It's rather impressive that it also works with Keycloak 16...) Unfortunately, I was unable to identify the JavaScript source that is responsible for performing the OIDC SSO login. If you could give me a hint, I might be able to further track down the root cause of the reported behaviour.

blueraft commented 1 month ago

Can you try the following image? I have updated the Keycloak JS library and it seems to work for me locally.

gitlab-registry.mpcdf.mpg.de/nomad-lab/nomad-fair:keycloak-update

behnle commented 1 month ago

Hi @blueraft, excellent, this essentially seems to fix the issue of not being able to log in with KC 25.0.2. Big thanks! Would you mind revealing what exactly you changed (which JS file, as I was unable to locate it in the container or the git repo) and on which NOMAD version the image is based ("about" says version: 1.3.5.dev76+g8151ab3e8)? In other words, is this image suitable for being deployed in my production environment without making a mess of my users' data and breaking the ability to apply further updates?

blueraft commented 1 month ago

This is not suitable for a production environment yet. The merge request is yet to be reviewed and merged, but I can provide an update tomorrow.

Here's the diff if you're interested, but I'd recommend waiting until it's merged before trying it in production.

lauri-codes commented 1 month ago

Thanks @blueraft for looking into this :+1: I will review it, and then we will test it on one of our deployments to see if it breaks anything with our Keycloak version. If everything goes smoothly, this will be part of 1.3.5.

behnle commented 1 month ago

Thanks for the heads-up! Better take care not to break nomad-lab.eu ... :grin:

lauri-codes commented 1 week ago

The fix for keycloak 25 compatibility (we needed to update the JS library) is now part of version 1.3.5.

behnle commented 1 week ago

Thanks for fixing. I can confirm that it also works with the 1.3.6 release containers that I recently installed.