netbirdio / netbird

Connect your devices into a secure WireGuard®-based overlay network with SSO, MFA and granular access controls.
https://netbird.io
BSD 3-Clause "New" or "Revised" License

Netbird can't query users when using newer versions than Zitadel 2.61.0 #2616

Open Kidswiss opened 2 days ago

Kidswiss commented 2 days ago

Describe the problem

After updating Zitadel to 2.61.2 or anything newer, Netbird can no longer query the Zitadel user endpoint.

To Reproduce

Steps to reproduce the behavior:

  1. Install Netbird v0.29.3
  2. Install Zitadel 2.61.1 or newer
  3. Get 403 during login

Expected behavior

Zitadel integration should still work if it gets updated.

Are you using NetBird Cloud?

Selfhosted

NetBird version

0.29.3

Additional context


Netbird management logs

2024-09-18T14:31:43Z WARN [context: SYSTEM] management/server/account.go:1017: failed warming up cache due to error: unable to post https://idp.secret.ch/management/v1/users/_search, statusCode 403

Zitadel log entries:

time="2024-09-18T14:36:52Z" level=warning msg="token verifier repo: decrypt access token" caller="/home/runner/work/zitadel/zitadel/internal/authz/repository/eventsourcing/eventstore/token_verifier.go:283" error="ID=APP-ASdgg Message=invalid token"

I've tried re-creating the service account secret, but the error persisted. I'm not sure whether this is an issue on Zitadel's side or on Netbird's, but since Netbird is the only app I had issues with, I opened the bug here.

landmass-deftly-reptile-budget commented 2 days ago

Can confirm the same issue with Zitadel 2.62.1 and Netbird 0.29.3. Additional logs from the netbird-management container:

ERRO [requestID: 098374cd-f244-4be6-91f4-9b3e02fb292f, context: HTTP] management/server/http/util/util.go:81: got a handler error: token invalid

ERRO [context: HTTP, requestID: 098374cd-f244-4be6-91f4-9b3e02fb292f] management/server/http/middleware/auth_middleware.go:89: Error when validating JWT claims: unable to post https://bla.blabla.com/management/v1/users/_search, statusCode 403

The logs in the Zitadel container are identical to those above.

It worked for months across several version combinations of Netbird and Zitadel. I am usually quite fast with updates and had no issues until the latest updates of Netbird and Zitadel. So I guess something changed in either Netbird or Zitadel in the last 1-2 releases, and that is the root cause of this issue.

bcmmbaga commented 1 day ago

I see that Zitadel released v2.62.1 two days ago, but they have now marked v2.59.3 as the latest version. Could you try using v2.59.3 (latest) for now or rollback to the previous version that was working for you?

In the meantime, we will run tests to confirm the breaking changes and update the NetBird Zitadel implementation accordingly.

landmass-deftly-reptile-budget commented 1 day ago

Zitadel tagging version 2.59.3 as "latest" is surely a mistake on their side. See https://github.com/zitadel/zitadel/releases. They have updated several versions in the last few days (from 2.54.x to 2.62.x), all with the same three bug fixes mentioned there.

adasauce commented 1 day ago

I just wanted to follow up with both a "me too" and some info from the zitadel side. the events history does say a token was created and authenticated properly for me. so it appears to be some kind of permission issue just with the netbird user accessing that endpoint.

This was all working previously for many months.

I have some experience writing integrations with zitadel, I'll poke around to see what netbird is calling vs. what the api is expecting.

edit:

I added some extra logging and error response parsing into the management server and zitadel is responding with:

failed warming up cache due to error: zitadel error code: 7 message: could not read projectid by clientid (AUTH-GHpw2)

will continue poking around
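For anyone who wants the same detail in their logs, here is a minimal sketch of that kind of error-response parsing. It assumes the error body is JSON with `code` and `message` fields, which matches the message above but is not taken from the NetBird or Zitadel source; `apiError` and `describeFailure` are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// apiError mirrors the ASSUMED shape of the error body. The field names
// are a guess based on the logged message, not a documented contract.
type apiError struct {
	Code    int    `json:"code"`
	Message string `json:"message"`
}

// describeFailure turns a non-2xx response into a readable error.
// If the body does not match the assumed shape, it falls back to
// reporting only the HTTP status code.
func describeFailure(statusCode int, body []byte) error {
	var e apiError
	if err := json.Unmarshal(body, &e); err != nil || e.Message == "" {
		return fmt.Errorf("request failed, statusCode %d", statusCode)
	}
	return fmt.Errorf("zitadel error code: %d message: %s", e.Code, e.Message)
}

func main() {
	body := []byte(`{"code": 7, "message": "could not read projectid by clientid (AUTH-GHpw2)"}`)
	fmt.Println(describeFailure(403, body))
}
```

The fallback keeps the old "statusCode 403" behaviour whenever the body is empty or not JSON, so the extra parsing cannot hide the original failure.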

edit2:

so it looks like the client id we're using to authenticate ("netbird", per the docs) and the client secret are getting encoded into the JWT returned from zitadel, and we're using that client id "netbird" to make requests.

zitadel, on the other hand, is doing some work to verify the access token, and they're looking up the client_id from the access token we pass in. they look up that client_id in the registered apps list to see which app and project it should belong to. but "netbird" isn't the client id of the app, it's 234872394...@netbird.

however if we use that client id to perform the management query, they're logging this error:

oidc_error.parent="ID=QUERY-Dfbg2 Message=Errors.User.NotFound Parent=(sql: no rows in result set)" oidc_error.description="client not found" oidc_error.type=invalid_client status_code=400

there's definitely some confusion happening on what credentials should be used

adasauce commented 23 hours ago

another follow-up:

I added a PAT for the netbird user and made changes to the management service, overloading the ClientSecret and Authenticate method to just make a pretend JWT with the AccessToken being the PAT, instead of authenticating to get a JWT. Everything seems to be working fine this way, since it just concatenates Bearer + accessToken to assemble the header before a request is made.

I think it would be a relatively simple change to just use a PAT and refactor the config a bit if we want to swerve this issue. I'll keep tweaking configurations and hacking on both sides to see if I can find the real cause though.

In the meantime at least my management service is back online :)
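A rough sketch of the PAT idea, outside of the NetBird code base: the request to the users/_search endpoint carries the PAT directly as the bearer token. `newUserSearchRequest` is a hypothetical helper, the host is a placeholder, and the empty JSON query body is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// newUserSearchRequest builds (but does not send) a POST to the
// users/_search endpoint, using a personal access token as the bearer
// token instead of a token obtained via client credentials.
func newUserSearchRequest(baseURL, pat string) (*http.Request, error) {
	endpoint := strings.TrimRight(baseURL, "/") + "/management/v1/users/_search"
	// The empty JSON object as query body is an assumption for this sketch.
	req, err := http.NewRequest(http.MethodPost, endpoint, strings.NewReader("{}"))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+pat)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newUserSearchRequest("https://idp.example.com", "my-pat")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL, req.Header.Get("Authorization"))
}
```

Sending it is then a plain `http.DefaultClient.Do(req)`. A PAT is a long-lived credential, so it needs the same care as the client secret.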

adasauce commented 23 hours ago

more extra data:

I added support in netbird for using the Bearer "Access Token Type" instead of JWT from zitadel as well, and get the same could not read projectid by clientid error as before. So it's not to do with receiving and passing the jwt access token.

I also tried adding the urn:zitadel:iam:org:project:id:{projectid}:aud scope to the scopes when making the access token request as noted here: https://zitadel.com/docs/guides/integrate/service-users/client-credentials#2-authenticating-a-service-user-and-request-a-token but that also didn't make a difference.
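For reference, the token request with the extra scope looks roughly like the form below. This is a sketch: `tokenRequestForm` is a hypothetical helper, the `openid` scope next to the audience scope is an assumption, and the project id is a placeholder.

```go
package main

import (
	"fmt"
	"net/url"
)

// tokenRequestForm builds the form body for a client-credentials token
// request that also asks for the project audience scope.
func tokenRequestForm(projectID string) url.Values {
	form := url.Values{}
	form.Set("grant_type", "client_credentials")
	// Including "openid" alongside the audience scope is an assumption.
	form.Set("scope", "openid urn:zitadel:iam:org:project:id:"+projectID+":aud")
	return form
}

func main() {
	// "my-project-id" is a placeholder, not a real project id.
	fmt.Println(tokenRequestForm("my-project-id").Encode())
}
```

The encoded form goes into the body of the POST to the token endpoint, with the client id and secret sent as credentials.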

adasauce commented 6 hours ago

I'm getting another chance to look at this today, and at this point I'm pretty sure there's some undesired behaviour going on on the zitadel side here. I've followed all of the specs A-Z to build this token for a service user from their docs and their examples, but none of the resulting tokens will authenticate.

I think it may have been introduced in a big refactor on their side at 8e0c8393. If I make a small change to the auth flow in zitadel so that it doesn't assume every client_id request is a project request, only checking clientid against projectid when it's of the <id>@<org> format and otherwise continuing with the rest of the auth flow, everything works again. I'm going to open up an issue on the zitadel side and see if I can learn some more there.

edit: though there's not much talk on their github issues list about this, I found some folks complaining in discord about service accounts not working with the same error.
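The condition described above boils down to a format check on the client id. Zitadel is written in Go, but the snippet below is only an illustration of that check, not Zitadel's code; `looksLikeAppClientID` is a hypothetical name and the sample ids are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// looksLikeAppClientID reports whether clientID has the two-part
// "<id>@<org>" shape, i.e. exactly the case where a project lookup by
// client id would make sense. Anything else would skip that lookup.
func looksLikeAppClientID(clientID string) bool {
	id, org, found := strings.Cut(clientID, "@")
	return found && id != "" && org != ""
}

func main() {
	// Made-up sample ids, for illustration only.
	for _, c := range []string{"12345@example", "netbird"} {
		fmt.Println(c, looksLikeAppClientID(c))
	}
}
```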

adasauce commented 6 hours ago

https://github.com/zitadel/terraform-provider-zitadel/issues/199 I'm seeing the issue pop up in some other places as well. linking for posterity.