Open stevehipwell opened 2 years ago
Lets move this to objstore. I can get this done when I have time. Right now AKS workload identity is still in public preview, so it has not been high enough prio for me. This will most likely only require some docs changes and maybe some configuration changes, as we have already made the migration to the new SDK.
@phillebaba the public preview part is the direct AKS integration via a cluster option, the Azure AD Workload Identity technology looks to be GA and has been recommended over Azure AD Pod Identity (which is now deprecated) for a relatively long time.
Based on the implementations I've seen there is a code change required but it doesn't look to be too significant.
Yes sorry what I mean was the AKS integration. https://learn.microsoft.com/en-us/azure/aks/workload-identity-deploy-cluster
So the deprecation of AAD Pod Identity has more been used a signal to the community that no new features will be accepted into the project. Especially because the same people working on that project are also working on the new solution. However security patches will still be provided for a longer time.
The major reason we have not seen greater adoption of workload identities has been for the fact that it has not supported managed identities until recently. The reasons for this seemed to be due to technical limitations at the time in Azure. From my perspective there has been no point to switch until manage identity support was resolved which has just happened. https://github.com/Azure/azure-workload-identity/issues/325
I will try to get to this before the next release of Thanos.
Thanks @phillebaba, that would be great. You're absolutely right that this hasn't worked until very recently (even more recently if you wanted to use Terraform). The reason I'm so keen to get it working now is that the current Azure AD Pod Identity solution causes significant friction due to various defects and this is the path for us to remove that whole component from our AKS clusters.
Hi @phillebaba
I am new to open source contribution and I would like to work on this issue. Can I go ahead and work on this ?
I had a look at this just now, and honestly I am wondering if this is already working and all we need are some docs. I did the whole SDK upgrade during the summer for this reason. Has anyone actually tried to run Thanos with Azure Workload Identity in AKS? All required configuration is injected as environment variables to the Pod so the MSI configuration should just pick it up. I don't have time right now so if you want to you could maybe verify if it works @RagibAjmal and get back with the results.
Sure I will work on it and get back with the results
Hi @phillebaba ,
Please correct/guide me if I am wrong
I tried to access my azure storage container from thanos store using storage_account_key
and the pod works fine, but when I try to use user_assigned_id
(clientID of Managed Identity) instead of storage_account_key
, the pod goes to CrashLoopBackOff and I get the below log from the pod.
level=debug ts=2022-11-20T20:09:49.183966621Z caller=main.go:67 msg="maxprocs: Leaving GOMAXPROCS=[2]: CPU quota undefined" level=info ts=2022-11-20T20:09:49.18435653Z caller=factory.go:52 msg="loading bucket configuration" level=debug ts=2022-11-20T20:09:49.184471532Z caller=azure.go:139 msg="creating new Azure bucket connection" component=store level=error ts=2022-11-20T20:09:49.195210169Z caller=main.go:135 err="===== INTERNAL ERROR =====\nManagedIdentityCredential authentication failed\nGET http://169.254.169.254/metadata/identity/oauth2/token\n--------------------------------------------------------------------------------\nRESPONSE 400 Bad Request\n--------------------------------------------------------------------------------\n{\n \"error\": \"invalid_request\",\n \"error_description\": \"Identity not found\"\n}\n--------------------------------------------------------------------------------\n\nAzure API return unexpected error: *azblob.InternalError\n\ngithub.com/thanos-io/objstore/providers/azure.NewBucketWithConfig\n\t/go/pkg/mod/github.com/thanos-io/objstore@v0.0.0-20221006135717-79dcec7fe604/providers/azure/azure.go:170\ngithub.com/thanos-io/objstore/providers/azure.NewBucket\n\t/go/pkg/mod/github.com/thanos-io/objstore@v0.0.0-20221006135717-79dcec7fe604/providers/azure/azure.go:150\ngithub.com/thanos-io/objstore/client.NewBucket\n\t/go/pkg/mod/github.com/thanos-io/objstore@v0.0.0-20221006135717-79dcec7fe604/client/factory.go:70\nmain.runStore\n\t/app/cmd/thanos/store.go:254\nmain.registerStore.func1\n\t/app/cmd/thanos/store.go:198\nmain.main\n\t/app/cmd/thanos/main.go:133\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\ncreate AZURE client\ngithub.com/thanos-io/objstore/client.NewBucket\n\t/go/pkg/mod/github.com/thanos-io/objstore@v0.0.0-20221006135717-79dcec7fe604/client/factory.go:87\nmain.runStore\n\t/app/cmd/thanos/store.go:254\nmain.registerStore.func1\n\t/app/cmd/thanos/store.go:198\nmain.main\n\t/app/cmd/thanos/main.go:133\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\ncreate bucket client\nmain.runStore\n\t/app/cmd/thanos/store.go:256\nmain.registerStore.func1\n\t/app/cmd/thanos/store.go:198\nmain.main\n\t/app/cmd/thanos/main.go:133\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\npreparing store command failed\nmain.main\n\t/app/cmd/thanos/main.go:135\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594"
I can find the error is from objstore/providers/azure/azure.go so I tried to findout what happend by running in local
the function getContainerClient
in helpers.go
returns value when I provide user_assigned_id
but the function containerClient.GetProperties
prints the below value in my local
{{{
} }} ===== INTERNAL ERROR ===== ManagedIdentityCredential: Get "http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&client_id=**I_am_hiding_this_value**&resource=https%3A%2F%2Fstorage.azure.com": dial tcp 169.254.169.254:80: i/o timeout
Here is what i get when i use in storage_account_key
my local
{{{
2022-11-20 20:14:40 +0000 GMT 0x140000120d0 0x14000018237 0x14000012050 0x14000018235 0x14000018236 0x14000018238 2022-11-19 10:27:50 +0000 GMT 0x14000012070 0x14000012080 map[] 0x140000120a0 0x140000120b0} 0x14000078000}}
Also, I am just going through the tutorials of workload Identity and I find that a webhook is inserting the clientID into the pod as a env var named AZURE_CLIENT_ID
using ServiceAccount. I tried using the AZURE_CLIENT_ID
using sa and webhooks but still I am getting the CrashLoopBackOff. I tried finding if we are getting the value AZURE_CLIENT_ID
in objstore but I can only find we are getting the value user_assigned_id
through secrets.
here is the configuration using sa and webhooks in the pod
Environment: AZURE_CLIENT_ID: hidden by myself AZURE_TENANT_ID: hidden by myself AZURE_FEDERATED_TOKEN_FILE: /var/run/secrets/azure/tokens/azure-identity-token AZURE_AUTHORITY_HOST: https://login.microsoftonline.com/
Sounds like the SDK is not picking up the AZURE_CLIENT_ID parameter by itself which is a shame. I need to look further into the SDK to figure out how it is expected to be configured.
Ok so I have totally missed a detail in the SDK. Federated tokens are not supported out of the box in the latest Go SDK. It will be available in January hopefully. In the meantime I will implement the code snippet example to get support working. https://github.com/Azure/azure-sdk-for-go/issues/15615#issuecomment-1211012677
Hi @phillebaba
Can I work on this
Yeah sure, I would suggest you branch from thanos-io/objstore#35 as it contains a couple of SDK changes caused by the upgrade. It would be great if you could verify that token renewal actually works, as it is not stated. If I remember correctly the token renews every hour.
@phillebaba federated token gets updated once an hour, an access token - every 24h, so the end solution requires at least 25h of testing.
@phillebaba @weisdd. Could you please help me with how can we test the application or how/where do you test the application. My free subscription ended few days before , so I need to pay for kubernetes cluster. I cant find any playground for 25+hours. I am not able to test the application in minikube too as I am unable to configure workload identity on it. I dont know any other way. Please help me with if there are any other way
@RagibAjmal My approach is simple:
Even if you cannot afford to pay the cloud bills, there's a way for you to go forward. - While testing the token renewal process, you don't actually need to have a kubernetes cluster. You can write a comprehensive test suite with a test web server, whose responses would emulate token expiration, and you just have to make sure an updated federated token is sent over to your web server. In case of adal, it was enough to just make an explicit call to .Refresh
method (you can see it in my PR for cert-manager), the new library is likely to have something similar, it must be fun to research.
@phillebaba I've setup WI on my cluster with thanos now that it's GA. It doesn't seem to work:
level=error ts=2023-05-10T07:28:01.723782864Z caller=main.go:135 err="ManagedIdentityCredential authentication failed\nGET http://169.254.169.254/metadata/identity/oauth2/token\n--------------------------------------------------------------------------------\nRESPONSE 400 Bad Request\n--------------------------------------------------------------------------------\n{\n \"error\": \"invalid_request\",\n \"error_description\": \"Identity not found\"\n}\n--------------------------------------------------------------------------------\nTo troubleshoot, visit https://aka.ms/azsdk/go/identity/troubleshoot#managed-id\ncreate AZURE client\ngithub.com/thanos-io/objstore/client.NewBucket\n\t/go/pkg/mod/github.com/thanos-io/objstore@v0.0.0-20230201072718-11ffbc490204/client/factory.go:87\nmain.runCompact\n\t/app/cmd/thanos/compact.go:205\nmain.registerCompact.func1\n\t/app/cmd/thanos/compact.go:94\nmain.main\n\t/app/cmd/thanos/main.go:133\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594\npreparing compact command failed\nmain.main\n\t/app/cmd/thanos/main.go:135\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594"
I think the issue might be that neither DefaultAzureCredential nor FederatedCredential is created but rather the ManagedIdentityCredential is created directly, which is not a valid credential for workload identity. That credential signals the use of managed identity using the IMDS, instead of the federated token exchange endpoint used by workload identity.
@rouke-broersma azidentity v1.3.0 was released two days ago, with official support for workload identity. As soon as I get some time over I will upgrade the package and test it. Unless someone else has time right now to do it.
@phillebaba should I also open a new issue in thanos-io/objstore/ to track the work?
@fpetkovski @GiedriusS could you move this issue to objstore?
If that is not possible it would be better to create a new issue.
Good idea, moved.
@phillebaba @fpetkovski don't we need an issue in both repos? This repo needs an issue to track the actual work while the main Thanos repo needs an issue to track when Azure AD Workload Identity support has been added into Thanos.
Any update on this?
@phillebaba Any further updates on this? AAD Pod Identity is now end of life so we now have to use Thanos with the Azure Workload Identity Sidecar (which also won't be supported long-term).
PR has been created here #82 to resolve this issue. Just need someone to review!
Isn't this implemented with #82 now?
Yeah i think this issue can be closed! It can be tested with 0.33.0-rc.0 already
Is your proposal related to a problem?
The currently supported Azure AD Pod Identity is deprecated in favour of the new Azure AD Workload identity.
Describe the solution you'd like
I'd like Thanos to support Azure AD Workload Identity.
Describe alternatives you've considered
n/a
Additional context
The following 2 PRs are for adding this support to other projects and might help with the required changes.