microsoft / finops-toolkit

Tools and resources to help you adopt and implement FinOps capabilities that automate and extend the Microsoft Cloud.
https://aka.ms/finops/toolkit
MIT License

Optimization Engine - Managed Identity perms - causing various issues, I suspect #995

Closed psilantropy closed 1 month ago

psilantropy commented 1 month ago

Deployed AOE yesterday and having a few strange issues. Not sure if related to timing, or adjusting schedules or not.

I guess the issue is that AzureOptimizationAADObjectsV1_CL doesn't exist in my workspace, which I think is related to the managed identity not having the correct permissions, even though they were granted through the CLI as documented.

🐛 Problem

Some of my workbooks can only see the subscription where they were deployed along with a few other issues.

  1. Resources Inventory - one sub in drop down
  2. Resources Inventory - appears to be one sub
  3. Identities and Roles - "Query could not be parsed at ')' on line [3,26]" error
  4. Recommendations - one sub in drop down
  5. Reservations Usage - resource type drop down <query failed> error
  6. Costs Growing - Empty
  7. Block Blob Storage Usage - All subscriptions in drop down, but only selecting my deployment sub produces any results for the queries. The currency is currently EUR but should be NZD as per the settings applied.
  8. Policy Compliance - Only one subscription in drop down
  9. Benefits Simulation - Appears to be one subscription. Error present 'where' operator: Failed to resolve table or column expression named 'AzureOptimizationPricesheetV1_CL'...
  10. Savings Plan Usage - Error 'where' operator: Failed to resolve table or column expression named 'AzureOptimizationPricesheetV1_CL'...

👣 Repro steps

If I run az role assignment list against the managed identity object ID, I only see the Reader role scoped to the subscription. The enterprise reader role may not be visible at this level, however, since it must be assigned programmatically?
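
To make the check above easier to read across many scopes, here is a minimal sketch that groups role names by scope from the JSON that `az role assignment list` prints (for example with `--assignee <object-id> --all --output json`). The function name, the sample data, and the placeholder IDs are assumptions for illustration; only the `roleDefinitionName` and `scope` fields are taken from the CLI's output shape. Billing-level roles such as the EA enrollment reader are not expected to show up in this listing.

```python
import json


def summarize_role_assignments(raw_json: str) -> dict[str, list[str]]:
    """Group role names by scope from role-assignment JSON."""
    by_scope: dict[str, list[str]] = {}
    for assignment in json.loads(raw_json):
        scope = assignment.get("scope", "<unknown scope>")
        role = assignment.get("roleDefinitionName", "<unknown role>")
        by_scope.setdefault(scope, []).append(role)
    return by_scope


# Hypothetical sample shaped like the CLI's JSON output; the IDs are placeholders.
sample = json.dumps([
    {"roleDefinitionName": "Reader",
     "scope": "/subscriptions/00000000-0000-0000-0000-000000000000"},
    {"roleDefinitionName": "Reader",
     "scope": "/providers/Microsoft.Management/managementGroups/example-root"},
])

for scope, roles in summarize_role_assignments(sample).items():
    print(f"{scope}: {', '.join(roles)}")
```

If only the deployment subscription appears as a scope, the identity most likely still needs Reader on the other subscriptions or on a management group.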

Under Enterprise apps > Permissions, should I expect any consents granted here for the Entra workbooks? This is empty: "No admin consented permissions found for the application".

I have manually granted reader over the root management group to cover all subs and the Global Reader role.

Looking through the automation account I have a few failed jobs.

AADExpiringCredentialsToBlobStorage

Query failed. Debug the following query in the AOE Log Analytics workspace: let expiryInterval = 30d; let AppsAndKeys = materialize (AzureOptimizationAADObjectsV1_CL | where TimeGenerated > ago(1d) | where ObjectType_s in ('Application','ServicePrincipal') | where ObjectSubType_s != 'ManagedIdentity' | where Keys_s startswith '[' | extend Keys = parse_json(Keys_s) | project-away Keys_s | mv-expand Keys | evaluate bag_unpack(Keys) | union ( AzureOptimizationAADObjectsV1_CL | where TimeGenerated > ago(1d) | where ObjectType_s in ('Application','ServicePrincipal') | where ObjectSubType_s != 'ManagedIdentity' | where isnotempty(Keys_s) and Keys_s !startswith '[' | extend Keys = parse_json(Keys_s) | project-away Keys_s | evaluate bag_unpack(Keys) ) ); let ExpirationInRisk = AppsAndKeys | where EndDate < now()+expiryInterval | project ApplicationId_g, KeyId, RiskDate = EndDate; let NotInRisk = AppsAndKeys | where EndDate > now()+expiryInterval | project ApplicationId_g, KeyId, ComfortDate = EndDate; let ApplicationsInRisk = ExpirationInRisk | join kind=leftouter ( NotInRisk ) on ApplicationId_g | where isempty(ComfortDate) | summarize ExpiresOn = max(RiskDate) by ApplicationId_g; AppsAndKeys | join kind=inner (ApplicationsInRisk) on ApplicationId_g | summarize ExpiresOn = max(EndDate) by ApplicationId_g, ObjectType_s, DisplayName_s, Cloud_s, KeyType, TenantGuid_g | order by ExpiresOn desc

Trying to run the above in the workspace results in:

'where' operator: Failed to resolve table or column expression named 'AzureOptimizationAADObjectsV1_CL'

Recommend-UnusedAppGWsToBlobStorage

Similar error to above. Seems to be related to AzureOptimizationConsumptionV1_CL
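
Since several workbooks and runbooks fail with the same kind of message, a small triage helper can list which names are unresolved. This is only a sketch: the function and variable names are made up for illustration, the sample messages are copied from this thread, and the error text does not say whether the unresolved name is a table or a column.

```python
import re

# Matches the name quoted in a Log Analytics "Failed to resolve" error.
MISSING_NAME = re.compile(
    r"Failed to resolve table or column expression named '([^']+)'"
)


def missing_tables(messages: list[str]) -> list[str]:
    """Return the distinct unresolved names found in the messages, sorted."""
    found: set[str] = set()
    for message in messages:
        found.update(MISSING_NAME.findall(message))
    return sorted(found)


# Messages quoted from the workbooks and runbook jobs in this issue.
errors = [
    "'where' operator: Failed to resolve table or column expression named "
    "'AzureOptimizationAADObjectsV1_CL'",
    "'where' operator: Failed to resolve table or column expression named "
    "'AzureOptimizationPricesheetV1_CL'...",
    "Query could not be parsed at ')' on line [3,26]",
]

print(missing_tables(errors))
```

Each name in the result points at an export runbook that has not yet ingested data into the workspace.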

Standard warning on most of my jobs. I presume this is fine.

TenantId 'xxxxxxxxxxxxxxxxxxx' contains more than one active subscription. First one will be selected for further use. To select another subscription, use Set-AzContext. To override which subscription Connect-AzAccount selects by default, use Update-AzConfig -DefaultSubscriptionForLogin 00000000-0000-0000-0000-000000000000. Go to https://go.microsoft.com/fwlink/?linkid=2200610 for more information.

📷 Screenshots

image

ℹī¸ Additional context

EA agreement. 30+ subscriptions. I have Enrollment Administrator, Global Administrator. AzureOptimization_ConsumptionScope = BillingAccount

psilantropy commented 1 month ago

I have completed a partial upgrade, but now I'm just going to redeploy (upgrade). See if that makes any difference. Almost the end of the day so I could see results tomorrow.

psilantropy commented 1 month ago

image

Redeploy gives the same feedback as per the manual script to grant perms. AzureOptimizationAADObjectsV1_CL still not present.

psilantropy commented 1 month ago

Checking exceptions on the runbook I have;

The running command stopped because the preference variable "ErrorActionPreference" or common parameter is set to Stop: Insufficient privileges to complete the operation. Status: 403 (Forbidden) ErrorCode: Authorization_RequestDenied

I manually started this runbook. It looks like it's working now and exporting to blob.

The running command stopped because the preference variable "ErrorActionPreference" or common parameter is set to Stop: Exception of type 'System.OutOfMemoryException' was thrown.

Doh. (15k+ users, quite a large entra env.)

helderpinto commented 1 month ago

Hi, @psilantropy. Thanks for reporting this issue and for the detailed info - it helps a lot! It seems you have made some progress as per the screenshot you shared above - all the required permissions in Entra ID and in your EA/MCA were already granted, which means that some of the issues you listed should be resolved. Now, let's provide an answer to the problems you listed above:

  1. You must grant the AOE automation account identity the Reader role to other subscriptions or MGs in your environment. See here.
  2. Same as 1.
  3. The last error you got means AOE has the required permissions, but as your environment is large, data collection does not fit into the available memory in the automation sandbox. There are two possible work-arounds: 1) implement an Azure Automation Hybrid Worker and reconfigure AOE to use the Hybrid Worker (see details here); 2) filter the Entra ID users and groups, by creating the AzureOptimization_AADObjectsUserFilter and AzureOptimization_AADObjectsGroupFilter automation variables with an MS Graph OData filter.
  4. Same as 1.
  5. As the AOE identity seems to have the required permissions, this might be caused by either 1) permissions not having been granted at the time the Export-ReservationsUsageToBlobStorage runbook ran, in which case you have to wait for the next day; or 2) your company not having bought any Reservations.
  6. This is expected, because AOE does not have enough historical data to show costs growing anomalies. Give it at least 1 week.
  7. Same as 1. You see all subscriptions in the dropdown because it is using your permissions to build the dropdown. Regarding the NZD vs EUR, unfortunately, the deployment script is not replacing the currency (something to improve). You'll have to update the currency manually and save the workbook.
  8. Same as 1.
  9. As the AOE identity seems to have the required permissions, this might be caused by the permissions not having been granted at the time the Export-PriceSheetToBlobStorage runbook ran, in which case you have to wait for the next week (I can explain a work-around if you need that workbook operating correctly earlier).
  10. Same as 9.
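
For the second work-around in item 3, the filter variables take an MS Graph OData filter. A minimal sketch of composing such filter expressions follows. The helper names and the example conditions (enabled member users, groups whose display name starts with a hypothetical "FinOps" prefix) are assumptions for illustration, and this thread does not state the exact value format the AOE variables expect, so check the AOE documentation before using them.

```python
def odata_quote(value: str) -> str:
    """Quote a string literal for OData, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def and_filter(*clauses: str) -> str:
    """Join non-empty filter clauses with the OData `and` operator."""
    return " and ".join(clause for clause in clauses if clause)


# Hypothetical candidate for AzureOptimization_AADObjectsUserFilter.
user_filter = and_filter(
    "accountEnabled eq true",
    f"userType eq {odata_quote('Member')}",
)

# Hypothetical candidate for AzureOptimization_AADObjectsGroupFilter.
group_filter = f"startswith(displayName, {odata_quote('FinOps')})"

print(user_filter)
print(group_filter)
```

Narrowing the collected users and groups this way reduces the data the runbook must hold in memory, which is the point of the work-around.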

psilantropy commented 1 month ago

Hi @helderpinto, thanks for the great response.

1, 2, 4, 7 = Reader granted over our root management group.
3 = Thanks. Will look into completing this hybrid worker today.
5 = Yes, I suspect this is a timing issue. Will wait and see. We have about 30 res and 1 sp.
6 = Great :)
9, 10 = Happy to wait. I suspect I did have a delay on the perms, and then completed the partial upgrade flag after getting permissions. Then later deployed without the flag again.

Before I left work I set the schedule so I'd have a fresh runbook job run to review today. Quite a few failed, so I'll get that worker in place and go from there.

Some misc things after a quick review this morning, but don't look too much into it. I'll get the above sorted first;

Most runbook errors are related to AzureOptimizationConsumptionV1_CL. One was Export-ReservationsUsageToBlobStorage, which has the error: "Billing Account ID undefined. Use either the AzureOptimization_BillingAccountID variable or the BillingAccountID parameter". This variable is definitely set correctly.

Both of these jobs are sitting at Suspended (2.5 hrs): Export-AADObjectsToBlobStorage and Export-PolicyComplianceToBlobStorage.

psilantropy commented 1 month ago

Looks like time cured my one-sub in dropdown issues. :)

psilantropy commented 1 month ago

Things starting to look a bit better this afternoon. Just a few workbooks don't have all VMs across the subscriptions. Hopefully after the weekend it's all in place.

@helderpinto two hopefully quick questions if you had time.

1: AA Variable: AzureOptimization_RightSizeAdditionalPerfWorkspaces. Does setting this variable change the workspace specified? The way I interpreted the documentation was that it only pulled data and didn't add tables / make changes. I'm hoping the second question added those tables, but I suspect it was setting this variable.

2: I have also discovered our environment has an old version (only has 3 workbooks for example) of AOE deployed by an external contractor. It's not properly set up, and appears to be partially working. Will this impact my new AOE deployment at all? Same tenant, same subscription, different resource group and resources/sql/etc. I will probably delete this second instance of AOE.

Liking it so far. Great work.

helderpinto commented 1 month ago

@psilantropy, thanks for the comments. I am glad it is looking better. You need some patience with AOE - it's a packhorse, not a racehorse :-)

Now your questions: 1: This variable is only required if you have VMs sending guest OS Perf metrics (with the help of the AMA agent) to other LA workspaces. By adding those workspace IDs, you'll improve the fit score accuracy for the augmented Advisor right-size recommendations. Rest assured this variable does not make any change to the tables in the AOE workspace and does not compromise the overall AOE health.

2: Multiple AOE deployments in the same tenant should not impact each other, but as you are duplicating data collection, maybe it's better to remove the old instance.

psilantropy commented 1 month ago

1: Thanks. I have tables named the same in our primary workspace, which is where I pointed the variable, since it already has some perf data. Maybe this was a previous deployment doing this.
2: Planning to delete soon :).

Already made some gains from the PowerBI report, thank you.

psilantropy commented 1 month ago

Just confirming that yes the old implementation used our existing workspace, and sent tables there. So nothing to do with my secondary workspace like you said. I might need to figure out how to decom that correctly and how to handle those tables.

Things seem ok now. I had a few issues after enabling a hybrid worker, but they were PS module problems. Closing this off. Anything new will be a new, specific issue. Thanks for your help :)