microsoft / Microsoft365DSC

Manages, configures, extracts and monitors Microsoft 365 tenant configurations
https://aka.ms/M365DSC
MIT License
1.46k stars 440 forks source link

SPOTenantSettings: Failing to deploy or test the resource's state with latest Graph version #4746

Open ricmestre opened 4 weeks ago

ricmestre commented 4 weeks ago

Description of the issue

When @salbeck-sit added the TenantDefaultTimezone to SPOTenantSettings at first this was working, but it looks like with Graph 2.19 this isn't working properly anymore, the resource can be exported but not deployed or tested afterwards giving this error message

VERBOSE: [REDACTED]:                            [[SPOTenantSettings]SPOTenantSettings] Getting configuration for SPO Tenant
The type initializer for 'Azure.Core.Pipeline.DiagnosticScopeFactory' threw an exception.
+ CategoryInfo          : NotSpecified: (:) [], CimException
+ FullyQualifiedErrorId : Microsoft.Graph.PowerShell.Authentication.Cmdlets.ConnectMgGraph
+ PSComputerName        : localhost

It looks like our old PnP version has dlls that conflict with latest Graph version, but the fix is simple enough, we just need to connect to Graph first before connecting to SPO. I already tried this and it worked so I'll raise a PR with a fix for this.

Microsoft 365 DSC Version

1.24.605.1

Which workloads are affected

SharePoint Online

The DSC configuration

SPOTenantSettings "SPOTenantSettings"
        {
            ApplicationId                                 = "REDACTED";
            ApplyAppEnforcedRestrictionsToAdHocRecipients = $True;
            CertificateThumbprint                         = "REDACTED";
            CommentsOnSitePagesDisabled                   = $False;
            Ensure                                        = "Present";
            FilePickerExternalImageSearchEnabled          = $True;
            HideDefaultThemes                             = $False;
            IsSingleInstance                              = "Yes";
            LegacyAuthProtocolsEnabled                    = $False;
            MarkNewFilesSensitiveByDefault                = "AllowExternalSharing";
            MaxCompatibilityLevel                         = "15";
            MinCompatibilityLevel                         = "15";
            NotificationsInSharePointEnabled              = $True;
            OfficeClientADALDisabled                      = $True;
            OwnerAnonymousNotification                    = $True;
            PublicCdnAllowedFileTypes                     = "CSS,EOT,GIF,ICO,JPEG,JPG,JS,MAP,PNG,SVG,TTF,WOFF";
            PublicCdnEnabled                              = $False;
            SearchResolveExactEmailOrUPN                  = $True;
            SignInAccelerationDomain                      = "";
            SocialBarOnSitePagesDisabled                  = $False;
            TenantDefaultTimezone                         = "(UTC-08:00) Pacific Time (US and Canada)";
            TenantId                                      = "REDACTED";
            UseFindPeopleInPeoplePicker                   = $True;
            UsePersistentCookiesForExplorerView           = $False;
        }

Verbose logs showing the problem

VERBOSE: [REDACTED]:                            [[SPOTenantSettings]SPOTenantSettings] Getting configuration for SPO Tenant
The type initializer for 'Azure.Core.Pipeline.DiagnosticScopeFactory' threw an exception.
+ CategoryInfo          : NotSpecified: (:) [], CimException
+ FullyQualifiedErrorId : Microsoft.Graph.PowerShell.Authentication.Cmdlets.ConnectMgGraph
+ PSComputerName        : localhost

Environment Information + PowerShell Version

N/A
ykuijs commented 2 weeks ago

It looks like this fix doesn't work when you deploy other SPO resources before deploying the SPOTenantSettings. In that case a PNP has already been made and connecting via Graph still fails. Running a Disconnect-PnPOnline first doesn't work unfortunately.

ricmestre commented 2 weeks ago

I had, still have, a lot of things to take care of and forgot to open a new issue for it. Yes it doesn't solve the issue in all cases and I don't know how to solve it any other way if we need to keep using PnP 1.x instead of 2.x because of PS5, I'm pretty sure that updating PnP would solve this.

For my use case I solved it by processing the deployments separately by workload and then when it reaches SPO sort the resources to process SPOTenantSettings first, a big band-aid but it's the only way I have right now to keep this working.

ricmestre commented 2 weeks ago

Oh of course in between each deployment you also need to stop the DscCore provider inside Msft_Providers for the SYSTEM account in order to shutdown all connections to M365 otherwise you'll get the same problem.

ykuijs commented 2 weeks ago

I can reproduce the issue very easy by first running a Connect-PnPOnline and then a Connect-MgGraph. So far I haven't been able to fix a broken session 😢 But not giving up yet!

ricmestre commented 2 weeks ago

Yes pretty much that's what the resource was doing, if you reverse the connections then it works but as you've noticed if there are other SPO resources processed first, or even OD for that matters since it also uses PnP, then you get the same problem again.

I don't think there's an easy fix for this with PnP 1.x, or we could just get rid of retrieving the timezone and connection to Graph wouldn't be required.

EDIT: of course forgot to mention but it should be obvious, even if we remove that connection to Graph and it "works" if you try to process OD and SPO workloads before AAD or Intune you'll get the same problem, we were just lucky so far because the workloads are processed alphabetically and no one noticed the problem before.

ykuijs commented 2 weeks ago

A former team mate is part of the PnP team. Asked him and it sounded familiar to him. Has something to do with the order of loading dll's. The Graph is probably requiring newer ones, so if you first load PnP the older DLL's are loaded that Graph cannot handle.

Solution is to start loading of code in a separate process as described here.

Example:

$ApplicationId = '<id>'
$CertificateThumbprint = '<thumbprint>'
$TenantId = '<tenantname>.onmicrosoft.com'

Start-Job {
    $params = $args[0]
    New-M365DSCConnection -Workload 'PnP' -InboundParameters @{
        ApplicationID         = $params.ApplicationId
        CertificateThumbprint = $params.CertificateThumbprint
        TenantId              = $params.TenantId
    }

    Get-PnPContext
} -ArgumentList @{ApplicationID = $ApplicationId;CertificateThumbprint = $CertificateThumbprint;TenantId = $TenantId} | Receive-Job -Wait

New-M365DSCConnection -Workload 'MicrosoftGraph' -InboundParameters @{ApplicationID = $ApplicationId;CertificateThumbprint = $CertificateThumbprint;TenantId = $TenantId}
Get-MgContext

Since this happens with PnP PowerShell in combination with Graph, I would suggest to move all PnP cmdlets into a Start-Job function. That way, you can use resources in any random order.

ricmestre commented 2 weeks ago

Uh! That's actually quite clever, still a big band-aid if you ask me but much better than having to shuffle resources in a weird way to cope with the problem. I'd say go for it, right now I'm really busy and won't be able to do that, but I also spoke with Nik just about an hour ago and once I'm available I'll start sending several patches for EXO resources that are quite not working.

The reason I eventually found out this problem with SPO was because EXO also has problems if deployments are done next to Graph albeit the problem here is directly with WMI, not actually importing the DLLs in a certain order.

ykuijs commented 2 weeks ago

@ricmestre Why do EXO resources need updating? This has something to do with PnP and we are not using PnP for EXO, right?

When going down the Start-Job route: We need to think about batching requests. Each job is a new process and requires a new authentication towards M365. Would be wise to run multiple commands in one request.

ricmestre commented 2 weeks ago

I'm reviewing EXO workload and a lot of them have several different issues that need patches because they're not working, but another thing is that if I only use EXO resources (that are currently working) in a deployment they work fine but if I mix with Intune then WMI starts spitting errors in random order. Imagine the following example that I'm using in my integration tests, deploying EXO, Intune, O365, OD, SPO and Teams workload I found the problem with PnP and Graph and removed SPOTenantSettings from the deployment so when I tried again and it worked, but in my tests just right after the deployment I call Test-DscConfiguration and this always fails in EXO in random resources, not always the same, and with different WMI errors. Removing Intune from the equation the problems went away so to fix my problem was to make the deployments, and the tests after, separate per each workload.

See #4632 that is one of the errors I got, but I had others sometimes even without error codes just something like Error sending your request

salbeck-sit commented 2 weeks ago

Would you say that the here-and-now solution would be to use Powershell 7 ? I notice that the PnP-module is imported using -UseWindowsPowershell in MSCloudLoginAssistant when using pwsh and in that way isolating the module and it's use of DLLs.

ykuijs commented 2 weeks ago

Unfortunately, that isn't an option. The LCM is still based on PSv5.1 😢

ricmestre commented 2 weeks ago

@ykuijs As I've mentioned the problems I've seen with EXO + Graph were at the deployment stage on WMI events, not while importing the modules. Also this happens with ExchangeOnlineManagement 3.4.0, we had to rollback 3.5.0 since it currently doesn't work with service principals.

Nevertheless I've just seen someone reporting problems importing Graph modules if EXO is imported first (although it's 3.5.0 which we are not using, as explained above) so most likely your workaround for PnP might also need to be adapted for EXO and/or Graph, see https://github.com/microsoftgraph/msgraph-sdk-powershell/issues/2803

salbeck-sit commented 2 weeks ago

Has anyone looked into https://github.com/microsoft/CloudConfigurationManager as an alternative to DSC ?

salbeck-sit commented 2 weeks ago

I can answer that myself, seeing that @NikCharlebois and @andikrueger is on the list of contributors...

salbeck-sit commented 2 weeks ago

CCM doesn't in itself solve the basic problem but it looks much easier to modify Test- and Start-CCMConfiguration to batch resources in separate Powershell jobs and in this way avoid (most) DLL conflicts.

salbeck-sit commented 2 weeks ago

Well, 'much easier' is a stretch since PSJobs don't inherit global vars meaning that each resource is unaware of all others and will have to do its own authentication. This could be alleviated by using runspaces and synchronized hashtables like this article indicates: https://learn-powershell.net/2013/04/19/sharing-variables-and-live-objects-between-powershell-runspaces/ Such a change means that MSCloudLoginAssistant would need a rewrite wrt global vars as will everywhere else the global vars are referenced. Not sure how such a change would impact use with DSC

CCM could be modified to manage runspaces and 'shared variables' based on parameters in order to allow it to still be used with other DSC-resources.

andikrueger commented 2 weeks ago

CCM follows a different approach to remove the dependency on LCM. It will handle a configuration as such and make the necessary calls into the test and set functions of the resources.

From a DLL loading perspective and the description of the issues above it will make no difference as long as everything would be run in the same run space.

Using jobs might resolve this one but will cause endless sign-ins as long as the sessions cannot be shared.

Having one run space per workload might solve this issue all together.

This can also be achieved by a configuration that is split into workloads and executed on different host. Using this kind of approach allows parallelism and reduces runtime. Furthermore it allows for more flexibility in configuration and separation of credentials.

ricmestre commented 2 weeks ago

@andikrueger That's exactly my current workaround for this, I split my build and deployment scripts into separate workloads, whatever ones might be configured, but in my case I run them all in sequence always in the same host (using MSHosted pipeline containers so they are actually assigned random hosts per each run).

ykuijs commented 2 weeks ago

@ricmestre Really interested in how you have implemented this. I am in the final stages of updating the new version of the M365DSC whitepaper. Your ideas might be a good addition for vNext. Could you please share more details?

ricmestre commented 2 weeks ago

I have a build script that iterates per each tenant, and then per each workload, which converts individual policies in markdown to dsc format in a single blueprint and then compile it to MOF. That blueprint will only contain policies that were modified in a pull request so that during deployment I don't have to re-deploy all policies of the customer, the PR itself is also validated (with branch policy enforced) by ensuring that the change done to the markdown works, first by running a linter that I've created that checks for instance if the property being changed is an integer and you used a string, then if it's able to convert it to dsc format and if that converted blueprint can be compiled into MOF. I even go the extra mile and call Test-DscConfiguration on that MOF applied to the production tenant to ensure that at least the function Get-TargetResource of the resources being changed are working because I've seen in the past some problems even when getting. If all these tests are successful then the PR is good to be reviewed. After the PR being approved then a build pipeline runs which essentially does the same as the PR validation one except that it actually writes back the changes to the repo and it doesn't call Test-DscConfiguration.

In the deploy script then I will have individual MOF files separated by tenant which have their own folder and the MOFs to be deployed are called localhost-name_of_workload.mof, that I copy to localhost.mof to be processed and between each workload I have a sleep for 30s which I've seen helped in some scenarios where if they are processed too fast the deployment might fail. I also have integration tests for New/Update/Remove so while running the tests the mofs are actually called localhost-name_of_workload-test_mode.mof and additionally call Test-DscConfiguration to ensure that the resource is actually working fine in Set-TargetResource which is not always the case and that's how I usually find these bugs I report and patch.

ykuijs commented 1 week ago

Based on the PR of @ricmestre, @hvdbrink discovered that connecting with the Graph in the first resources that uses PnP also solves this issue. Since any of the resources that uses PnP can be the first resource, this means we have to add a Connect to Graph to every PnP resource, even when that resource doesn't use Graph. Not an ideal situation, but I would say definitely a quick and easy workaround for now. Really interested in your feedback of this workaround.

A more permanent solution could be the use of runspaces. Below some code for using shared variables across runspaces. I haven't tested this yet with connections to PnP, that is be a good next test:

$hash = [hashtable]::Synchronized(@{})
$hash.One = 1

Write-host ('Value of $Hash.One before background runspace is {0}' -f $hash.one) -ForegroundColor Green -BackgroundColor Black
$runspace = [runspacefactory]::CreateRunspace()
$runspace.Open()

$runspace.SessionStateProxy.SetVariable('Hash',$hash)

$powershell = [powershell]::Create()
$powershell.Runspace = $runspace

$powershell.AddScript({
    $hash.one++
}) | Out-Null

$handle = $powershell.BeginInvoke()
While (-Not $handle.IsCompleted) {
    Start-Sleep -Milliseconds 100
}

Write-host ('Value of $Hash.One after first execution in background runspace is {0}' -f $hash.one) -ForegroundColor Green -BackgroundColor Black

$powershell.Commands.Clear()

$powershell.AddScript({
    $hash.one += 20
}) | Out-Null

$handle = $powershell.BeginInvoke()
While (-Not $handle.IsCompleted) {
    Start-Sleep -Milliseconds 100
}

$powershell.EndInvoke($handle)
$runspace.Close()
$powershell.Dispose()

Write-host ('Value of $Hash.One after second execution in background runspace is {0}' -f $hash.one) -ForegroundColor Green -BackgroundColor Black
salbeck-sit commented 1 week ago

Given that the issue is the load-order of modules, here's a simple solution in MSCloudLoginAssistant for the PnP-workload:

(start of function)
if ($Global:MSCloudLoginConnectionProfile.PnP.Connected)
{
    Write-Verbose 'Already connected to PnP, not attempting to authenticate.'
    return
}

#Check if Graph-module is loaded and, if not, explicitly load before PnP
if (-not (Get-Module Microsoft.Graph.Authentication -ErrorAction SilentlyContinue))
{
    write-verbose "Explicit import of PS-module Microsoft.Graph.Authentication"
    Import-Module Microsoft.Graph.Authentication -ErrorAction SilentlyContinue
}

The last ErrorAction is for the fringe-case that the config doesn't contain any Graph-resources This could also be used with EXO as I've seen reports that the same issue has been noticed here.

ykuijs commented 1 week ago

That sounds like a more elegant, but still temporary solution. @salbeck-sit Have you tested that importing the module is the resolution instead of connecting to Graph?

salbeck-sit commented 1 week ago

I haven't tested it as such but the common issue is that each module contains a list of DLLs and the first version of a DLL loaded 'wins' - meaning that a subsequent load of a DLL with the same name is ignored, regardless of version. At least for importing Microsoft.Graph.Authentication, a few DLLs that are listed in the Dependencies folder of the module are loaded, most notably Azure.Core.DLL, provided that they weren't previously loaded. The same goes for PnP though the list of DLLs is way longer. So calling Connect-MgGraph shouldn't be necessary. Simply ensuring that the version supported by Graph is loaded first by explicitly importing the module should be all that's needed.

ricmestre commented 1 week ago

Yes, the actual connect, or calling any other cmdlets for that matters, is not important in this case, what we need to ensure is that the modules are loaded in the correct order because the first one wins as @salbeck-sit mentioned.

ykuijs commented 1 week ago

Have tested if importing is sufficient and you are absolutely right 😉 Will submit a PR for the MSCloudLoginAssistant to implement this workaround

ykuijs commented 1 week ago

Created PR and requested @NikCharlebois to review and merge: https://github.com/microsoft/MSCloudLoginAssistant/pull/175