slashdevops / idp-scim-sync

Keep your AWS Single Sign-On (SSO) groups and users in sync with your Google Workspace directory
Apache License 2.0

feat: Option to never delete group (upsert only) #70

Open guilhermeblanco opened 2 years ago

guilhermeblanco commented 2 years ago

Is your feature request related to a problem? Please describe. I'm always frustrated when the Google APIs fail and the sync removes several groups on the AWS SSO side. The Google APIs have failed multiple times in the last 5 weeks. When that happens, the API consistently returns 502 for several minutes (idp-scim-sync handles that nicely!), and once it comes back it erratically returns only a subset of the existing groups for about another minute before going back to normal.

When the recovery happens, all AWS SSO groups are re-created, but the permissionSet assignments on AWS SSO still reference the previously existing groups, forcing us to remap each group to its account and permission all over again. This process takes ~20 minutes via Terraform, since the permissionSet also creates new IAM roles that need to be correlated to EKS roles.

Describe the solution you'd like Introduce a new flag that prevents already created groups from being removed, so that groups can only be added or modified.
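To make the request concrete, here is a minimal Go sketch of what an upsert-only reconciliation could look like. It is not idp-scim-sync's actual code: the `Plan` type, the `planGroups` function and the `upsertOnly` parameter are all hypothetical names used only for illustration, and groups are reduced to plain name strings.

```go
package main

import (
	"fmt"
	"sort"
)

// Plan lists the group names to create and to delete on the AWS SSO side.
// Hypothetical type, not the tool's real data model.
type Plan struct {
	Create []string
	Delete []string
}

// planGroups diffs the groups returned by the identity provider against the
// groups recorded in the state file. When upsertOnly is true, groups missing
// from the identity provider are kept instead of being scheduled for deletion.
func planGroups(idp, state []string, upsertOnly bool) Plan {
	inIDP := make(map[string]bool, len(idp))
	for _, g := range idp {
		inIDP[g] = true
	}
	inState := make(map[string]bool, len(state))
	for _, g := range state {
		inState[g] = true
	}

	var p Plan
	for g := range inIDP {
		if !inState[g] {
			p.Create = append(p.Create, g)
		}
	}
	if !upsertOnly {
		for g := range inState {
			if !inIDP[g] {
				p.Delete = append(p.Delete, g)
			}
		}
	}
	// Sort so the plan is deterministic regardless of map iteration order.
	sort.Strings(p.Create)
	sort.Strings(p.Delete)
	return p
}

func main() {
	state := []string{"SSO-Admin", "SSO-Dev", "VPN-Ops"}
	idp := []string{"SSO-Admin"} // partial answer, as seen right after an outage

	fmt.Println(len(planGroups(idp, state, false).Delete)) // prints 2
	fmt.Println(len(planGroups(idp, state, true).Delete))  // prints 0
}
```

One design point such a flag would probably have to address: the groups that were kept should also stay in the stored state, otherwise the next run would see them as new and try to create groups that already exist.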

Describe alternatives you've considered An alternative would be, when the number of groups diverges widely (comparing the state file groups to the groups Google returned), to re-fetch the groups up to 3 times and only then proceed with the operation.
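The alternative could be sketched roughly as follows, again in Go and again with hypothetical names (`fetchFunc`, `fetchWithGuard`, `minRatio`); the 50% threshold in the example is an arbitrary assumption, not a value from the tool. The function re-fetches while the result looks suspiciously small compared to the state file, and after the last attempt it returns whatever it got together with a `trusted` flag, so the caller decides whether to proceed.

```go
package main

import "fmt"

// fetchFunc stands in for the call that lists groups from the identity provider.
type fetchFunc func() ([]string, error)

// fetchWithGuard calls fetch up to maxAttempts times. It returns early with
// trusted=true as soon as the number of groups is at least minRatio times the
// number of groups in the state file. If no attempt passes the check, the last
// result is returned with trusted=false. A real implementation would likely
// wait between attempts; that is left out to keep the sketch short.
func fetchWithGuard(fetch fetchFunc, stateCount, maxAttempts int, minRatio float64) (groups []string, trusted bool, err error) {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		groups, err = fetch()
		if err != nil {
			return nil, false, err
		}
		if float64(len(groups)) >= minRatio*float64(stateCount) {
			return groups, true, nil
		}
	}
	return groups, false, nil
}

func main() {
	// Fake provider: two partial answers (21 groups), then the full set (87).
	calls := 0
	fetch := func() ([]string, error) {
		calls++
		if calls < 3 {
			return make([]string, 21), nil
		}
		return make([]string, 87), nil
	}

	groups, trusted, err := fetchWithGuard(fetch, 87, 3, 0.5)
	fmt.Println(len(groups), trusted, err) // prints 87 true <nil>
}
```

With the numbers from the incident (21 groups returned against 87 in the state), a single check like this would have flagged the response before any deletion was planned.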

jdonohoo commented 2 years ago

Yes please... this has happened to us 2-3 times in the last 45 days or so

christiangda commented 2 years ago

hi @guilhermeblanco @jdonohoo sorry to hear about that.

I'll see if I can implement a workaround for this Google API problem. The issue here is consistency: I created this tool for the sake of consistency, and from the point of view of the state file it needs to mirror the exact Google Workspace state. If I now allow it to be inconsistent, when will it become consistent again? I hope that makes sense.

Google Workspace is the source of truth, and that is what makes sense to the program, so I think that if the source of truth fails, what you need to do is open a case with Google and give them the context of your problem.

But anyway, let me see if I can do something to mitigate this one.

jdonohoo commented 2 years ago

@christiangda I understand the concept of the state file and Workspace being the source of truth, but APIs do go down. When the IdP sync lambda runs during a Google outage / error, instead of doing nothing it wipes out the groups that errored.

So the source of truth is failing to be the source of truth, and the tool deletes groups from AWS SSO. When Google stops failing it recreates the groups, but all those AWS account assignments have the old group principalId that no longer exists, so your users lose all permissions.

Then the new group that was created has zero account assignments, so you have to go and reattach them.

I ended up writing a PowerShell script that runs as a GitHub Action cron job to clean up orphaned account assignments and rebuild them for all the AWS accounts.

I removed obvious secrets, but you should get the general idea here:

#Add all the AWS Account Numbers you want to make sure they have permissions
$awsAccounts = New-Object "System.Collections.Generic.Dictionary[String,String]";
$awsAccounts.Add("dev", "AWS_ACCOUNTNUMBER");

#Your Workspace Group(s) that you are syncing
$ssoGroups = @("Dev", "Ops");

$ssoPrincipals = New-Object "System.Collections.Generic.Dictionary[String,String]";

$instanceARN = "arn:aws:sso:::instance/ssoins-YOURVALUEHERE";
$permissionARN = "arn:aws:sso:::permissionSet/ssoins-YOURVALUEHERE";
$identityStore = "d-SOMEVALUE";

#Local Dev vs GitHub Action that will use the AWS Credentials Prepare Step
if ($env:OS -eq "Windows_NT") {
    $env:AWS_PROFILE = "master";
}

function Execute-Command ($commandTitle, $commandPath, $commandArguments) {
    Try {
        $pinfo = New-Object System.Diagnostics.ProcessStartInfo
        $pinfo.FileName = $commandPath
        $pinfo.RedirectStandardError = $true
        $pinfo.RedirectStandardOutput = $true
        $pinfo.UseShellExecute = $false
        $pinfo.Arguments = $commandArguments
        $p = New-Object System.Diagnostics.Process
        $p.StartInfo = $pinfo
        $p.Start() | Out-Null
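        # Drain stderr asynchronously while reading stdout so a full pipe
        # buffer cannot block the child process, then wait for it to exit
        # so that ExitCode is valid when it is read.
        $stderrTask = $p.StandardError.ReadToEndAsync()
        $stdout = $p.StandardOutput.ReadToEnd()
        $p.WaitForExit()
        [pscustomobject]@{
            commandTitle = $commandTitle
            stdout       = $stdout
            stderr       = $stderrTask.Result
            ExitCode     = $p.ExitCode
        }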
    }
    Catch {
        exit
    }
}

function Write-Log($logMessage) {
    $ts = (Get-Date).ToString("yyyy-MM-dd hh:mm:ss tt")

    Write-Host $ts "|" $logMessage
}

function Main() {

    #Get The Active Group Principals

    foreach ($group in $ssoGroups) {    
        Write-Log "Getting Principal for $group..."
        $command = Execute-Command -commandTitle "Get Principal" -commandPath "aws" -commandArguments "identitystore list-groups --identity-store-id $identityStore --filter AttributePath=DisplayName,AttributeValue=$group"
        $json = $command.stdout | ConvertFrom-Json
        $principal = $json.Groups[0].GroupId
        Write-Log $principal

        $ssoPrincipals.Add($group, $principal);

    }

    #Look Up the AccountAssignment foreach Account
    foreach ($key in $awsAccounts.Keys) {
        $accountNumber = '';
        $result = $awsAccounts.TryGetValue($key, [ref]$accountNumber) 
        Write-Log "$accountNumber|$key"

        $command = Execute-Command -commandTitle "Get Account Assignments" -commandPath "aws" -commandArguments "sso-admin list-account-assignments --instance-arn $instanceARN --account-id $accountNumber --permission-set-arn $permissionARN"
        $json = $command.stdout | ConvertFrom-Json

        #Write-Log $command.stderr
        #Write-Log $command.stdout

        foreach ($assignment in $json.AccountAssignments) {
            $principalId = $assignment.PrincipalId
            $principalType = $assignment.PrincipalType
            Write-Log "$principalType | $principalId"

            #Write-Log $assignment
            #Remove AccountAssignments that have PrincipalIds that don't match active group principalIds
            if (!$ssoPrincipals.ContainsValue($principalId)) {
                Write-Log "$principalId should be killed with fire"
                $command = Execute-Command -commandTitle "Delete Account Assignment" -commandPath "aws" `
                    -commandArguments "sso-admin delete-account-assignment --instance-arn $instanceARN --target-id $accountNumber --target-type AWS_ACCOUNT --permission-set-arn $permissionARN --principal-type GROUP --principal-id $principalId"
                Write-Log "$($command.commandTitle) for $principalId"
            }

        }

        foreach ($k in $ssoPrincipals.Keys) {
            #Write-Log $k

            $checkPrincipal = '';
            $result = $ssoPrincipals.TryGetValue($k, [ref]$checkPrincipal)

            #Write-Log $checkPrincipal

            $checkAssignment = $json.AccountAssignments | where { $_.PrincipalId -eq $checkPrincipal }

            #Write-Log "check: $checkAssignment"

            #Create AccountAssignments for the Active PrincipalIds if they don't exist
            if ($null -eq $checkAssignment) {
                Write-Log "$accountNumber missing $checkPrincipal"
                $command = Execute-Command -commandTitle "Create Account Assignment" -commandPath "aws" `
                    -commandArguments "sso-admin create-account-assignment --instance-arn $instanceARN --target-id $accountNumber --target-type AWS_ACCOUNT --permission-set-arn $permissionARN --principal-type GROUP --principal-id $checkPrincipal"
            }
        }

    }

    ### WINNING ###

    #Don't want this set on my terminal outside of execution
    if ($env:OS -eq "Windows_NT") {
        $env:AWS_PROFILE = "";
    }
}

Main;

guilhermeblanco commented 2 years ago

Hi @christiangda,

I completely understand your point of view. Google is the source of truth, and in an ideal world it could be 100% trusted. Unfortunately, no system can guarantee 100% uptime and accuracy, which is why they are offered at 99.99% reliability (the Google Directory API is four 9's, not five). What I see in the tool is that it is a synchronization tool, not a strict 1:1 synchronization tool; otherwise, flags like group filtering and user filtering would not be present. Since those already break 1:1 parity, introducing a group-upsert-only flag is the least bad option.

I want to illustrate exactly what happens during a Google outage, so you understand how painful it is to recover and why a flag that prevents this chaos wouldn't hurt. Apologies that this incident is from a while back, but it is the one we investigated most deeply. Here are the internal incident report notes I want to share:

Google Admin SDK - Directory API failed between 2022-04-07 20:52 and 2022-04-07 20:59 UTC throwing 503 errors. When it returned at 2022-04-07 20:59:19 UTC, it returned an inconsistent amount of both groups and users, causing the ssosync to remove 66 groups and 4 users in AWS SSO. When Google API went back to normal at 2022-04-07 21:01:41 UTC, it recreated 66 groups and all group memberships again, but these groups were new ones and they were not attached/provisioned to individual AWS accounts. Attached are the logs that clearly show Google failure and what happened internally.

2022-04-07 20:52:46.018 START RequestId: 7b5def82-1eaa-446b-b4fd-eb73abac643e Version: $LATEST 
2022-04-07 20:52:46.022 time="2022-04-07T20:52:46Z" level=debug msg="reading secret" name="arn:aws:secretsmanager:us-east-1:XXXXXXXXXX:secret:SSOSyncGoogleAdminEmail-XXXXXX" 
2022-04-07 20:52:46.082 time="2022-04-07T20:52:46Z" level=debug msg="reading secret" name="arn:aws:secretsmanager:us-east-1:XXXXXXXXXX:secret:SSOSyncGoogleCredentials-XXXXXX" 
2022-04-07 20:52:46.119 time="2022-04-07T20:52:46Z" level=debug msg="reading secret" name="arn:aws:secretsmanager:us-east-1:XXXXXXXXXX:secret:SSOSyncSCIMAccessToken-XXXXXX" 
2022-04-07 20:52:46.151 time="2022-04-07T20:52:46Z" level=debug msg="reading secret" name="arn:aws:secretsmanager:us-east-1:XXXXXXXXXX:secret:SSOSyncSCIMEndpointUrl-XXXXXX" 
2022-04-07 20:52:46.170 time="2022-04-07T20:52:46Z" level=info msg="getting Identity Provider data" group_filter="[name:SSO* name:VPN*]" 
2022-04-07 20:52:46.170 time="2022-04-07T20:52:46Z" level=info msg="starting sync groups" codeVersion=v0.0.9 
2022-04-07 20:52:47.951 time="2022-04-07T20:52:47Z" level=warning msg="skipping member because is a group, but group members will be included" email=some_group@example.com id=02iq8gzs3n7dlbg 
2022-04-07 20:52:48.030 time="2022-04-07T20:52:48Z" level=warning msg="skipping member because is a group, but group members will be included" email=some_group@example.com id=02iq8gzs3n7dlbg 
2022-04-07 20:52:49.064 Error: cannot sync groups and their members: error getting groups members: idp: error getting group members: idp: error listing group members: googleapi: Error 503: The service is currently unavailable., backendError 
2022-04-07 20:52:49.065 cannot sync groups and their members: error getting groups members: idp: error getting group members: idp: error listing group members: googleapi: Error 503: The service is currently unavailable., backendError: withStack null 
2022-04-07 20:52:49.065 -v, --version                           version for idpscim 
2022-04-07 20:52:49.065 -m, --sync-method string                Sync method to use [groups] (default "groups") 
2022-04-07 20:52:49.065 -l, --log-level string                  set the log level [panic|fatal|error|warn|info|debug|trace] (default "info") 
2022-04-07 20:52:49.065 -f, --log-format string                 set the log format (default "text") 
2022-04-07 20:52:49.065 -h, --help                              help for idpscim 
2022-04-07 20:52:49.065 -u, --gws-user-email string             GWS user email with allowed access to the Google Workspace Service Account 
2022-04-07 20:52:49.065 -s, --gws-service-account-file string   path to Google Workspace service account file (default "credentials.json") 
2022-04-07 20:52:49.065 -q, --gws-groups-filter strings         GWS Groups query parameter, example: --gws-groups-filter 'name:Admin* email:admin*' --gws-groups-filter 'name:Power* email:power*' 
2022-04-07 20:52:49.065 -d, --debug                             fast way to set the log-level to debug 
2022-04-07 20:52:49.065 -c, --config-file string                configuration file (default ".idpscim.yaml") 
2022-04-07 20:52:49.065 -e, --aws-scim-endpoint string          AWS SSO SCIM API Endpoint 
2022-04-07 20:52:49.065 -t, --aws-scim-access-token string      AWS SSO SCIM API Access Token 
2022-04-07 20:52:49.065 -b, --aws-s3-bucket-name string         AWS S3 Bucket name to store the state 
2022-04-07 20:52:49.065 -k, --aws-s3-bucket-key string          AWS S3 Bucket key to store the state (default "state.json") 
2022-04-07 20:52:49.065 Flags: 
2022-04-07 20:52:49.065 idpscim [flags] 
2022-04-07 20:52:49.065 Usage: 
2022-04-07 20:52:49.066 REPORT RequestId: 7b5def82-1eaa-446b-b4fd-eb73abac643e Duration: 3044.77 ms Billed Duration: 3045 ms Memory Size: 2048 MB Max Memory Used: 58 MB 
2022-04-07 20:52:49.066 END RequestId: 7b5def82-1eaa-446b-b4fd-eb73abac643e 

This continues for several minutes, until eventually the API comes back. This is what we experience:

2022-04-07 20:58:46.031 START RequestId: 3243d24e-5393-4a61-b2ea-c053c5a408db Version: $LATEST 
2022-04-07 20:58:46.174 time="2022-04-07T20:58:46Z" level=debug msg="reading secret" name="arn:aws:secretsmanager:us-east-1:XXXXXXXXXX:secret:SSOSyncGoogleAdminEmail-XXXXXX" 
2022-04-07 20:58:46.174 time="2022-04-07T20:58:46Z" level=debug msg="reading secret" name="arn:aws:secretsmanager:us-east-1:XXXXXXXXXX:secret:SSOSyncGoogleCredentials-XXXXXX" 
2022-04-07 20:58:46.174 time="2022-04-07T20:58:46Z" level=debug msg="reading secret" name="arn:aws:secretsmanager:us-east-1:XXXXXXXXXX:secret:SSOSyncSCIMAccessToken-XXXXXX" 
2022-04-07 20:58:46.174 time="2022-04-07T20:58:46Z" level=debug msg="reading secret" name="arn:aws:secretsmanager:us-east-1:XXXXXXXXXX:secret:SSOSyncSCIMEndpointUrl-XXXXXX" 
2022-04-07 20:58:46.192 time="2022-04-07T20:58:46Z" level=info msg="getting Identity Provider data" group_filter="[name:SSO* name:VPN*]" 
2022-04-07 20:58:46.192 time="2022-04-07T20:58:46Z" level=info msg="starting sync groups" codeVersion=v0.0.9 
2022-04-07 20:58:46.669 time="2022-04-07T20:58:46Z" level=warning msg="skipping member because is a group, but group members will be included" email=vpn_aws_development_workload@example.com id=02nusc1940yy217 
............long list of groups being skipped......................
2022-04-07 20:59:14.335 time="2022-04-07T20:59:14Z" level=info msg="getting state data" 
2022-04-07 20:59:14.435 SDK 2022/04/07 20:59:14 WARN Response has no supported checksum. Not validating response payload. 
2022-04-07 20:59:14.443 time="2022-04-07T20:59:14Z" level=info msg="syncing from state" lastsync="2022-04-07T20:57:36Z" since=1m38.443093026s 
2022-04-07 20:59:14.444 time="2022-04-07T20:59:14Z" level=info msg="reconciling groups" idp=21 state=87 
2022-04-07 20:59:14.444 time="2022-04-07T20:59:14Z" level=info msg="provider groups and state groups are different" 
2022-04-07 20:59:14.445 time="2022-04-07T20:59:14Z" level=warning msg="deleting groups" quantity=66 
2022-04-07 20:59:14.445 time="2022-04-07T20:59:14Z" level=info msg="no groups to be updated" 
2022-04-07 20:59:14.445 time="2022-04-07T20:59:14Z" level=info msg="no groups to be create" 
2022-04-07 20:59:19.719 time="2022-04-07T20:59:19Z" level=warning msg="deleting user" email=some_user@example.com user="Some User" 
2022-04-07 20:59:19.719 time="2022-04-07T20:59:19Z" level=warning msg="deleting users" quantity=4 
2022-04-07 20:59:19.719 time="2022-04-07T20:59:19Z" level=info msg="no users to be updated" 
2022-04-07 20:59:19.719 time="2022-04-07T20:59:19Z" level=info msg="no users to be created" 
2022-04-07 20:59:19.719 time="2022-04-07T20:59:19Z" level=info msg="reconciling users" idp=44 state=48 
2022-04-07 20:59:19.719 time="2022-04-07T20:59:19Z" level=info msg="provider users and state users are different" 
2022-04-07 20:59:20.328 time="2022-04-07T20:59:20Z" level=warning msg="deleting user" email=guilherme.blanco@example.com user="Guilherme Blanco" 
2022-04-07 20:59:20.566 time="2022-04-07T20:59:20Z" level=warning msg="deleting user" email=some_user2@example.com user="Some User II" 
2022-04-07 20:59:20.825 time="2022-04-07T20:59:20Z" level=warning msg="deleting user" email=some_user3@example.com user="Some User III" 
2022-04-07 20:59:21.104 time="2022-04-07T20:59:21Z" level=info msg="provider groups-members and state groups-members are different" 
2022-04-07 20:59:21.109 time="2022-04-07T20:59:21Z" level=info msg="reconciling groups members" idp=21 state=87 
2022-04-07 20:59:21.117 time="2022-04-07T20:59:21Z" level=warning msg="removing member from group" email=guilherme.blanco@example.com group=SSO-AWS-Management-Administrator 
2022-04-07 20:59:21.117 time="2022-04-07T20:59:21Z" level=warning msg="removing member from group" email=some_user2@example.com group=SSO-AWS-Management-Administrator 
2022-04-07 20:59:21.117 time="2022-04-07T20:59:21Z" level=warning msg="removing member from group" email=some_user3@example.com group=SSO-AWS-Management-Administrator 
2022-04-07 20:59:21.117 time="2022-04-07T20:59:21Z" level=warning msg="removing users to groups" quantity=66 
2022-04-07 20:59:21.117 time="2022-04-07T20:59:21Z" level=info msg="no users to be joined to groups" 
2022-04-07 20:59:21.163 time="2022-04-07T20:59:21Z" level=warning msg="removing member from group" email=guilherme.blanco@example.com group=SSO-AWS-Development-Administrator
2022-04-07 20:59:21.246 time="2022-04-07T20:59:21Z" level=warning msg="removing member from group" email=developer1@example.com group=SSO-AWS-Development-PowerUser 
2022-04-07 20:59:21.246 time="2022-04-07T20:59:21Z" level=warning msg="removing member from group" email=developer2@example.com group=SSO-AWS-Development-PowerUser 
2022-04-07 20:59:21.246 time="2022-04-07T20:59:21Z" level=warning msg="removing member from group" email=developer3@example.com group=SSO-AWS-Development-PowerUser 
2022-04-07 20:59:21.247 time="2022-04-07T20:59:21Z" level=warning msg="removing member from group" email=developer4@example.com group=SSO-AWS-Development-PowerUser 
2022-04-07 20:59:21.247 time="2022-04-07T20:59:21Z" level=warning msg="removing member from group" email=developer5@example.com group=SSO-AWS-Development-PowerUser 
2022-04-07 20:59:21.247 time="2022-04-07T20:59:21Z" level=warning msg="removing member from group" email=developer6@example.com group=SSO-AWS-Development-PowerUser 
2022-04-07 20:59:21.464 time="2022-04-07T20:59:21Z" level=warning msg="removing member from group" email=guilherme.blanco@example.com group=SSO-AWS-Staging-Administrator
................this keeps going forever deleting users.......................
2022-04-07 20:59:23.627 time="2022-04-07T20:59:23Z" level=info msg="storing the new state" groups=21 lastSync="2022-04-07T20:59:23Z" users=44 
2022-04-07 20:59:23.707 time="2022-04-07T20:59:23Z" level=info msg="sync groups completed" duration=37.515191425s 
2022-04-07 20:59:23.707 time="2022-04-07T20:59:23Z" level=info msg="sync completed" 
2022-04-07 20:59:23.709 REPORT RequestId: 3243d24e-5393-4a61-b2ea-c053c5a408db Duration: 37674.80 ms Billed Duration: 37675 ms Memory Size: 2048 MB Max Memory Used: 58 MB 
2022-04-07 20:59:23.709 END RequestId: 3243d24e-5393-4a61-b2ea-c053c5a408db 

Subsequent calls keep alternating the groups and users returned, further deleting groups (even newly created ones) and users. Eventually, the API comes back to normal, and all groups, users and memberships get recreated.

Now the challenge comes when we talk about AWS SSO groups. In AWS SSO, we assign the Group, Account and PermissionSet together, saying for example that Administrator has AdministratorAccess over account XXXXXXXX. When the group is removed, this assignment is lost, and all the users see is an empty access space, like in the picture below:

Screenshot from 2022-04-08 09-42-02

It also goes further, as:

  1. We use EKS clusters, and the generated IAM roles (e.g. AWSReservedSSO_AdministratorAccess_02ca65b47d7fe656) are mapped as a role for system:masters in EKS, allowing administrators to perform their duties.
  2. We use LakeFormation, and the generated IAM roles are also mapped to data security constraints in the Data Catalog, granting or restricting access to databases, tables, columns and rows.

To resolve this issue, we updated our IaC logic to rely on group names instead of their IDs: it fetches the group by name and then uses the result to perform the account assignment. This lets us re-assign the permissions just by running terraform apply again. This process is quick (and the same as what @jdonohoo illustrated) and takes roughly 5 minutes.

After that, we need to go through every account, re-map all the IAM roles that AWS SSO reprovisioned in the assigned accounts, and run IaC again to update the mapped roles in EKS.

Finally, we need to go into the Reporting account, remap all the reprovisioned IAM roles there too, and run IaC again to grant our data staff access to the Data Catalog components.

As you can see, if this had happened once it wouldn't be so concerning, but since it is now happening almost every week, we need to act on it. It currently takes someone from the infrastructure team roughly 4h to address everything, and it is getting more painful as we keep adding new job functions (read: groups), assignments (new accounts for developers or other teams) and users (we are still rolling out VPN across the organization), rapidly expanding our usage of, and reliance on, this portion of the system.

This sprint we have a task to ensure this is resolved. We would appreciate it if you could help us resolve it sooner. Our only other alternative would be to fork and maintain it indefinitely, which I don't think is the best outcome for either of us. =)

Sorry for the long post; I wanted to show you exactly how painful it is, so you understand the motivation behind this request.

jdonohoo commented 2 years ago

@guilhermeblanco we also use EKS, and are living through the pain in real time. My script got devs/admins back into AWS so they can continue to work.

However, they now have no cluster access to EKS until someone in our ops group goes and redoes the mappings like you described.

This is a total nightmare. We are at the same point of considering a fork, which seems like the wrong approach for everyone.

christiangda commented 2 years ago

hi @guilhermeblanco, @jdonohoo thank you for the detailed information and explanation.

Now I understand your problem very well.

I'm checking the code to figure out how to mitigate this, but I will need your help.

Let me first explain a little about the program's main workflow (see: main workflow - png / main workflow - html):

  1. Get from IDP -> Groups, Users and Groups Members. see -> workflow step: 1 - 10
  2. Check if this is the first time syncing. see -> workflow step: 11
     2.a First time syncing. see -> workflow step: 11-2
     2.b Reconciling with State File. see -> workflow step: 11-1
  3. Once we sync the changes between the IDP and SCIM using paths 2.a or 2.b we create the new State File. see -> workflow step: 12
  4. The State File is saved. see -> workflow step: 11

So, having said that:

  1. In the @guilhermeblanco log trace you can see 2022-04-07 20:52:49.064 Error: cannot sync groups and their members: error getting groups members: idp: error getting group members: idp: error listing group members: googleapi: Error 503: The service is currently unavailable., backendError, which means the program stopped because the server returned an error.
  2. There is no way to reach the SCIM API side without having the IDP API data. As you can see in workflow step: 1 - 10, until we have all the data from the IDP API (Groups, Users and Groups Members), the program doesn't execute any operation on the SCIM API side.
  3. The only way you can get the error you are experiencing is when the program fails in workflow step: 11-1-4, workflow step: 11-1-8 or workflow step: 11-1-13, that is, at the moment the program tries to create/delete/update entities on the SCIM API side once it has all the IDP API data, for example if the AWS Lambda is interrupted.

Please check the main workflow - png and let me know if you understand me.

IMPORTANT:

  1. @guilhermeblanco I see in your logs that you are not using the latest version of the program. Please update ASAP, because since the version you are using I have fixed a lot of problems on the SCIM API side and a few on the IDP API side too.
  2. @jdonohoo which version of the program are you using?

Maybe the version of the program is the problem, because I don't have any of these issues in my production implementation, and I manage more than 105 users and 40 groups.

guilhermeblanco commented 2 years ago

Hi @christiangda!

I completely understand you. As a quick stop-gap, I upgraded to version 0.0.12 (we were running 0.0.10 until now; the trace was from months ago). I'll actively monitor the lambda execution, and in case this issue happens again I'll be able to report it here.

My ask is to keep the ticket open for now until I either come back in 30 days to close it, or the issue happens again and we can continue the conversation. In the meantime, rest assured I'll have all eyes on this.

As for numbers, we have reached 91 groups and 125 users for the moment. I am holding off on rolling this out organization-wide until we are sure the issue I reported is no longer a problem. The expectation is to have a similar number of groups, but over 500 users.

Thanks for your help and dedication to this project! =)

obscurerichard commented 2 years ago

This feature would have been useful to have for another reason - I just tried to use this project on an SSO installation that had a bunch of manually created groups. It ended up deleting all of them, which I did not expect. In a situation where you are transitioning to using groups provisioned by SCIM it would be nice to be able to keep any manually-created groups until they could be fully deprecated.

guilhermeblanco commented 2 years ago

Hi @obscurerichard! You are absolutely right, this feature would fit nicely.

guilhermeblanco commented 2 years ago

Hi @christiangda!

As I mentioned, I have been closely monitoring these executions so I could report back on the next Google failure. Today at 16:40 Eastern (20:40 UTC) we had another episode.

Personally, I think it is time to have this supported, and I'll gladly review your flowchart again and dedicate some time to getting this implemented. A PR should come your way within the next few days, but please be patient, as I am not a Go developer. All I'd ask is that you help me ensure the solution is robust and pick the best name for the new flag. So far the best I could think of is aws-groups-upsert-only.

I'm happy to share the logs below of how it happened:

2022-06-07T16:40:05.366-04:00   START RequestId: c2408b58-71f4-41d8-9071-80d358dc3791 Version: $LATEST
2022-06-07T16:40:05.369-04:00   time="2022-06-07T20:40:05Z" level=debug msg="reading secret" name="arn:aws:secretsmanager:us-east-1:XXXXXXXXXX:secret:SSOSyncGoogleAdminEmail-XXXXXX"
2022-06-07T16:40:05.447-04:00   time="2022-06-07T20:40:05Z" level=debug msg="reading secret" name="arn:aws:secretsmanager:us-east-1:XXXXXXXXXX:secret:SSOSyncGoogleCredentials-XXXXXX"
2022-06-07T16:40:05.486-04:00   time="2022-06-07T20:40:05Z" level=debug msg="reading secret" name="arn:aws:secretsmanager:us-east-1:XXXXXXXXXX:secret:SSOSyncSCIMAccessToken-XXXXXX"
2022-06-07T16:40:05.507-04:00   time="2022-06-07T20:40:05Z" level=debug msg="reading secret" name="arn:aws:secretsmanager:us-east-1:XXXXXXXXXX:secret:SSOSyncSCIMEndpointUrl-XXXXXX"
2022-06-07T16:40:05.544-04:00   time="2022-06-07T20:40:05Z" level=info msg="starting sync groups" codeVersion=v0.0.12
2022-06-07T16:40:05.545-04:00   time="2022-06-07T20:40:05Z" level=info msg="getting Identity Provider data" group_filter="[name:SSO* name:VPN*]"
2022-06-07T16:40:06.011-04:00   time="2022-06-07T20:40:06Z" level=warning msg="skipping member because is a group, but group members will be included" email=some_group@example.com id=02iq8gzs3n7dlbg
2022-06-07T16:40:06.011-04:00   time="2022-06-07T20:40:06Z" level=warning msg="skipping member because is a group, but group members will be included" email=some_group@example.com id=02iq8gzs3n7dlbg
............long list of groups being skipped......................
2022-06-07T16:40:51.659-04:00   time="2022-06-07T20:40:51Z" level=info msg="getting state data"
2022-06-07T16:40:51.794-04:00   time="2022-06-07T20:40:51Z" level=info msg="syncing from state" lastsync="2022-06-07T20:38:22Z" since=2m29.792929871s
2022-06-07T16:40:51.794-04:00   time="2022-06-07T20:40:51Z" level=info msg="provider groups and state groups are different"
2022-06-07T16:40:51.794-04:00   time="2022-06-07T20:40:51Z" level=info msg="reconciling groups" idp=21 state=94
2022-06-07T16:40:51.794-04:00   time="2022-06-07T20:40:51Z" level=info msg="no groups to be create"
2022-06-07T16:40:51.794-04:00   time="2022-06-07T20:40:51Z" level=info msg="no groups to be updated"
2022-06-07T16:40:51.794-04:00   time="2022-06-07T20:40:51Z" level=warning msg="deleting groups" quantity=73
2022-06-07T16:40:58.501-04:00   time="2022-06-07T20:40:58Z" level=info msg="provider users and state users are different"
2022-06-07T16:40:58.501-04:00   time="2022-06-07T20:40:58Z" level=info msg="reconciling users" idp=53 state=132
2022-06-07T16:40:58.501-04:00   time="2022-06-07T20:40:58Z" level=info msg="no users to be created"
2022-06-07T16:40:58.501-04:00   time="2022-06-07T20:40:58Z" level=info msg="no users to be updated"
2022-06-07T16:40:58.501-04:00   time="2022-06-07T20:40:58Z" level=warning msg="deleting users" quantity=79
2022-06-07T16:40:58.501-04:00   time="2022-06-07T20:40:58Z" level=warning msg="deleting user" email=developer1@eblock.ca user="Developer 1"
2022-06-07T16:40:59.086-04:00   time="2022-06-07T20:40:59Z" level=warning msg="deleting user" email=developer2@eblock.ca user="Developer 2"
2022-06-07T16:40:59.391-04:00   time="2022-06-07T20:40:59Z" level=warning msg="deleting user" email=developer3@eblock.ca user="Developer 3"
................this keeps going forever deleting users.......................
2022-06-07T16:41:15.958-04:00   Error: cannot sync groups and their members: error syncing state: error reconciling users: error deleting users in SCIM Provider: scim: error deleting user: 90676687e4-7d26e564-2ed0-48c5-96d2-6e9d6d6bd5ab, statusCode: 409, errCode: 409 Conflict, errMsg: {"schema":["urn:ietf:params:scim:api:messages:2.0:Error"],"schemas":["urn:ietf:params:scim:api:messages:2.0:Error"],"detail":"Refused to create a new, duplicate resource.","status":"409","exceptionRequestId":"bf9a2424-d08d-4696-8f18-38b60d86bdc5","timeStamp":"2022-06-07 20:41:15.956"}
2022-06-07T16:41:15.958-04:00   Usage:
2022-06-07T16:41:15.958-04:00   idpscim [flags]
2022-06-07T16:41:15.958-04:00   Flags:
2022-06-07T16:41:15.958-04:00   -k, --aws-s3-bucket-key string AWS S3 Bucket key to store the state (default "state.json")
2022-06-07T16:41:15.958-04:00   -b, --aws-s3-bucket-name string AWS S3 Bucket name to store the state
2022-06-07T16:41:15.958-04:00   -t, --aws-scim-access-token string AWS SSO SCIM API Access Token
2022-06-07T16:41:15.958-04:00   -e, --aws-scim-endpoint string AWS SSO SCIM API Endpoint
2022-06-07T16:41:15.958-04:00   -c, --config-file string configuration file (default ".idpscim.yaml")
2022-06-07T16:41:15.958-04:00   -d, --debug fast way to set the log-level to debug
2022-06-07T16:41:15.958-04:00   -q, --gws-groups-filter strings GWS Groups query parameter, example: --gws-groups-filter 'name:Admin* email:admin*' --gws-groups-filter 'name:Power* email:power*'
2022-06-07T16:41:15.958-04:00   -s, --gws-service-account-file string path to Google Workspace service account file (default "credentials.json")
2022-06-07T16:41:15.958-04:00   -u, --gws-user-email string GWS user email with allowed access to the Google Workspace Service Account
2022-06-07T16:41:15.958-04:00   -h, --help help for idpscim
2022-06-07T16:41:15.958-04:00   -f, --log-format string set the log format (default "text")
2022-06-07T16:41:15.958-04:00   -l, --log-level string set the log level [panic|fatal|error|warn|info|debug|trace] (default "info")
2022-06-07T16:41:15.958-04:00   -m, --sync-method string Sync method to use [groups] (default "groups")
2022-06-07T16:41:15.958-04:00   -v, --version version for idpscim
2022-06-07T16:41:15.958-04:00   cannot sync groups and their members: error syncing state: error reconciling users: error deleting users in SCIM Provider: scim: error deleting user: 90676687e4-7d26e564-2ed0-48c5-96d2-6e9d6d6bd5ab, statusCode: 409, errCode: 409 Conflict, errMsg: {"schema":["urn:ietf:params:scim:api:messages:2.0:Error"],"schemas":["urn:ietf:params:scim:api:messages:2.0:Error"],"detail":"Refused to create a new, duplicate resource.","status":"409","exceptionRequestId":"bf9a2424-d08d-4696-8f18-38b60d86bdc5","timeStamp":"2022-06-07 20:41:15.956"}: withStack null
2022-06-07T16:41:15.969-04:00   END RequestId: c2408b58-71f4-41d8-9071-80d358dc3791
2022-06-07T16:41:15.969-04:00   REPORT RequestId: c2408b58-71f4-41d8-9071-80d358dc3791 Duration: 70589.47 ms Billed Duration: 70590 ms Memory Size: 2048 MB Max Memory Used: 60 MB
guilhermeblanco commented 2 years ago

Adding to the issue: even though the Google API seems to be back to normal, the tool is no longer creating the groups. The run that deleted the groups did not finish properly, so the state file was never updated (it still lists 94 groups).

Once the Google API went back to normal, the tool attempts to reconcile the groups, but since the state hash code is the same as the Google hash code, it never attempts to re-create the removed groups.
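To illustrate the failure mode described above, here is a minimal sketch, not the tool's actual code. The names `hashGroups`, `inSync` and `missingFrom` are hypothetical, and the hashing scheme is an assumption. It shows that a check comparing only the IdP against the state file can report "in sync" while the SCIM side is missing groups, and that a direct diff against the SCIM side would surface them.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
	"strings"
)

// hashGroups returns an order-independent digest of a set of group names.
// (Hypothetical helper; the real tool may hash a different representation.)
func hashGroups(groups []string) string {
	sorted := append([]string(nil), groups...) // copy so the caller's slice is not reordered
	sort.Strings(sorted)
	sum := sha256.Sum256([]byte(strings.Join(sorted, "\n")))
	return hex.EncodeToString(sum[:])
}

// inSync reports whether the IdP and the state file describe the same groups.
// This is the only comparison the scenario above relies on.
func inSync(idp, state []string) bool {
	return hashGroups(idp) == hashGroups(state)
}

// missingFrom returns the entries of want that are absent from have,
// preserving the order of want.
func missingFrom(want, have []string) []string {
	seen := make(map[string]struct{}, len(have))
	for _, g := range have {
		seen[g] = struct{}{}
	}
	var missing []string
	for _, g := range want {
		if _, ok := seen[g]; !ok {
			missing = append(missing, g)
		}
	}
	return missing
}
```

With `idp` and `state` both holding groups `a`, `b`, `c` and the SCIM side holding only `a`, `inSync(idp, state)` is true, yet `missingFrom(idp, scim)` still reports `b` and `c` as needing to be re-created.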

My only recourse was to remove the state file. The tool then imported the groups and users, but it now hits the Lambda execution time limit of 15 minutes. If we broke the state file down into groups, users and members, it would likely work.

Here is where it gets stuck for a long time...

2022-06-07T23:54:45.968-04:00   time="2022-06-08T03:54:45Z" level=info msg="getting SCIM Groups Members"
2022-06-08T00:03:50.430-04:00   time="2022-06-08T04:03:50Z" level=info msg="reconciling groups members" idp=94 scim=12408

Having thought about this problem overnight, I believe the state should be split into groups, users and one file per group_membership. I managed to get this back to an operational state at the 14-minute mark, but only by deactivating steps so that each one ran on its own (comment out code, compile, upload, execute). The group membership step alone took 14 minutes.

If we split the files as described, it would be possible to paginate and sync group members one page at a time. The tool would store one state file per group_member, could easily skip the ones already mapped, and would therefore be able to resume where it left off when an execution hits the time limit.
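The resumable sync proposed above could look roughly like the following sketch. It is not taken from the project: `Checkpoint`, `syncMembers`, `shouldStop` and `syncOne` are hypothetical names, and the in-memory map stands in for the per-group state files.

```go
package main

// Checkpoint records which groups already had their members synced.
// In the proposal above this would be persisted as one small state file per group.
type Checkpoint map[string]bool

// syncMembers syncs the members of each group in order. Groups already present
// in done are skipped, so a later run resumes where an earlier one stopped.
// shouldStop is polled before each group (for example, "is the Lambda deadline
// close?"). The groups still pending are returned so the caller can schedule
// another run.
func syncMembers(
	groups []string,
	done Checkpoint,
	shouldStop func() bool,
	syncOne func(group string) error,
) (remaining []string, err error) {
	for i, g := range groups {
		if done[g] {
			continue // already synced in a previous run
		}
		if shouldStop() {
			return pending(groups[i:], done), nil
		}
		if err := syncOne(g); err != nil {
			return pending(groups[i:], done), err
		}
		done[g] = true // checkpoint immediately so progress survives a timeout
	}
	return nil, nil
}

// pending filters out the groups that are already checkpointed.
func pending(groups []string, done Checkpoint) []string {
	var out []string
	for _, g := range groups {
		if !done[g] {
			out = append(out, g)
		}
	}
	return out
}
```

In a Lambda handler, `shouldStop` could compare the remaining time from the invocation context against a safety margin. Checkpointing after every group keeps the cost of a timeout down to one group's worth of repeated work.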

obscurerichard commented 2 years ago

If the batch of changes grows too large, a run will inevitably exceed the 15-minute Lambda execution limit. It might also be worth exploring an async architecture in which a batch of changes is chunked into messages and sent via SQS to a Lambda function that processes them in batches.
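The chunking step of such a design might be sketched as below. The `Change` type and the `chunk` and `encode` helpers are hypothetical and the message format is an assumption; the actual send to SQS is left out. One relevant constraint is that a single SQS `SendMessageBatch` call accepts at most 10 messages.

```go
package main

import "encoding/json"

// Change is a hypothetical unit of work produced by the reconciliation step.
type Change struct {
	Op   string `json:"op"`   // "create", "update" or "delete"
	Kind string `json:"kind"` // "group", "user" or "member"
	ID   string `json:"id"`
}

// chunk splits changes into consecutive batches of at most size elements.
// It returns nil when size is not positive.
func chunk(changes []Change, size int) [][]Change {
	if size <= 0 {
		return nil
	}
	var batches [][]Change
	for start := 0; start < len(changes); start += size {
		end := start + size
		if end > len(changes) {
			end = len(changes)
		}
		batches = append(batches, changes[start:end])
	}
	return batches
}

// encode turns each batch into one JSON message body, ready to be queued.
func encode(batches [][]Change) ([]string, error) {
	msgs := make([]string, 0, len(batches))
	for _, b := range batches {
		body, err := json.Marshal(b)
		if err != nil {
			return nil, err
		}
		msgs = append(msgs, string(body))
	}
	return msgs, nil
}
```

Each worker invocation would then handle one message, which keeps every execution well below the time limit regardless of how large the overall change set is.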

sonrai-doyle commented 2 years ago

I'd be very excited to see this feature come through. I've had it happen a couple times in the last 2 weeks.

m1keil commented 2 years ago

Just a drive-by comment from someone who is evaluating solutions for the lack of SCIM support between AWS and Google Workspace:

Maybe instead of deleting immediately, you can mark the group as a candidate for deletion with a due date (say, in 3 days). Remove the mark when/if the Google API gets back to its old self, and only actually delete once the mark expires.
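A minimal sketch of that mark-then-delete idea follows. The `Tombstones` type, the `Sweep` method and the `defaultGrace` constant are hypothetical names, not part of idp-scim-sync, and persisting the marks between runs is left out.

```go
package main

import "time"

// defaultGrace is the illustrative 3-day due date from the comment above.
const defaultGrace = 3 * 24 * time.Hour

// Tombstones maps a group name to the time it was first seen missing from the IdP.
type Tombstones map[string]time.Time

// Sweep compares the groups in the state file with those the IdP currently
// returns. Groups that reappeared lose their mark, newly missing groups are
// marked with now, and groups that have been missing for at least grace are
// returned as safe to delete. Marks of returned groups are kept, so the caller
// should clear them only after the deletion succeeds.
func (t Tombstones) Sweep(stateGroups, idpGroups []string, now time.Time, grace time.Duration) []string {
	present := make(map[string]struct{}, len(idpGroups))
	for _, g := range idpGroups {
		present[g] = struct{}{}
	}
	var expired []string
	for _, g := range stateGroups {
		if _, ok := present[g]; ok {
			delete(t, g) // the group is back: remove the mark
			continue
		}
		first, marked := t[g]
		if !marked {
			t[g] = now // first time we see it missing: start the clock
			continue
		}
		if now.Sub(first) >= grace {
			expired = append(expired, g)
		}
	}
	return expired
}
```

With this approach, a short window in which the Google API returns only a subset of groups merely creates marks that are cleared on the next healthy run, while a group that is truly gone is still removed after the grace period.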

laurentdelosieresfact commented 1 year ago

Hello,

We experienced the same issue with Google on May 5th. One group was deleted and recreated from scratch. However, since Terraform had not been applied again, no role was attached to the new group, so employees could not get access. Fortunately, this did not affect the people who were on call. Since we already delegate role attribution to Terraform, we would also prefer to delegate group creation/deletion to it. "--sync-method" currently accepts only one value, "groups"; could a "users" value be added too, so that we can synchronize only the users that belong to the groups specified in the configuration?

Best, L.