org-formation / org-formation-cli

Tasks will fail for new accounts #192

Open eduardomourar opened 3 years ago

eduardomourar commented 3 years ago

Whenever a new account is added through OrgFormation, the first pipeline run creates the account and assigns it to the OU successfully, but every task that targets the new account then fails. I believe AWS Organizations has not really finished its work behind the scenes, because the next build run, a few minutes later, succeeds.

OlafConijn commented 3 years ago

do you have a specific errorcode/response? i have already added a couple... the specific error you get might depend on the order of tasks in your specific project.

eduardomourar commented 3 years ago

even in verbose, here is the output:

Resource is not in the state stackCreateComplete (123456789012 = DevAccount)
ERROR: Stack billing-alarm in account 123456789012 (eu-west-1) update failed. reason: Resource is not in the state stackCreateComplete (123456789012 = DevAccount)
Resource is not in the state stackCreateComplete (use option --print-stack to print stack)

there is no stack in the target account, so there are no logs to check

caviliar commented 3 years ago

Hello, we are experiencing a similar issue, which we believe is timing related.

This is from a fresh install, with the standard roles being added from 000-organization-build/*. We are using a pipeline with a deployment account, and this failed in the build phase.

What we did: added a new account in organization.yml to an OU that already existed, then committed. The pipeline then ran and failed.

Re-triggering the pipeline makes it succeed, which makes us think it is a timing issue: either the account is still initialising or the role is not configured/available yet.

Here is the sanitised build log:

INFO: Executing: include 000-organization-build/organization-tasks.yml.
INFO: Executing: update-organization organization.yml.
OC::ORG::Account              | New-Account    | Create (1111111111111)
OC::ORG::Account              | New-Account    | CommitHash
OC::ORG::OrganizationalUnit   | FooOU                        | Attach Account (New-Account)
OC::ORG::OrganizationalUnit   | FooOU                        | CommitHash
INFO: done
INFO: Task OrganizationUpdate execute successful.
INFO: Executing: update-stacks 000-organization-build/org-formation-build.yml organization-formation-build.
INFO: Stack organization-formation-build already up to date.
INFO: Task OrganizationBuildPipeline execute successful.
INFO: Executing: update-stacks 000-organization-build/org-formation-role.yml organization-formation-role-master.
INFO: Stack organization-formation-role-master already up to date.
INFO: Task MasterOrganizationFormationRole execute successful.
INFO: Executing: update-stacks 000-organization-build/org-formation-role.yml organization-formation-role.
ERROR: Stack organization-formation-role in account 1111111111111 (ap-southeast-2) update failed. reason: User: arn:aws:sts::222222222222:assumed-role/OrganizationFormationBuildAccessRole/OrganizationFormationBuild is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::1111111111111:role/OrganizationAccountAccessRole (1111111111111 = New-Account)
User: arn:aws:sts::222222222222:assumed-role/OrganizationFormationBuildAccessRole/OrganizationFormationBuild is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::1111111111111:role/OrganizationAccountAccessRole (use option --print-stack to print stack)
ERROR: 
ERROR: ==========================
ERROR: Stopped performing task(s)
ERROR: Following tasks failed: 
ERROR:  - Stack organization-formation-role in account 1111111111111 (ap-southeast-2) (1111111111111 = New-Account)
ERROR: ==========================
ERROR: 
ERROR: Task OrganizationFormationRole execute failed. reason: Number of failed tasks 1 exceeded tolerance for failed tasks 0.

theburningmonk commented 3 years ago

I'm having the same issue as @caviliar, but it only started when I updated to the latest version, 0.9.14; rolling back to the earlier version 0.9.5 fixed the problem.

OlafConijn commented 3 years ago

great, thanks! will fix/look into this soon. also: My understanding is that retrying the build is a workaround. is this correct?

caviliar commented 3 years ago

@OlafConijn Yes, re-triggering the build is a workaround.

OlafConijn commented 3 years ago

looks like a bit of a mixed bag... i have seen all sorts of different reasons why running perform-tasks after adding an account fails, including things that have been org-formation bugs and/or AWS behaviors when creating new accounts.

  1. Not properly using DependsOn between tasks. for large organizations this is the most common cause. As tasks are run in parallel, tasks that depend on each other need to explicitly specify DependsOn. This problem can typically be worked around by retrying.

  2. AWS Account Creation being eventually consistent. i somehow have the feeling that your issue, @caviliar, fits in this bucket. I have seen this issue more in ap regions, less in us or eu regions. some services also take longer to be fully initialized (e.g. systems manager, i believe, fails more often than others).

    • waiting until services are available in ap regions (assuming they are last) could be a solution, except that this would fail in organizations where SCPs disable those regions. the same would go for checking and waiting for services that we know are slower... not a good solution.
    • another poor solution (but without negative side-effects) would be to add a configurable wait time after account creation. if this is something you run into, you would be able to add an additional 5 seconds of wait time after account creation. how does that sound?

  3. around the time this bug was posted i too had seen a somewhat weird issue where stacks failed with a status not stackCreateComplete and, after further inspection, there was no stack. this seems to have been something temporary; i have not seen it recently. have you, @eduardomourar?

dealing with these issues will continue to be a thing as it depends on AWS account creation behavior, org-formation behavior and customer configuration.
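The configurable wait time floated above could also take the form of a retry loop with backoff rather than a fixed pause. The TypeScript below is a minimal sketch under that assumption and is not org-formation's implementation: `retryWhileTransient`, `backoffDelayMs`, `isLikelyTransient`, and `errorCode` are hypothetical names, and treating the `AccessDenied` and `Throttling` codes seen in the logs in this thread as transient is an assumption that only makes sense shortly after account creation.

```typescript
// Hypothetical helpers for illustration; none of these names exist in org-formation.
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Delay before the retry that follows failed attempt `attempt` (0-based):
// baseMs * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs: number, maxMs: number): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Extract an error code from an AWS-SDK-style error object, if there is one.
function errorCode(err: unknown): string | undefined {
  if (typeof err !== "object" || err === null) return undefined;
  const e = err as { code?: unknown; name?: unknown };
  if (typeof e.code === "string") return e.code;
  if (typeof e.name === "string") return e.name;
  return undefined;
}

// Assumption: right after account creation, these codes may go away on their own.
function isLikelyTransient(err: unknown): boolean {
  const code = errorCode(err);
  return code === "AccessDenied" || code === "Throttling";
}

// Run `action`, retrying transient failures until `maxAttempts` is reached.
// Non-transient errors and the final failure are rethrown unchanged.
async function retryWhileTransient<T>(
  action: () => Promise<T>,
  opts: { maxAttempts: number; baseMs: number; maxMs: number; sleep?: Sleep },
): Promise<T> {
  const sleep = opts.sleep ?? realSleep;
  for (let attempt = 0; ; attempt++) {
    try {
      return await action();
    } catch (err) {
      const lastAttempt = attempt + 1 >= opts.maxAttempts;
      if (lastAttempt || !isLikelyTransient(err)) throw err;
      await sleep(backoffDelayMs(attempt, opts.baseMs, opts.maxMs));
    }
  }
}
```

A caller would wrap the first cross-account call for a newly created account, for example `retryWhileTransient(() => assumeRoleInNewAccount(), { maxAttempts: 5, baseMs: 1000, maxMs: 30000 })`, where `assumeRoleInNewAccount` is again a placeholder. The trade-off is that a genuinely missing role, as in the `OrganizationAccountAccessRole` case discussed further down, would only be reported after all attempts are used up, so the attempt limit should stay small.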

org-formation@0.9.15-beta.11 contains some improvements on this. soon will be released as 0.9.15. what i'll do is create something in the bug template to make sure the right things are added to the right GH issue so it will be easier to diagnose which specific issue is the underlying issue.

thanks!

dobeerman commented 3 years ago

I have quite a similar issue. After creating a new account and OU within one update, I get the following errors:

$ org-formation update organization.yml --profile admin
WARN: AccessDenied: unable to log into account 123123123123. This might have various causes, to troubleshoot:
https://github.com/OlafConijn/AwsOrganizationFormation/blob/master/docs/access-denied.md
WARN: AccessDenied: unable to log into account 456456456456. This might have various causes, to troubleshoot:
https://github.com/OlafConijn/AwsOrganizationFormation/blob/master/docs/access-denied.md
WARN: AccessDenied: unable to log into account 456456456456. This might have various causes, to troubleshoot:
https://github.com/OlafConijn/AwsOrganizationFormation/blob/master/docs/access-denied.md
WARN: AccessDenied: unable to log into account 321321321321. This might have various causes, to troubleshoot:
https://github.com/OlafConijn/AwsOrganizationFormation/blob/master/docs/access-denied.md
ERROR: failed executing task: Create OC::ORG::Account PropDomainDev1Account AccessDenied: User: arn:aws:iam::344143226674:user/Alex_S is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::344143226674:role/OrganizationAccountAccessRole
ERROR: error: AccessDenied, aws-request-id: 1717c7fe-5047-4000-8bc9-2e43195efebc
ERROR: User: arn:aws:iam::987987987987:user/Alex_S is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::987987987987:role/OrganizationAccountAccessRole

Alex_S has the FullAccess role.

P.S. WARNs are permanently in place.

OlafConijn commented 3 years ago

hi @dobeerman

i think your issue is somewhat different. looking at the logs your issue seems to be that accounts 123123123123, 456456456456, 321321321321 (etc) do not have the role OrganizationAccountAccessRole provisioned.

information about this can be found here.

this problem will indeed persist until either:

  1. this role is deployed to the accounts listed in the WARN.
  2. for these accounts an alternative roleName is specified in the account resource.

zaro0508 commented 3 years ago

I just saw this in our CI as well. Here's the PR that caused the failure, https://github.com/Sage-Bionetworks-IT/organizations-infra/pull/111/files.

That PR basically adds a new AWS account, puts it in an OU, and applies some budget tags, similar to the ofn reference project.

It failed on the first run with a similar error, "Resource is not in the state stackCreateComplete". The account was created, though. We re-ran the build, and the second time there was no error.

We have created accounts with similar PRs in the past and saw no error on the first run for those. We've only created a few accounts so far, so we can't tell how often this happens. One significant change that might have caused this issue is that we added the --max-concurrent-stacks 100 and --max-concurrent-tasks 100 options to our CI's perform-tasks command. We set those to 100 because it was taking approximately 20-30 minutes to create an account.
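To make the effect of such concurrency options concrete, here is a minimal TypeScript sketch of a bounded-concurrency runner. It is an illustration only and not how org-formation schedules stacks or tasks: `runWithLimit` is a hypothetical name, and whether the higher limit is actually related to the failure is not established in this thread.

```typescript
// Hypothetical sketch: run async tasks with at most `maxConcurrent` in flight.
// This is not org-formation's scheduler; names are made up for illustration.
async function runWithLimit<T>(
  tasks: Array<() => Promise<T>>,
  maxConcurrent: number,
): Promise<T[]> {
  if (maxConcurrent < 1) {
    throw new Error("maxConcurrent must be at least 1");
  }
  const results = new Array<T>(tasks.length);
  let next = 0; // index of the next task that has not been started yet

  // Each worker repeatedly claims the next unstarted task until none remain.
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  // Start no more workers than there are tasks.
  const workerCount = Math.min(maxConcurrent, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // results keep the order of `tasks`
}
```

With a limit of 100, up to 100 API-calling tasks can be in flight at once, which shortens the build but also raises the request rate against each AWS API; a lower limit trades build time for fewer simultaneous calls.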

Environment

ofn ver 0.9.15
node ver 15.5.1

Error Log

DEBG: Setting build action on stack sceptre-cloudformation-bucket for ***/us-east-1 to None - hash matches stored target. (*** = MasterAccount)
DEBG: Stack sceptre-cloudformation-bucket in account XXXXXXXXXXX (us-east-1) update starting... (XXXXXXXXXXX = NlpSandboxAccount)
ERROR: error updating CloudFormation stack sagebase-billing-alarm in account XXXXXXXXXXX (us-east-1). Resource is not in the state stackCreateComplete (XXXXXXXXXXX = NlpSandboxAccount)
ERROR: Stack sagebase-billing-alarm in account XXXXXXXXXXX (us-east-1) update failed. reason: Resource is not in the state stackCreateComplete (XXXXXXXXXXX = NlpSandboxAccount)
Resource is not in the state stackCreateComplete
Throttling: Resource is not in the state stackCreateComplete
    at Request.extractError (/home/runner/work/organizations-infra/organizations-infra/node_modules/aws-sdk/lib/protocol/query.js:50:29)
    at Request.callListeners (/home/runner/work/organizations-infra/organizations-infra/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
    at Request.emit (/home/runner/work/organizations-infra/organizations-infra/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
    at Request.emit (/home/runner/work/organizations-infra/organizations-infra/node_modules/aws-sdk/lib/request.js:688:14)
    at Request.transition (/home/runner/work/organizations-infra/organizations-infra/node_modules/aws-sdk/lib/request.js:22:10)
    at AcceptorStateMachine.runTo (/home/runner/work/organizations-infra/organizations-infra/node_modules/aws-sdk/lib/state_machine.js:14:12)
DEBG: putting object to S3: { "Bucket": "organization-formation-***", "Key": "state.json" } (*** = MasterAccount)
    at /home/runner/work/organizations-infra/organizations-infra/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/home/runner/work/organizations-infra/organizations-infra/node_modules/aws-sdk/lib/request.js:38:9)
    at Request.<anonymous> (/home/runner/work/organizations-infra/organizations-infra/node_modules/aws-sdk/lib/request.js:690:12)
    at Request.callListeners (/home/runner/work/organizations-infra/organizations-infra/node_modules/aws-sdk/lib/sequential_executor.js:116:18)
ERROR: 
ERROR: ==========================
ERROR: Stopped performing task(s)
ERROR: Following tasks failed: 
ERROR:  - Stack sagebase-billing-alarm in account XXXXXXXXXXX (us-east-1) (XXXXXXXXXXX = NlpSandboxAccount)
ERROR: ==========================
ERROR: 

OlafConijn commented 3 years ago

interesting @zaro0508, thanks for sharing. will look into this