ministryofjustice / modernisation-platform

A place for the core work of the Modernisation Platform • This repository is defined and managed in Terraform
https://user-guide.modernisation-platform.service.justice.gov.uk
MIT License

Some AWS Nuke workflows fail with error 255 #7623

Closed mikereiddigital closed 1 month ago

mikereiddigital commented 2 months ago

Expected Behavior

Several jobs that comprise the AWS Nuke workflow are failing with the following error:

The following jobs are failing with this error:

Note that in some cases the nuke job completes and the last resources listed as failing deletion were actually removed. For example, in https://github.com/ministryofjustice/modernisation-platform-environments/actions/runs/10235999742/job/28317541734 CloudTrail shows the resources being removed.

Note that a manually triggered run of the nuke workflow will not run the actual apply. The condition requires a scheduled workflow.
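
To illustrate the gating described above, here is a minimal sketch of that kind of check. It is not the workflow's actual condition, which is not shown in this issue. The helper name `should_apply` is hypothetical; `GITHUB_EVENT_NAME` is the environment variable GitHub Actions sets for each run, with the value `schedule` for scheduled runs and `workflow_dispatch` for manual ones.

```python
import os


def should_apply(event_name: str) -> bool:
    """Hypothetical helper: allow the destructive step only on scheduled runs."""
    return event_name == "schedule"


if __name__ == "__main__":
    # GITHUB_EVENT_NAME is provided by GitHub Actions; empty when run elsewhere.
    event = os.environ.get("GITHUB_EVENT_NAME", "")
    mode = "apply" if should_apply(event) else "dry run only"
    print(f"event={event or 'unknown'} -> {mode}")
```

With a check like this, triggering the workflow by hand is a safe way to see what would be removed, but it cannot reproduce a failure that only happens during the real deletion.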

DoD:

Actual Behavior

The scheduled run of the workflow fails for the above-mentioned accounts.

Steps to Reproduce the Problem

No response

Version

aws nuke v2.25.0

Modules

No response

Account

No response

markgov commented 2 months ago

Installed aws-nuke locally and managed to get it working. After that I checked the GitHub Actions job and found that the accounts affected by this issue keep changing. The only accounts that are consistently affected are the following:

Also found this article, which I am going to look at: https://dotjoeblog.wordpress.com/2021/03/14/github-actions-aws-error-exit-code-255/

markgov commented 2 months ago

The error for nuke on electronic-monitoring-data-development is down to a timeout while trying to delete the following resources:

  - w-b113a013d1cc9e84d Move data from buddi landing zone to data store
  - w-c5583c5a31c9b7926 Move data from g4s landing zone to data store

These are AWS Transfer workflows.

A possible workaround would be to add these to the exclude list.

markgov commented 2 months ago

The nomis failure is down to a permissions issue for AWSBackupVaultAccessPolicy:

time="2024-08-11T12:03:25Z" level=error msg="AccessDeniedException: User: arn:aws:sts::***:assumed-role/MemberInfrastructureAccess/githubactionsgotestrolesession is not authorized to perform: backup:DeleteBackupVaultAccessPolicy on resource: arn:aws:backup:eu-west-2:***:backup-vault:aws/efs/automatic-backup-vault with an explicit deny in a resource-based policy\n\tstatus code: 403, request id: f6f5c055-60b7-4633-82fb-54e2b0355029"
Error: failed
Removal requested: 0 waiting, 1 failed, 59 skipped, 45 finished

eu-west-2 - AWSBackupVaultAccessPolicy - aws/efs/automatic-backup-vault - failed
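
For triaging these runs, a small script can pull the counts out of the summary line shown in the log above. This is only a sketch: the regular expression is based solely on the `Removal requested: ...` line quoted in this comment, and the function name `parse_summary` is made up for illustration.

```python
import re
from typing import Dict, Optional

# Pattern taken from the summary line quoted in the log above.
SUMMARY_RE = re.compile(
    r"Removal requested: (\d+) waiting, (\d+) failed, (\d+) skipped, (\d+) finished"
)


def parse_summary(log_text: str) -> Optional[Dict[str, int]]:
    """Return the counts from the last summary line in the log, or None."""
    matches = list(SUMMARY_RE.finditer(log_text))
    if not matches:
        return None
    keys = ("waiting", "failed", "skipped", "finished")
    # Use the last match, since the summary may be printed more than once.
    return dict(zip(keys, (int(value) for value in matches[-1].groups())))


if __name__ == "__main__":
    sample = (
        "Error: failed\n"
        "Removal requested: 0 waiting, 1 failed, 59 skipped, 45 finished\n"
    )
    print(parse_summary(sample))
```

A non-zero `failed` count with zero `waiting` would point at a hard failure such as the permissions error above, rather than a run that was cut short.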

markgov commented 2 months ago

https://github.com/ministryofjustice/modernisation-platform/pull/7712 should hopefully remove this failure: eu-west-2 - AWSBackupVaultAccessPolicy - aws/efs/automatic-backup-vault - failed

markgov commented 2 months ago

Found a strange issue on the corp account; see the chat thread below: https://mojdt.slack.com/archives/C6D94J81E/p1724228167415849. Hopefully this might fix the issue for aws nuke.

markgov commented 2 months ago

Moving to blocked so we can see the results of the changes I have put in when the schedule runs on Monday.

markgov commented 2 months ago

New errors but in the same accounts

markgov commented 2 months ago

Correction: the nomis account is now green and working as expected. I am going to work on putting an exception in for the electronic monitoring account. The corporate staff rostering account has a strange error, so we need to wait until next week to see if the same security group pops up as not being removed.

markgov commented 2 months ago

Created a PR to add an exception for transfer to see if that fixes the electronic monitoring error: https://github.com/ministryofjustice/modernisation-platform-environments/pull/7629

markgov commented 1 month ago

The nuke job ran with only one failure this week: electronic monitoring, which was not able to destroy the following resources:

  - eu-west-2 - GlueCrawler - rds-sqlserver-database-v2022-tf - failed
  - eu-west-2 - EC2LaunchTemplate - zip_bastion_linux_template - [Name: "zip_bastion_linux_template"] - failed
  - eu-west-2 - EC2LaunchTemplate - bastion_linux_template - [Name: "bastion_linux_template"] - failed
  - eu-west-2 - GlueJob - catalog-dv-table-glue-job - failed
  - eu-west-2 - GlueJob - dms-dv-glue-job-v2 - failed
  - eu-west-2 - GlueJob - dms-dv-glue-job-v4d - failed
  - eu-west-2 - GlueJob - rds-to-s3-parquet-migration - failed
  - eu-west-2 - GlueJob - rds-to-s3-parquet-migration-monthly - failed
  - eu-west-2 - GlueJob - resizing-parquet-files - failed
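
When a run leaves this many failures, grouping them by resource type makes it easier to see where the time goes. The sketch below assumes the `region - type - name - ... - failed` shape of the entries listed above and that names never contain the ` - ` separator; the function name `failures_by_type` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def failures_by_type(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group failed resource names by resource type."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        parts = [part.strip() for part in line.split(" - ")]
        # Expect at least: region, type, name, status.
        if len(parts) < 4 or parts[-1] != "failed":
            continue
        resource_type, name = parts[1], parts[2]
        grouped[resource_type].append(name)
    return dict(grouped)


if __name__ == "__main__":
    sample = [
        "eu-west-2 - GlueCrawler - rds-sqlserver-database-v2022-tf - failed",
        'eu-west-2 - EC2LaunchTemplate - bastion_linux_template - [Name: "bastion_linux_template"] - failed',
        "eu-west-2 - GlueJob - dms-dv-glue-job-v2 - failed",
        "eu-west-2 - GlueJob - resizing-parquet-files - failed",
    ]
    for resource_type, names in failures_by_type(sample).items():
        print(f"{resource_type}: {len(names)} failed")
```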

markgov commented 1 month ago

This is just a timeout, so moving to review.

richgreen-moj commented 1 month ago

Reviewed, looks good to me. I imagine we still might get things popping up from time to time but this has removed all of the known errors we have at this point. Nice work @markgov 👍