intuit / Trapheus

This tool automates restoration of RDS database instances from snapshots into any dev, staging or production environments. It supports individual RDS Snapshot as well as cluster snapshot restore operations.
https://intuit.github.io/Trapheus/
MIT License
106 stars 53 forks source link

141 update failure notification #144

Closed WallysFerreira closed 11 months ago

WallysFerreira commented 1 year ago

Description

Change result message of email and slack notification in errors

WallysFerreira commented 1 year ago

Should I add constants with new key strings?

namitad commented 1 year ago

@WallysFerreira yes - lets move them to constants

namitad commented 1 year ago

@WallysFerreira as part of the PR description, could you also add a screenshot of the new email and slack notification post your changes?

WallysFerreira commented 1 year ago

@WallysFerreira as part of the PR description, could you also add a screenshot of the new email and slack notification post your changes?

Yes I can do that

namitad commented 1 year ago

@WallysFerreira could you check why the PR build on circleci is failing?

WallysFerreira commented 1 year ago

@WallysFerreira could you check why the PR build on circleci is failing?

Yeah I will do that

WallysFerreira commented 1 year ago

circleci check is now passing, apparently there was an unit test failing and I think that was causing the check to fail too.

I'll add the screenshots you asked for soon

namitad commented 1 year ago

@WallysFerreira some minor comments. once you share the screenshots of the final email/slack, i will go ahead and approve the PR

WallysFerreira commented 1 year ago

I'm having some issues running the application on my computer to take the screenshots

namitad commented 1 year ago

@WallysFerreira could you elaborate - what issues are you facing?

WallysFerreira commented 1 year ago

Maybe I jumped some step on the tutorial, but when it tries to deploy to AWS it returns

Error: Template file not found at /home/wf/Trapheus/deploy.yaml

namitad commented 1 year ago

@WallysFerreira do reverify your setup and a give it a try again. If it doesnt work, let me know - i can try running your changes locally to verify the output.

WallysFerreira commented 1 year ago

I've tried running on a codespace and it's still returning an error, even though there is a template.yaml and deploy.yaml on the directory. If you could run the changes locally I'd really appreciate it

FileNotFoundError: [Errno 2] No such file or directory: 'sam package --template-file template.yaml --output-template-file deploy.yaml --s3-bucket teste-741284363623 --region us-east-2'

namitad commented 1 year ago

Ok.. give me a day or two. Will try this locally and get back to you.

namitad commented 12 months ago

@WallysFerreira i ran your changes locally. but i get the following error:

{
  "errorMessage": "'Identifier'",
  "errorType": "KeyError",
  "requestId": "9c87e534-7933-4044-9295-d8a6834c23c8",
  "stackTrace": [
    "  File \"/var/task/email_function.py\", line 18, in lambda_handler\n    result[constants.DB_ID] = event['Identifier']\n"
  ]
}
Screenshot 2023-10-08 at 7 46 49 PM

can we check why identifier is not part of the input and if any changes need to be made in the template.yaml to cascade this input?

also, i faced the same issue as you did in trying to run trapheus locally - some recent changes have broken the install.py file. Could you use this version of install.py locally while i am check the fix for the same?

WallysFerreira commented 12 months ago

Yeah sure I'll look into it and use the new install.py. Thanks

namitad commented 12 months ago

@WallysFerreira i have merged the fix for install.py. can you pull the latest on your branch and try - no need to use a separate version of the file

WallysFerreira commented 12 months ago

I think the error was caused by the capital 'i' in identifier, but can't actually check because I'm getting another error when running install.py

Error: Unable to upload artifact src/common/common.zip referenced by ContentUri parameter of CommonLambdaLayer resource. Unable to locate credentials

namitad commented 12 months ago

@WallysFerreira looks like your temporary aws credentials may have expired. could you refresh and try again?

WallysFerreira commented 11 months ago

I tried with new aws credentials and still got the same error :( Could you please try running it locally?

stationeros commented 11 months ago

@namitad / @WallysFerreira there is a couple of issues in the way this is implemented. The way we are getting the identifier is a bit off. The event that gets generated is this

{'Error': 'SnapshotCreationException', 'Cause': '{"errorMessage": "Identifier:database-3 \\nAn error occurred (DBClusterNotFoundFault) when calling the CreateDBClusterSnapshot operation: DBCluster not found: database-3", "errorType": "SnapshotCreationException", "requestId": "6568ebed-be45-4962-bdc3-35bf080c4322", "stackTrace": ["  File \\"/var/task/cluster_snapshot_function.py\\", line 25, in lambda_create_cluster_snapshot\\n    util.eval_snapshot_exception(error, cluster_id, rds)\\n", "  File \\"/opt/python/utility.py\\", line 72, in eval_snapshot_exception\\n    raise custom_exceptions.SnapshotCreationException(error_message)\\n"]}'} | event is {'Error': 'SnapshotCreationException', 'Cause': '{"errorMessage": "Identifier:database-3 \\nAn error occurred (DBClusterNotFoundFault) when calling the CreateDBClusterSnapshot operation: DBCluster not found: database-3", "errorType": "SnapshotCreationException", "requestId": "6568ebed-be45-4962-bdc3-35bf080c4322", "stackTrace": [" File \\"/var/task/cluster_snapshot_function.py\\", line 25, in lambda_create_cluster_snapshot\\n util.eval_snapshot_exception(error, cluster_id, rds)\\n", " File \\"/opt/python/utility.py\\", line 72, in eval_snapshot_exception\\n raise custom_exceptions.SnapshotCreationException(error_message)\\n"]}'}

As you can see there is no direct way of getting the Identifier as a key. We have a utilty function defined that needs to be utilised https://github.com/intuit/Trapheus/blob/master/src/common/python/utility.py#L33 to get the identifier from the event.

Secondly, the way TASK_NAME is getting extracted also is a bit odd. Its not present directly in the event but its in the context object of the lambda runtime. You can simply do context.function_name to get that.

With the above changes, you guys should be able to get that out. Let me know if you need help

Screenshot 2023-10-15 at 3 32 47 PM
namitad commented 11 months ago

@WallysFerreira it would still run into a similar issue (as mentioned by @stationeros above) with the "identifier" change. we need to explicitly pass the identifier name to the state where alerts are being sent. Can we look at using a pass state https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-pass-state.html to add this to the state machine template.yaml?

This can also be used in the future if we need to add additional values - such as execution id.

namitad commented 11 months ago

@stationeros yes - taskname is being cascaded because in the context object - it would either be the email alert/slack notification lambda name. even if set from each step, it would expose the lambda function name to the user instead of the flow. we can revisit if this change needs to be done and maybe address that as a separate issue.

WallysFerreira commented 11 months ago

@stationeros that's super helpful, thanks. I'll make those changes asap.

@namitad I'll try to look into using a pass state afterwards, but it would still work without it if I make the changes @stationeros said, right? Just so I can do one thing at a time

stationeros commented 11 months ago

@WallysFerreira Thats right? You can possibly fix the instance identifier and the task name chaining can be done later. I can break this into subtasks for this to be managed better

stationeros commented 11 months ago

@WallysFerreira Can you make the code change for instance identifier , you just need to call this inside the lambda , we can then merge this commit

stationeros commented 11 months ago

@WallysFerreira Closing this PR due to inactivity, feel free to re-open and submit if you are still working on this.