radius-project / radius

Radius is a cloud-native, portable application platform that makes app development easier for teams building cloud-native apps.
https://radapp.io
Apache License 2.0

Update Error Responses From Deployment Engine #6053

Open sk593 opened 1 year ago

sk593 commented 1 year ago

Overview of feature request

This issue is being opened as a result of scheduled functional test failures. When a deployment fails, the response doesn't contain enough information to debug the failure or to identify which resource is failing. The deployment engine (DE) may also be returning a response before its async calls complete, which causes deployment errors. Deployment responses should be more verbose, and the DE should fully process async responses before returning.

Acceptance criteria

  1. Relay deployment failures to the user
  2. Specify which resource corresponds to which deployment code (in details section)
  3. Investigate DE returning before async calls complete
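
To illustrate criteria 2 and 3, here is a minimal Python sketch (not the deployment engine's actual code, which is a .NET codebase). It waits for every resource job to finish before building a response, and it names each failing resource in the `details` list. The function name `deploy_all`, the job dictionary, and the `code`/`target`/`message` field names are all assumptions made for illustration; the real response schema is not shown in this thread.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional


async def deploy_all(
    jobs: Dict[str, Callable[[], Awaitable[None]]],
) -> Optional[dict]:
    """Run every resource job to completion, then report failures.

    `jobs` maps a resource ID to a coroutine function that deploys it
    (hypothetical shape). Returns None when all jobs succeed.
    """
    ids = list(jobs)
    # return_exceptions=True makes gather wait for ALL jobs, even when
    # some fail, so the response is never built from unfinished work.
    outcomes = await asyncio.gather(
        *(jobs[i]() for i in ids), return_exceptions=True
    )
    failures = [
        (i, o) for i, o in zip(ids, outcomes) if isinstance(o, BaseException)
    ]
    if not failures:
        return None
    return {
        "error": {
            "code": "DeploymentFailed",
            "message": f"{len(failures)} of {len(ids)} resources failed to deploy.",
            # One entry per failing resource: "target" ties the error
            # code back to the resource that produced it.
            "details": [
                {
                    "code": type(exc).__name__,
                    "target": resource_id,
                    "message": str(exc) or "No error message was reported.",
                }
                for resource_id, exc in failures
            ],
        }
    }
```

A caller would run this with something like `asyncio.run(deploy_all({"db": deploy_db}))`, where `deploy_db` is a hypothetical coroutine function. The point of the design is that a failure in one job neither hides the others nor lets the response go out early.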

Additional context

[Image: Screenshot 2023-08-11 at 1 10 48 PM]

AB#8952

rynowak commented 1 year ago

I suspect (based on what I know about the deployment engine codebase) that the failure is network related, and that the root cause is incorrect retry logic in the deployment engine, stemming from behavior differences in HttpClient between .NET Core and .NET Framework.

The code to check is the exception handling around outgoing HTTP requests in the deployment jobs, specifically the handling of failures where HttpClient throws an exception.

We're also missing some positive or negative diagnostics from the deployment engine, as there is no error message in the logs and no error message in the response for this specific resource job.
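
As a rough illustration of the two points above (retrying network failures and always leaving a diagnostic), here is a Python sketch. It is not the deployment engine's code. The function `send_with_retry`, its parameters, and the `TRANSIENT_ERRORS` tuple are hypothetical, and Python's built-in `ConnectionError` and `TimeoutError` stand in for whichever exception types HttpClient actually throws on each runtime, which this thread does not identify.

```python
import logging
import time

logger = logging.getLogger("deployment")

# Stand-ins (assumption) for the exception types an HTTP client raises
# on network failure; the real set has to be verified per runtime.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)


def send_with_retry(send, resource_id, attempts=3, base_delay=0.5,
                    sleep=time.sleep):
    """Call send() and retry transient failures with exponential backoff.

    Every failure is logged with the resource ID, and the final error is
    re-raised so the caller can include it in the deployment response.
    """
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except TRANSIENT_ERRORS as exc:
            logger.warning("resource %s: attempt %d/%d failed: %r",
                           resource_id, attempt, attempts, exc)
            if attempt == attempts:
                raise  # retries exhausted: surface the error
            sleep(base_delay * 2 ** (attempt - 1))
        except Exception:
            # Not known to be transient: log with traceback, do not retry.
            logger.exception("resource %s: non-retryable failure",
                             resource_id)
            raise
```

If the list of retryable exception types does not match what the HTTP client throws on a given runtime, network failures fall through to the non-retryable branch. That is one way retry logic can silently stop working after a platform change.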

shalabhms commented 1 year ago

@sk593, did we get more detailed information in the logs that helped debug the issue?

rynowak commented 1 year ago

The change to improve the logs was just merged like an hour ago.