hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

feature request: show why monitoring evaluation has failed in case of revert #18301

Open Kamilcuk opened 1 year ago

Kamilcuk commented 1 year ago

Proposal

`nomad run` monitors the evaluation. It repeats a lot of information and prints a lot of unrelated stuff that I have to go through when analyzing logs.

I propose to clean up the logs. Logs currently look like the following:

```
+ nomad job run streamlit.nomad.hcl
==> 2023-08-23T10:47:47Z: Monitoring evaluation "d01ca58f"
    2023-08-23T10:47:47Z: Evaluation triggered by job "streamlit-htsopportunities"
    2023-08-23T10:47:48Z: Evaluation within deployment: "037572cd"
    2023-08-23T10:47:48Z: Allocation "638e8a80" created: node "0a4e222d", group "streamlit-htsopportunities"
    2023-08-23T10:47:48Z: Evaluation status changed: "pending" -> "complete"
==> 2023-08-23T10:47:48Z: Evaluation "d01ca58f" finished with status "complete"
==> 2023-08-23T10:47:48Z: Monitoring deployment "037572cd"

2023-08-23T10:47:48Z
ID          = 037572cd
Job ID      = streamlit-htsopportunities
Job Version = 15
Status      = running
Description = Deployment is running pending automatic promotion

Deployed
Task Group                  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
streamlit-htsopportunities  true         false     1        1         1       0        0          2023-08-23T06:57:47-04:00

2023-08-23T10:52:48Z
ID          = 037572cd
Job ID      = streamlit-htsopportunities
Job Version = 15
Status      = running
Description = Deployment is running pending automatic promotion

Deployed
Task Group                  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
streamlit-htsopportunities  true         false     1        1         1       0        1          2023-08-23T06:57:47-04:00

2023-08-23T10:57:47Z
ID          = 037572cd
Job ID      = streamlit-htsopportunities
Job Version = 15
Status      = failed
Description = Failed due to progress deadline - rolling back to job version 14

Deployed
Task Group                  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
streamlit-htsopportunities  true         false     1        1         1       0        1          2023-08-23T06:57:47-04:00

2023-08-23T10:57:48Z
ID          = eb78f148
Job ID      = streamlit-htsopportunities
Job Version = 16
Status      = running
Description = Deployment is running

Deployed
Task Group                  Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
streamlit-htsopportunities  true         1        1       0        0          2023-08-23T07:07:47-04:00

2023-08-23T10:57:59Z
ID          = eb78f148
Job ID      = streamlit-htsopportunities
Job Version = 16
Status      = running
Description = Deployment is running

Deployed
Task Group                  Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
streamlit-htsopportunities  true         1        1       1        0          2023-08-23T07:07:58-04:00

2023-08-23T10:58:01Z
ID          = eb78f148
Job ID      = streamlit-htsopportunities
Job Version = 16
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group                  Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
streamlit-htsopportunities  true         1        1       1        0          2023-08-23T07:07:58-04:00
```
It would be great to print less, avoid repeating information, and most importantly print exactly _why_ the deployment has failed. Also, do not print "Deployed" if the job was not deployed but reverted. For example:

```
+ nomad job run streamlit.nomad.hcl
==> 2023-08-23T10:47:47Z: Monitoring evaluation "d01ca58f"
    2023-08-23T10:47:47Z: Evaluation triggered by job "streamlit-htsopportunities"
    2023-08-23T10:47:48Z: Evaluation within deployment: "037572cd"
    2023-08-23T10:47:48Z: Allocation "638e8a80" created: node "0a4e222d", group "streamlit-htsopportunities"
    2023-08-23T10:47:48Z: Evaluation status changed: "pending" -> "complete"
==> 2023-08-23T10:47:48Z: Evaluation "d01ca58f" finished with status "complete"
==> 2023-08-23T10:47:48Z: Monitoring deployment "037572cd"

2023-08-23T10:47:48Z
ID          = 037572cd
Job ID      = streamlit-htsopportunities
Job Version = 15
Status      = running
Description = Deployment is running pending automatic promotion

TO BE DEPLOYED
Task Group                  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
streamlit-htsopportunities  true         false     1        1         1       0        0          2023-08-23T06:57:47-04:00

2023-08-23T10:52:48Z
Task Group                  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
streamlit-htsopportunities  true         false     1        1         1       0        1          2023-08-23T06:57:47-04:00

2023-08-23T10:57:47Z
Description = Failed due to progress deadline - rolling back to job version 14

REVERTING
Task Group                  Auto Revert  Promoted  Desired  Canaries  Placed  Healthy  Unhealthy  Progress Deadline
streamlit-htsopportunities  true         false     1        1         1       0        1          2023-08-23T06:57:47-04:00

2023-08-23T10:57:48Z
Job Version = 16
Status      = running
Description = Deployment is running

REVERTED
Task Group                  Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
streamlit-htsopportunities  true         1        1       0        0          2023-08-23T07:07:47-04:00

2023-08-23T10:57:59Z
Task Group                  Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
streamlit-htsopportunities  true         1        1       1        0          2023-08-23T07:07:58-04:00

Status      = successful
Description = Deployment completed successfully

REVERTED
Task Group                  Auto Revert  Desired  Placed  Healthy  Unhealthy  Progress Deadline
streamlit-htsopportunities  true         1        1       1        0          2023-08-23T07:07:58-04:00

VISIBLE ERROR: job was reverted!
```
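
The "print only what changed" part of this proposal can be sketched roughly as below. This is a minimal Python illustration and not how Nomad's monitor is implemented: `changed_fields` is a hypothetical helper, and the snapshots are hand-copied from the status blocks of deployment "037572cd" in the current log output.

```python
from typing import Dict, Iterable, Iterator


def changed_fields(snapshots: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """For each snapshot, yield only the fields that differ from the previous one."""
    previous: Dict[str, str] = {}
    for snapshot in snapshots:
        delta = {k: v for k, v in snapshot.items() if previous.get(k) != v}
        previous = dict(snapshot)
        yield delta


# Hypothetical input: three consecutive status blocks, as dictionaries.
snapshots = [
    {"ID": "037572cd", "Job Version": "15", "Status": "running",
     "Description": "Deployment is running pending automatic promotion"},
    {"ID": "037572cd", "Job Version": "15", "Status": "running",
     "Description": "Deployment is running pending automatic promotion"},
    {"ID": "037572cd", "Job Version": "15", "Status": "failed",
     "Description": "Failed due to progress deadline - rolling back to job version 14"},
]

deltas = list(changed_fields(snapshots))
for delta in deltas:
    for key, value in delta.items():
        print(f"{key} = {value}")
```

With this input, the first block is printed in full, the second prints nothing, and the third prints only the new `Status` and `Description`.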

Side note: the message says that job was reverted to version 14, but then it started running version 16. This is confusing.

Use-cases

The use case is to simplify the analysis of logs in long log streams. The proposal is

Attempted Solutions

The solution is to read the logs very carefully.
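
Until the CLI output changes, the careful reading can be partly automated by post-processing the captured output. The sketch below is an assumption-laden workaround, not a Nomad feature: `find_failures` is a hypothetical helper, and it assumes the captured log has one `Key = Value` field per line with `Status` appearing before `Description` in each block.

```python
import re
from typing import List, Optional

# Match "Status = <word>" and "Description = <text>" lines, tolerating padding.
STATUS_RE = re.compile(r"^\s*Status\s*=\s*(\S+)\s*$")
DESCRIPTION_RE = re.compile(r"^\s*Description\s*=\s*(.+?)\s*$")


def find_failures(log_text: str) -> List[str]:
    """Return the Description of every status block whose Status is 'failed'."""
    failures: List[str] = []
    last_status: Optional[str] = None
    for line in log_text.splitlines():
        status_match = STATUS_RE.match(line)
        if status_match:
            last_status = status_match.group(1)
            continue
        description_match = DESCRIPTION_RE.match(line)
        if description_match and last_status == "failed":
            failures.append(description_match.group(1))
            last_status = None  # report each failed block once
    return failures


# Sample block taken from the log in this issue.
sample = """\
ID          = 037572cd
Job ID      = streamlit-htsopportunities
Job Version = 15
Status      = failed
Description = Failed due to progress deadline - rolling back to job version 14
"""

for reason in find_failures(sample):
    print(f"deployment failed: {reason}")
```

A CI job could run this over the saved `nomad job run` output and exit non-zero when the returned list is not empty.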

lgfa29 commented 1 year ago

Thanks for the suggestion @Kamilcuk! It would be helpful to make the failure more clear. I've placed this into our board for further roadmapping.

> Side note: the message says that job was reverted to version 14, but then it started running version 16. This is confusing.

Job versions are immutable, so "reverting to version 14" means creating a new version based on the spec of version 14. I agree that it can be confusing, but I just wanted to explain that this is the intended result.
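
As a rough illustration of that behavior, here is a toy Python model of an append-only version history. It is only a sketch under simplifying assumptions and does not reflect Nomad's actual data structures: `JobVersion`, `submit`, and `revert` are hypothetical names, and the specs are placeholder strings.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class JobVersion:
    """An immutable record: once created, a version is never modified."""
    version: int
    spec: str


def submit(history: List[JobVersion], spec: str) -> List[JobVersion]:
    """Append a new version; existing entries are left untouched."""
    next_version = history[-1].version + 1 if history else 0
    return history + [JobVersion(next_version, spec)]


def revert(history: List[JobVersion], target_version: int) -> List[JobVersion]:
    """'Reverting' copies the target's spec into a brand-new version."""
    target = next(v for v in history if v.version == target_version)
    return submit(history, target.spec)


# Build versions 0..15, then revert the failed version 15 back to version 14.
history: List[JobVersion] = []
for i in range(16):
    history = submit(history, f"spec-{i}")

reverted = revert(history, 14)
latest = reverted[-1]
print(f"now running version {latest.version} with {latest.spec}")
```

In this model, the latest entry after the revert is version 16 carrying the spec of version 14, which matches the version numbers seen in the log.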