hashicorp / levant

An open source templating and deployment tool for HashiCorp Nomad jobs
Mozilla Public License 2.0

deploy does not log failure events #295

Open phemmer opened 5 years ago

phemmer commented 5 years ago

Description

According to the README, levant claims:

Upon a deployment failure, Levant will inspect each allocation and log information about each event, providing useful information for debugging without the need for querying the cluster retrospectively.

However, when I run a deploy, Levant does not live up to this promise.

# levant deploy -log-level=DEBUG -var version=2.0-3845 backoffice.nomad
2019-07-23T00:01:05-04:00 |DEBU| template/render: no variable file passed, trying defaults
2019-07-23T00:01:05-04:00 |DEBU| helper/files: no default var-file found
2019-07-23T00:01:05-04:00 |INFO| helper/variable: using command line variable with key version and value 2.0-3845
2019-07-23T00:01:05-04:00 |DEBU| levant/plan: triggering Nomad plan
2019-07-23T00:01:06-04:00 |INFO| levant/deploy: triggering a deployment job_id=backoffice
2019-07-23T00:01:07-04:00 |INFO| levant/deploy: evaluation 44215704-3054-944d-e395-a8711e961780 finished successfully job_id=backoffice
2019-07-23T00:01:07-04:00 |INFO| levant/deploy: job is not configured with update stanza, consider adding to use deployments job_id=backoffice
2019-07-23T00:01:07-04:00 |DEBU| levant/job_status_checker: running job status checker for job job_id=backoffice
2019-07-23T00:01:07-04:00 |INFO| levant/job_status_checker: job has status running job_id=backoffice
2019-07-23T00:01:07-04:00 |INFO| levant/job_status_checker: task backoffice in allocation 88a06bc9-b4c7-b2db-fe19-f172dfbb2c09 now in pending state job_id=backoffice
2019-07-23T00:01:42-04:00 |INFO| levant/job_status_checker: task backoffice in allocation 88a06bc9-b4c7-b2db-fe19-f172dfbb2c09 now in dead state job_id=backoffice
2019-07-23T00:01:42-04:00 |ERRO| levant/deploy: job deployment failed job_id=backoffice

So we see that the job failed, with no indication of why. If I inspect the service, and then inspect the last failed allocation on that service, I see:

Recent Events:
Time                       Type                      Description
2019-07-23T00:02:55-04:00  Killing                   Sent interrupt
2019-07-23T00:02:55-04:00  Not Restarting            Exceeded allowed attempts 2 in interval 30m0s and mode is "fail"
2019-07-23T00:02:55-04:00  Failed Artifact Download  failed to download artifact "http://myartifactserver.com/backoffice.tar": bad response code: 404
2019-07-23T00:02:55-04:00  Downloading Artifacts     Client is downloading artifacts
2019-07-23T00:02:38-04:00  Restarting                Task restarting in 17.276746233s
2019-07-23T00:02:38-04:00  Failed Artifact Download  failed to download artifact "http://myartifactserver.com/backoffice.tar": bad response code: 404
2019-07-23T00:02:38-04:00  Downloading Artifacts     Client is downloading artifacts
2019-07-23T00:02:20-04:00  Restarting                Task restarting in 17.406498416s
2019-07-23T00:02:20-04:00  Failed Artifact Download  failed to download artifact "http://myartifactserver.com/backoffice.tar": bad response code: 404
2019-07-23T00:02:20-04:00  Downloading Artifacts     Client is downloading artifacts

So this clearly tells me the artifact download failed. But I had to go chasing down references to find it.
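
As a stopgap until Levant logs these events itself, the same information can be pulled from the Nomad HTTP API after a failed deploy. The following is a minimal sketch, not part of Levant: it assumes the agent is reachable at the default `http://127.0.0.1:4646` without ACLs, that `GET /v1/job/<job_id>/allocations` returns each allocation's `TaskStates` with their `Events`, and that the failing task is flagged `Failed`. The helper names and the sample payload (including the nanosecond timestamp) are illustrative only.

```python
import json
import urllib.request
from datetime import datetime, timezone


def fetch_allocations(job_id, addr="http://127.0.0.1:4646"):
    """Fetch the raw JSON allocation list for a job (assumes no ACL token is needed)."""
    with urllib.request.urlopen(f"{addr}/v1/job/{job_id}/allocations") as resp:
        return resp.read().decode("utf-8")


def failed_task_events(allocations_json):
    """Return (alloc_id, task, time, event_type, message) tuples for failed tasks."""
    rows = []
    for alloc in json.loads(allocations_json):
        for task, state in (alloc.get("TaskStates") or {}).items():
            if not state.get("Failed"):
                continue  # skip tasks that Nomad has not marked as failed
            for event in state.get("Events") or []:
                # Nomad reports event times as Unix nanoseconds.
                when = datetime.fromtimestamp(event["Time"] // 10**9, tz=timezone.utc)
                rows.append((
                    alloc["ID"],
                    task,
                    when.isoformat(),
                    event["Type"],
                    event.get("DisplayMessage", ""),
                ))
    return rows


# Hand-written sample shaped like the API response; values are illustrative.
SAMPLE = json.dumps([
    {
        "ID": "88a06bc9-b4c7-b2db-fe19-f172dfbb2c09",
        "TaskStates": {
            "backoffice": {
                "State": "dead",
                "Failed": True,
                "Events": [
                    {
                        "Type": "Failed Artifact Download",
                        "Time": 1563854575000000000,
                        "DisplayMessage": "failed to download artifact: bad response code: 404",
                    },
                    {
                        "Type": "Not Restarting",
                        "Time": 1563854575000000000,
                        "DisplayMessage": "Exceeded allowed attempts 2 in interval 30m0s and mode is \"fail\"",
                    },
                ],
            }
        },
    },
    {"ID": "healthy-alloc", "TaskStates": {"backoffice": {"State": "running", "Failed": False, "Events": []}}},
])

if __name__ == "__main__":
    for alloc_id, task, when, event_type, message in failed_task_events(SAMPLE):
        print(f"{when}  {alloc_id[:8]}  {task}  {event_type}: {message}")
```

Against a live cluster, `failed_task_events(fetch_allocations("backoffice"))` would replace the sample payload. This only avoids logging in to the client node; it does not change what `levant deploy` prints.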

Output of levant version:

# levant version
Levant v0.2.7
Date: 2019-03-19T08:28:42Z
Commit: 9e952d55f171e63f5c7955e826401eac91ed0b28
Branch: 0.2.7
State: 0.2.7
Summary: 9e952d55f171e63f5c7955e826401eac91ed0b28

Output of consul version:

# consul version
Consul v1.4.4
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

Output of nomad version:

# nomad version
Nomad v0.9.3 (c5e8b66c3789e4e7f9a83b4e188e9a937eea43ce)

sh-turakhia commented 4 years ago

I have a similar issue. I have to log in to the Nomad client to check the alloc errors.