sous-chefs / ark

Development repository for the ark cookbook
https://supermarket.chef.io/cookbooks/ark
Apache License 2.0

Ark doesn't handle failed Chef run correctly #96

Open webcoyote opened 9 years ago

webcoyote commented 9 years ago

While trying to fix #93 I discovered that ark can leave a server in an unconfigured state if Chef halts during convergence. Here is the abbreviated code for the "dump" provider that shows why:

  remote_file new_resource.release_file do
    ...
    notifies :run, "execute[unpack #{new_resource.release_file}]"
  end

  execute "unpack #{new_resource.release_file}" do
    ...
    notifies :run, "execute[set owner on #{new_resource.path}]"
    action :nothing
  end

  execute "set owner on #{new_resource.path}" do
    ...
    action :nothing
  end

If the Chef run halts immediately after remote_file then the next time Chef runs it will detect that the file is already downloaded and skip the execute and set owner steps.

Also, if a careless sysop manually changes the file owner, the owner won't get reset during the Chef run as it should.

So I guess the underlying question is: why use action :nothing and notify the next resource in the chain, instead of simply using the default action for each resource?
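For comparison, here is a hedged sketch of what the default-action alternative might look like, with each resource carrying its own idempotence guard so a crashed run repairs itself on the next converge. The paths, URL, and guard conditions are illustrative, not ark's actual code:

```ruby
remote_file "/tmp/app.zip" do
  source "https://example.com/app.zip"   # placeholder URL
end

execute "unpack app" do
  command "unzip -o /tmp/app.zip -d /opt/app"
  creates "/opt/app/index.html"          # guard: skip once unpacked (illustrative path)
end

execute "set owner on /opt/app" do
  command "chown -R app:app /opt/app"
  not_if  "test \"$(stat -c %U /opt/app)\" = app"  # guard: skip when already owned (GNU stat)
end
```

The trade-off is that each guard must be written by hand and must accurately capture "this step already happened", which is exactly what the notification chain was avoiding.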

burtlo commented 9 years ago

Without the action :nothing the resources would act on every run. So every time we would unpack the file, set the owner, etc...

Taking action every time is not very "test and repair" friendly.

But what you point out is absolutely correct. If there is a failure during any of the steps, the whole system is broken unless:

I am sure that I could capture the state of the resources and check whether they succeeded. If one were to fail, I imagine that I could:

Neither of them is great. The first one seems messy. The second requires querying resources and reacting to their failures. Do you have any additional thoughts?

This is similar to issue #100.

webcoyote commented 9 years ago

It turns out that Chef makes this problem really hard to fix. Building on the idea of using marker files from https://github.com/mkocher/chef_deploy/blob/master/chef/cookbooks/joy_of_cooking/libraries/marker.rb, something like this almost works:

remote_file "/tmp/web_ui.zip" do
  source 'https://dl.bintray.com/mitchellh/consul/0.5.0_web_ui.zip'
  checksum '0081d08be9c0b1172939e92af5a7cf9ba4f90e54fae24a353299503b24bb8be9'
  notifies :run, "ark_delete_marker[web_ui]", :immediately
end

ark_delete_marker "web_ui" do
  action :nothing
end

execute "unpack web_ui" do
  not_if { Chef::Recipe::marker_exists("web_ui") }
  command "unzip -o '/tmp/web_ui.zip' -d '/tmp/web_ui'"
end

ark_create_marker "web_ui"
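The marker helpers the recipe above assumes are small. A plain-Ruby sketch of how they might be implemented (names and paths are illustrative, not the linked library's actual API; a real version would namespace markers per cookbook, per problem 2 below):

```ruby
require "fileutils"
require "tmpdir"

# Directory where marker files live; illustrative location.
MARKER_DIR = File.join(Dir.tmpdir, "chef_markers")

def marker_path(name)
  File.join(MARKER_DIR, name)
end

# Create an empty marker file, making the directory if needed.
def create_marker(name)
  FileUtils.mkdir_p(MARKER_DIR)
  FileUtils.touch(marker_path(name))
end

# Remove the marker; rm_f is a no-op if it doesn't exist.
def delete_marker(name)
  FileUtils.rm_f(marker_path(name))
end

# True when the step guarded by this marker has completed.
def marker_exists?(name)
  File.exist?(marker_path(name))
end
```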

"Test and repair" friendly:

  1. remote_file ensures that we don't download data unnecessarily.
  2. unpack web_ui only runs when new data is downloaded or the >first< installation failed.

But there are problems.

  1. After the first unzip, a marker is created indicating that installation was successful. If the web_ui.zip file changes after the first installation, it will be re-downloaded; but if Chef then crashes after remote_file and before it notifies ark_delete_marker, future Chef runs will >not< unzip the new data. While this works much better than the current solution in ark, it's not perfect.
  2. Markers would need to be stored in directories based on the calling cookbook name to ensure that two different cookbooks don't use the same marker name ("web_ui" in the code above)
  3. Lots of extra boilerplate code required, which would be horrible if there are lots of resources to "mark".

The ideal solution to solve this problem would be to use a "before" notification, which has been discussed but not included in Chef. With a before notification, once remote_file has decided to download the file, but before the download completes, we'd delete the marker file so that the unpack operation would be guaranteed to run at some point in the future (whether during this Chef run or -- if a crash occurred -- during the next run).
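For what it's worth, Chef has since shipped :before notification timing (in Chef 12.6, after this comment was written). A hedged sketch of how the recipe above might use it; the marker path and resource names are illustrative:

```ruby
remote_file "/tmp/web_ui.zip" do
  source 'https://dl.bintray.com/mitchellh/consul/0.5.0_web_ui.zip'
  checksum '0081d08be9c0b1172939e92af5a7cf9ba4f90e54fae24a353299503b24bb8be9'
  # :before fires once remote_file has decided it will download, but
  # before the download happens, so a crash mid-download still leaves
  # the marker deleted and "unpack" pending on the next converge.
  notifies :delete, "file[web_ui marker]", :before
end

execute "unpack web_ui" do
  command "unzip -o '/tmp/web_ui.zip' -d '/tmp/web_ui'"
  not_if { ::File.exist?("/tmp/markers/web_ui") }
end

# Re-created at the end of every successful run; its absence means
# "unpack is still pending".
file "web_ui marker" do
  path "/tmp/markers/web_ui"
end
```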