puppetlabs / provision

Simple tasks to provision and tear_down containers / instances and virtual machines.
Apache License 2.0

install_agent task doesn't work on some OS versions #225

Closed bodgit closed 1 year ago

bodgit commented 3 years ago

Describe the Bug

So far I've encountered this using the generic/centos6, generic/debian9 and generic/debian10 Vagrant boxes. The install_agent task never completes on these boxes, and when you Ctrl+C the Rake process it returns an error like this:

$ bundle exec rake 'litmus:install_agent'
install_agent
^Crake aborted!
Interrupt: 
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/puppet_litmus-0.27.0/lib/puppet_litmus/rake_tasks.rb:150:in `sleep'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/puppet_litmus-0.27.0/lib/puppet_litmus/rake_tasks.rb:150:in `rescue in block (3 levels) in <top (required)>'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/puppet_litmus-0.27.0/lib/puppet_litmus/rake_tasks.rb:134:in `block (3 levels) in <top (required)>'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/puppet_litmus-0.27.0/lib/puppet_litmus/rake_tasks.rb:127:in `each'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/puppet_litmus-0.27.0/lib/puppet_litmus/rake_tasks.rb:127:in `block (2 levels) in <top (required)>'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/honeycomb-beeline-2.4.1/lib/honeycomb/integrations/rake.rb:21:in `block in execute'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/honeycomb-beeline-2.4.1/lib/honeycomb/client.rb:70:in `start_span'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/honeycomb-beeline-2.4.1/lib/honeycomb/integrations/rake.rb:16:in `execute'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/rake-12.3.3/exe/rake:27:in `<top (required)>'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/bin/ruby_executable_hooks:22:in `eval'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/bin/ruby_executable_hooks:22:in `<main>'

Caused by:
Error checking puppet version on {"target":"127.0.0.1:2200","action":"command","object":"puppet --version","status":"failure","value":{"stdout":"","stderr":"sh: 1: puppet: not found\n","merged_output":"sh: 1: puppet: not found\n","exit_code":127,"_error":{"kind":"puppetlabs.tasks/command-error","issue_code":"COMMAND_ERROR","msg":"The command failed with exit code 127","details":{"exit_code":127}}}}
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/puppet_litmus-0.27.0/lib/puppet_litmus/rake_tasks.rb:137:in `block (4 levels) in <top (required)>'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/puppet_litmus-0.27.0/lib/puppet_litmus/rake_tasks.rb:136:in `each'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/puppet_litmus-0.27.0/lib/puppet_litmus/rake_tasks.rb:136:in `block (3 levels) in <top (required)>'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/puppet_litmus-0.27.0/lib/puppet_litmus/rake_tasks.rb:127:in `each'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/puppet_litmus-0.27.0/lib/puppet_litmus/rake_tasks.rb:127:in `block (2 levels) in <top (required)>'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/honeycomb-beeline-2.4.1/lib/honeycomb/integrations/rake.rb:21:in `block in execute'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/honeycomb-beeline-2.4.1/lib/honeycomb/client.rb:70:in `start_span'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/honeycomb-beeline-2.4.1/lib/honeycomb/integrations/rake.rb:16:in `execute'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/rake-12.3.3/exe/rake:27:in `<top (required)>'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/bin/ruby_executable_hooks:22:in `eval'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/bin/ruby_executable_hooks:22:in `<main>'
Tasks: TOP => litmus:install_agent
(See full trace by running task with --trace)

The annoying thing is that Puppet is installed by the task. If I vagrant ssh into any of these boxes, the correct Puppet agent is installed:

$ vagrant ssh
Last login: Mon Jun 14 16:21:07 2021 from 10.0.2.2
vagrant@debian9:~$ which puppet
/opt/puppetlabs/bin/puppet
vagrant@debian9:~$ puppet --version
7.7.0

If I re-run this task it just sits there sleeping until I Ctrl+C it.

If I ignore this failure and move on to the install_module task, it immediately bails out because it can't find the Puppet agent:

$ bundle exec rake 'litmus:install_module'
Building '/Users/matt/Documents/Puppet/bodgit-dbus' into '/Users/matt/Documents/Puppet/bodgit-dbus/pkg'
Built '/Users/matt/Documents/Puppet/bodgit-dbus/pkg/bodgit-dbus-3.0.0.tar.gz'
rake aborted!
Installation of package bodgit-dbus-3.0.0.tar.gz failed.
Results:
  127.0.0.1:2200: {"stdout"=>"", "stderr"=>"sh: 1: puppet: not found\n", "merged_output"=>"sh: 1: puppet: not found\n", "exit_code"=>127, "_error"=>{"kind"=>"puppetlabs.tasks/command-error", "issue_code"=>"COMMAND_ERROR", "msg"=>"The command failed with exit code 127", "details"=>{"exit_code"=>127}}}
  127.0.0.1:2201: {"stdout"=>"", "stderr"=>"sh: 1: puppet: not found\n", "merged_output"=>"sh: 1: puppet: not found\n", "exit_code"=>127, "_error"=>{"kind"=>"puppetlabs.tasks/command-error", "issue_code"=>"COMMAND_ERROR", "msg"=>"The command failed with exit code 127", "details"=>{"exit_code"=>127}}}}
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/puppet_litmus-0.27.0/lib/puppet_litmus/rake_helper.rb:424:in `raise_bolt_errors'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/puppet_litmus-0.27.0/lib/puppet_litmus/rake_helper.rb:325:in `block in install_module'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/honeycomb-beeline-2.4.1/lib/honeycomb/client.rb:70:in `start_span'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/puppet_litmus-0.27.0/lib/puppet_litmus/rake_helper.rb:302:in `install_module'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/puppet_litmus-0.27.0/lib/puppet_litmus/rake_tasks.rb:209:in `block (2 levels) in <top (required)>'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/honeycomb-beeline-2.4.1/lib/honeycomb/integrations/rake.rb:21:in `block in execute'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/honeycomb-beeline-2.4.1/lib/honeycomb/client.rb:70:in `start_span'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/honeycomb-beeline-2.4.1/lib/honeycomb/integrations/rake.rb:16:in `execute'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/gems/rake-12.3.3/exe/rake:27:in `<top (required)>'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/bin/ruby_executable_hooks:22:in `eval'
/Users/matt/.rvm/gems/ruby-2.7.2@puppet/bin/ruby_executable_hooks:22:in `<main>'
Tasks: TOP => litmus:install_module
(See full trace by running task with --trace)

Expected Behavior

It shouldn't fail, because the Puppet agent is installed.

Steps to Reproduce

Use the following provision.yaml:

---
vagrant:
  provisioner: vagrant
  images:
    - 'generic/centos6'
    - 'generic/debian9'
    - 'generic/debian10'

Run one/any/all of those as part of a Litmus workflow.

Environment

Additional Context

So far, I've found that generic/centos7 and generic/centos8 do work correctly.

bodgit commented 3 years ago

Using generic/centos6 and generic/centos7 I have an environment where the Puppet install is working on one box and broken on the other, even though Puppet is installed on both. I can confirm this using Bolt:

$ bolt command run 'puppet --version' --targets ssh_nodes --inventory spec/fixtures/litmus_inventory.yaml 
Started on 127.0.0.1:2200...
Started on 127.0.0.1:2222...
Failed on 127.0.0.1:2222:
  The command failed with exit code 127
  sh: puppet: command not found
Finished on 127.0.0.1:2200:
  7.7.0
Successful on 1 target: 127.0.0.1:2200
Failed on 1 target: 127.0.0.1:2222
Ran on 2 targets in 0.93 sec

So clearly, $PATH doesn't include /opt/puppetlabs/puppet/bin for some reason. Grabbing the output of env with bolt:

$ bolt command run 'env' --targets ssh_nodes --inventory spec/fixtures/litmus_inventory.yaml 
Started on 127.0.0.1:2200...
Started on 127.0.0.1:2222...
Finished on 127.0.0.1:2222:
  TERM=unknown
  SHELL=/bin/bash
  HISTSIZE=100000
  USER=root
  SUDO_USER=vagrant
  SUDO_UID=500
  USERNAME=root
  PATH=/sbin:/bin:/usr/sbin:/usr/bin
  MAIL=/var/mail/vagrant
  PWD=/root
  LANG=en_GB.UTF-8
  SHLVL=1
  SUDO_COMMAND=/bin/sh -c cd; env
  HOME=/root
  LOGNAME=root
  SUDO_GID=500
  OLDPWD=/home/vagrant
  _=/bin/env
Finished on 127.0.0.1:2200:
  XDG_SESSION_ID=70
  TERM=unknown
  SHELL=/bin/bash
  HISTSIZE=100000
  USER=root
  SUDO_USER=vagrant
  SUDO_UID=1000
  USERNAME=root
  PATH=/sbin:/bin:/usr/sbin:/usr/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin
  MAIL=/var/mail/vagrant
  PWD=/root
  LANG=en_GB.UTF-8
  SHLVL=1
  SUDO_COMMAND=/bin/sh -c cd; env
  HOME=/root
  LOGNAME=root
  SUDO_GID=1000
  OLDPWD=/home/vagrant
  _=/bin/env
Successful on 2 targets: 127.0.0.1:2222,127.0.0.1:2200
Ran on 2 targets in 0.37 sec

/opt/puppetlabs/puppet/bin is absent from $PATH on the broken box, and is present many, many times on the other. Looking directly on the boxes, it looks like /etc/environment is where this is being added, and that file is not being read on some boxes. There's also a bug where $PATH is appended to multiple times: as alluded to in puppetlabs/puppet_litmus#399, this code doesn't check whether $PATH already contains /opt/puppetlabs/puppet/bin:

https://github.com/puppetlabs/puppet_litmus/blob/0c4fb04db69bca2d0eb31f4e4bafe055b8018343/lib/puppet_litmus/rake_helper.rb#L235

So $PATH just grows over time. I can see on the broken box that /etc/environment has been updated, but it's not being read for some reason. I will try to figure out why, as that seems to be the crux of the problem.
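
For the duplicate-append part, a minimal sketch of an idempotent update to /etc/environment (a hypothetical one-liner, not the actual command Litmus runs; the base PATH shown is just the usual default):

# Only append the Puppet bin directory if it isn't already in /etc/environment
grep -qs '/opt/puppetlabs/puppet/bin' /etc/environment || \
  echo 'PATH=/sbin:/bin:/usr/sbin:/usr/bin:/opt/puppetlabs/puppet/bin' >> /etc/environment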

bodgit commented 3 years ago

Okay, it's the Sudo configuration on the box that is causing this, so it's the same bug as puppetlabs/puppet_litmus#266. The configuration appears to be the OS default; it hasn't been changed apart from dropping in a Vagrant-specific snippet so that there's no password prompting for that user:

Defaults   !visiblepw
Defaults    always_set_home
Defaults    env_reset
Defaults    env_keep =  "COLORS DISPLAY HOSTNAME HISTSIZE INPUTRC KDEDIR LS_COLORS"
Defaults    env_keep += "MAIL PS1 PS2 QTDIR USERNAME LANG LC_ADDRESS LC_CTYPE"
Defaults    env_keep += "LC_COLLATE LC_IDENTIFICATION LC_MEASUREMENT LC_MESSAGES"
Defaults    env_keep += "LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE"
Defaults    env_keep += "LC_TIME LC_ALL LANGUAGE LINGUAS _XKB_CHARSET XAUTHORITY"
Defaults    secure_path = /sbin:/bin:/usr/sbin:/usr/bin
root    ALL=(ALL)   ALL
vagrant        ALL=(ALL)       NOPASSWD: ALL

The secure_path directive needs to be either disabled or supplemented with a Defaults exempt_group += vagrant directive, and the env_* directives likewise need to be either disabled or supplemented with a Defaults env_keep += "PATH" directive to preserve the path. If I manually make those changes on the provisioned box, the install_agent task completes.
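
For example, a sudoers drop-in like the following (essentially what the Rake task in a later comment writes to /etc/sudoers.d/litmus) applies both changes:

Defaults    env_keep += "PATH"
Defaults    exempt_group += vagrant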

Weirdly, on the box that does work, the Sudo configuration isn't that much different:

Defaults   !visiblepw
Defaults    always_set_home
Defaults    match_group_by_gid
Defaults    always_query_group_plugin
Defaults    env_reset
Defaults    env_keep =  "COLORS DISPLAY HOSTNAME HISTSIZE KDEDIR LS_COLORS"
Defaults    env_keep += "MAIL PS1 PS2 QTDIR USERNAME LANG LC_ADDRESS LC_CTYPE"
Defaults    env_keep += "LC_COLLATE LC_IDENTIFICATION LC_MEASUREMENT LC_MESSAGES"
Defaults    env_keep += "LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE"
Defaults    env_keep += "LC_TIME LC_ALL LANGUAGE LINGUAS _XKB_CHARSET XAUTHORITY"
Defaults    secure_path = /sbin:/bin:/usr/sbin:/usr/bin
root    ALL=(ALL)   ALL
%wheel  ALL=(ALL)   ALL
vagrant        ALL=(ALL)       NOPASSWD: ALL

Even though the same directives are set, the path changes are picked up.

Using one SSH user to log in and then sudo to root is a perfectly legitimate workflow, so what's the fix here?

There are also still two separate bugs here:

  1. If the Puppet agent cannot be found on the system, the Rake install_agent task just hangs indefinitely (see the sketch after this list)
  2. $PATH is appended to regardless of whether it already contains the Puppet agent installation
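
For the first bug, a minimal sketch of what a bounded retry could look like (purely illustrative; targets and inventory_hash are assumed to be in scope, and the real loop in rake_tasks.rb is structured differently):

# Hypothetical bounded retry; the current loop sleeps and retries forever
# when `puppet --version` keeps failing on a target.
max_retries = 5
attempts = 0
begin
  result = run_command('puppet --version', targets, config: nil, inventory: inventory_hash)
  raise "Error checking puppet version on #{result}" if result.first['status'] != 'success'
rescue StandardError => e
  attempts += 1
  raise "Puppet agent still not usable after #{max_retries} attempts: #{e.message}" if attempts >= max_retries

  sleep 10
  retry
end
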
bodgit commented 3 years ago

After looking at some of the code, I wrote a small Rake task that uses Bolt to adjust the Sudo configuration:

# frozen_string_literal: true

require 'rake'

namespace :litmus do
  require 'puppet_litmus/inventory_manipulation'

  desc 'fix up sudo configuration on all or a specified set of targets'
  task :sudo_fix, [:target_node_name] do |_task, args|
    inventory_hash = inventory_hash_from_inventory_file
    target_nodes = find_targets(inventory_hash, args[:target_node_name])
    if target_nodes.empty?
      puts 'No targets found'
      exit 0
    end
    require 'bolt_spec/run'
    include BoltSpec::Run
    Rake::Task['spec_prep'].invoke
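    # Reset /etc/environment to a sane default PATH, then drop a sudoers snippet
    # that keeps PATH through sudo and exempts the vagrant user from secure_path.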
    bolt_result = run_command('printf "PATH=/sbin:/bin:/usr/sbin:/usr/bin\\n" >/etc/environment && printf "Defaults env_keep += \\"PATH\\"\\nDefaults exempt_group += vagrant\\n" >/etc/sudoers.d/litmus', target_nodes, config: nil, inventory: inventory_hash.clone)
    raise_bolt_errors(bolt_result, 'Fix of sudo configuration failed.')
    bolt_result
  end
end

I dropped it into my module as rakelib/sudo.rake so it gets picked up automatically. One problem I found on at least Debian 9 & 10 and CentOS 6 is that disabling the default Sudo path means you lose the various sbin directories, which causes some breakage, so I also write a sane default to /etc/environment, which is then appended to later.

So the flow is now litmus:provision_list[vagrant] -> litmus:sudo_fix[ssh_nodes] -> litmus:install_agent -> ... and it seems to work correctly.
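
Concretely, that flow is just the following commands (the sudo_fix task is the one defined above; adjust the target names to suit your setup):

bundle exec rake 'litmus:provision_list[vagrant]'
bundle exec rake 'litmus:sudo_fix[ssh_nodes]'
bundle exec rake 'litmus:install_agent'
bundle exec rake 'litmus:install_module'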

sanfrancrisko commented 3 years ago

Hi @bodgit - thanks for the detailed write up and the bolt task to modify the sudo config.

There is a Bolt task called fix_secure_path in our provision repo which I think achieves the same thing you are doing.

I reproduced the issue you encountered with the install_agent task hanging, then spun up a new box and ran this task before the install_agent task and it worked successfully:

bundle exec rake 'litmus:provision[vagrant,generic/debian10]'
BOLT_GEM=true bundle exec bolt --modulepath spec/fixtures/modules task run provision::fix_secure_path --inventory spec/fixtures/litmus_inventory.yaml --targets ssh_nodes
bundle exec rake 'litmus:install_agent'

That task should be available if you've generated your module using the PDK and it set up the spec_helper Gem's Rake tasks and .fixtures.yml. Give me a shout if you're not set up this way and I can walk you through setting up manually.
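
For reference, the .fixtures.yml entry that makes the provision module (and its tasks) available under spec/fixtures/modules typically looks something like this (the URL is shown as an assumption; check the fixtures file the PDK generated for you):

---
fixtures:
  repositories:
    provision: 'https://github.com/puppetlabs/provision.git'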

bodgit commented 3 years ago

@sanfrancrisko Ah, yes, I do have that task, I hadn't spotted it in the provision repo. Using that does seem to have the same overall effect.

Is it worth wrapping that in a Rake task, or running it within one of the existing tasks? It's quite cumbersome to type out that whole bolt command.

sanfrancrisko commented 3 years ago

@bodgit Yeah, this should be more obvious - 100% agree.

I was speaking with @carabasdaniel about this - I think what I want to do is see if we can resolve the issue in the Vagrant provisioner. If we can set the root account in the inventory.yaml, it should hopefully take care of the subsequent issues with the agent installation not being on the PATH.

If I can't get that resolved, I'll wrap this task in a Rake task at the Litmus level so it's easier to access

jordanbreen28 commented 1 year ago

Moving this issue to the provision module repo, as the issue stems from the vagrant provisioner task. Setting 'user' => 'root' instead of 'user' => remote_config['user'] for ssh_nodes in ./tasks/vagrant.rb should solve this issue, but we should investigate the potential knock-on effects of this change.
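
A minimal sketch of the suggested change, assuming the ssh_nodes entry in tasks/vagrant.rb is built from a hash roughly like this (key names other than 'user' are illustrative, not taken from the actual task):

# Hypothetical shape of the generated inventory node; logging in as root
# directly avoids the sudo secure_path problem described above.
node = {
  'uri'    => "#{remote_config['hostname']}:#{remote_config['port']}",
  'config' => {
    'transport' => 'ssh',
    'ssh'       => {
      'user'           => 'root',   # instead of remote_config['user']
      'port'           => remote_config['port'].to_i,
      'host-key-check' => false,
    },
  },
}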