Closed: bodgit closed this issue 1 year ago
Using generic/centos6 and generic/centos7, I have an environment where the Puppet install is working on one box and broken on the other, even though Puppet is installed on both. I can confirm this using bolt:
$ bolt command run 'puppet --version' --targets ssh_nodes --inventory spec/fixtures/litmus_inventory.yaml
Started on 127.0.0.1:2200...
Started on 127.0.0.1:2222...
Failed on 127.0.0.1:2222:
The command failed with exit code 127
sh: puppet: command not found
Finished on 127.0.0.1:2200:
7.7.0
Successful on 1 target: 127.0.0.1:2200
Failed on 1 target: 127.0.0.1:2222
Ran on 2 targets in 0.93 sec
So clearly, $PATH doesn't include /opt/puppetlabs/puppet/bin for some reason. Grabbing the output of env with bolt:
$ bolt command run 'env' --targets ssh_nodes --inventory spec/fixtures/litmus_inventory.yaml
Started on 127.0.0.1:2200...
Started on 127.0.0.1:2222...
Finished on 127.0.0.1:2222:
TERM=unknown
SHELL=/bin/bash
HISTSIZE=100000
USER=root
SUDO_USER=vagrant
SUDO_UID=500
USERNAME=root
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAIL=/var/mail/vagrant
PWD=/root
LANG=en_GB.UTF-8
SHLVL=1
SUDO_COMMAND=/bin/sh -c cd; env
HOME=/root
LOGNAME=root
SUDO_GID=500
OLDPWD=/home/vagrant
_=/bin/env
Finished on 127.0.0.1:2200:
XDG_SESSION_ID=70
TERM=unknown
SHELL=/bin/bash
HISTSIZE=100000
USER=root
SUDO_USER=vagrant
SUDO_UID=1000
USERNAME=root
PATH=/sbin:/bin:/usr/sbin:/usr/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin:/opt/puppetlabs/puppet/bin
MAIL=/var/mail/vagrant
PWD=/root
LANG=en_GB.UTF-8
SHLVL=1
SUDO_COMMAND=/bin/sh -c cd; env
HOME=/root
LOGNAME=root
SUDO_GID=1000
OLDPWD=/home/vagrant
_=/bin/env
Successful on 2 targets: 127.0.0.1:2222,127.0.0.1:2200
Ran on 2 targets in 0.37 sec
/opt/puppetlabs/puppet/bin is absent from $PATH on the broken box, and is present many, many times on the other. Looking directly on the boxes, it looks like /etc/environment is where this is being added, and this file is not being read on some boxes. There's also a bug where $PATH is appended to multiple times; as alluded to in puppetlabs/puppet_litmus#399, this code doesn't check whether $PATH already contains /opt/puppetlabs/puppet/bin:
So $PATH just grows over time. I can see on the broken box that /etc/environment has been updated, but it's not being read for some reason. I will try and figure out why, as that seems to be the crux of the problem.
Okay, it's the Sudo configuration on the box that is causing this, so it's the same bug as puppetlabs/puppet_litmus#266. The configuration appears to have been kept as the OS default; nothing has been changed apart from dropping in a vagrant-specific snippet so there's no password prompting for that user:
Defaults !visiblepw
Defaults always_set_home
Defaults env_reset
Defaults env_keep = "COLORS DISPLAY HOSTNAME HISTSIZE INPUTRC KDEDIR LS_COLORS"
Defaults env_keep += "MAIL PS1 PS2 QTDIR USERNAME LANG LC_ADDRESS LC_CTYPE"
Defaults env_keep += "LC_COLLATE LC_IDENTIFICATION LC_MEASUREMENT LC_MESSAGES"
Defaults env_keep += "LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE"
Defaults env_keep += "LC_TIME LC_ALL LANGUAGE LINGUAS _XKB_CHARSET XAUTHORITY"
Defaults secure_path = /sbin:/bin:/usr/sbin:/usr/bin
root ALL=(ALL) ALL
vagrant ALL=(ALL) NOPASSWD: ALL
The secure_path directive needs to either be disabled or have a Defaults exempt_group += vagrant directive added, and the env_* directives likewise need to either be disabled or have a Defaults env_keep += "PATH" directive added to preserve the path. If I manually add those changes to the provisioned box, the install_agent task completes.
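For example, a small sudoers drop-in covering both changes (the file name here is just illustrative) would be:
# /etc/sudoers.d/litmus-path-fix (illustrative name)
Defaults env_keep += "PATH"
Defaults exempt_group += vagrant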
Weirdly, on the box that does work, the Sudo configuration isn't that much different:
Defaults !visiblepw
Defaults always_set_home
Defaults match_group_by_gid
Defaults always_query_group_plugin
Defaults env_reset
Defaults env_keep = "COLORS DISPLAY HOSTNAME HISTSIZE KDEDIR LS_COLORS"
Defaults env_keep += "MAIL PS1 PS2 QTDIR USERNAME LANG LC_ADDRESS LC_CTYPE"
Defaults env_keep += "LC_COLLATE LC_IDENTIFICATION LC_MEASUREMENT LC_MESSAGES"
Defaults env_keep += "LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE"
Defaults env_keep += "LC_TIME LC_ALL LANGUAGE LINGUAS _XKB_CHARSET XAUTHORITY"
Defaults secure_path = /sbin:/bin:/usr/sbin:/usr/bin
root ALL=(ALL) ALL
%wheel ALL=(ALL) ALL
vagrant ALL=(ALL) NOPASSWD: ALL
Even though the same directives are set, the path changes are picked up.
Using one SSH user to log in and then sudo to root is a perfectly legitimate workflow, so what's the fix here?
For now I've added Defaults env_keep += "PATH" and Defaults exempt_group += vagrant to the end of /etc/sudoers.d/vagrant, which preserves $PATH, although I suspect you'll still be bitten by the Sudo configuration. There are also still two separate bugs here:
After looking at some of the code, I wrote a small Rake task that uses Bolt to adjust the Sudo configuration:
# frozen_string_literal: true
require 'rake'

namespace :litmus do
  require 'puppet_litmus/inventory_manipulation'

  desc 'fix up sudo configuration on all or a specified set of targets'
  task :sudo_fix, [:target_node_name] do |_task, args|
    # Read the Litmus inventory and work out which targets to operate on.
    inventory_hash = inventory_hash_from_inventory_file
    target_nodes = find_targets(inventory_hash, args[:target_node_name])
    if target_nodes.empty?
      puts 'No targets found'
      exit 0
    end

    require 'bolt_spec/run'
    include BoltSpec::Run
    Rake::Task['spec_prep'].invoke

    # Seed /etc/environment with a sane default PATH (the agent install appends
    # to it later) and drop in a sudoers snippet that keeps PATH and exempts
    # the vagrant group from secure_path.
    bolt_result = run_command('printf "PATH=/sbin:/bin:/usr/sbin:/usr/bin\\n" >/etc/environment && printf "Defaults env_keep += \\"PATH\\"\\nDefaults exempt_group += vagrant\\n" >/etc/sudoers.d/litmus', target_nodes, config: nil, inventory: inventory_hash.clone)
    raise_bolt_errors(bolt_result, 'Fix of sudo configuration failed.')
    bolt_result
  end
end
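Once it's picked up, it can be invoked like the other Litmus Rake tasks, e.g.:
bundle exec rake 'litmus:sudo_fix[ssh_nodes]'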
I dropped it into my module as rakelib/sudo.rake so it gets automatically picked up. One problem I found on at least Debian 9 & 10 and CentOS 6 is that disabling the default Sudo path means you lose the various sbin directories, which causes some breakage, so I also add a sane default to /etc/environment, which is then later appended to.
So the flow is now litmus:provision_list[vagrant] -> litmus:sudo_fix[ssh_nodes] -> litmus:install_agent -> ... and it seems to work correctly.
Hi @bodgit - thanks for the detailed write up and the bolt task to modify the sudo config.
There is a bolt task called fix_secure_path in our provision repo, which I think achieves the same thing you are doing.
I reproduced the issue you encountered with the install_agent task hanging, then spun up a new box and ran this task before the install_agent task, and it worked successfully:
bundle exec rake 'litmus:provision[vagrant,generic/debian10]'
BOLT_GEM=true bundle exec bolt --modulepath spec/fixtures/modules task run provision::fix_secure_path --inventory spec/fixtures/litmus_inventory.yaml --targets ssh_nodes
bundle exec rake 'litmus:install_agent'
That task should be available if you've generated your module using the PDK and it set up the spec_helper Gem's Rake tasks and .fixtures.yml. Give me a shout if you're not set up this way and I can walk you through setting it up manually.
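For reference, a PDK-templated module usually pulls the provision module in as a fixture with a .fixtures.yml entry along these lines (shown only as an illustration):
fixtures:
  repositories:
    provision: 'https://github.com/puppetlabs/provision.git'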
@sanfrancrisko Ah, yes, I do have that task, I hadn't spotted it in the provision repo. Using that does seem to have the same overall effect.
Is it worth wrapping that in a Rake task, or running it within one of the other tasks that already exist, as it's quite cumbersome to type out the whole bolt command?
@bodgit Yeah, this should be more obvious - 100% agree.
I was speaking with @carabasdaniel about this - I think what I want to do is see if we can resolve the issue in the Vagrant provisioner. If we can set the root account in the inventory.yaml, it should hopefully take care of the subsequent issues with the agent installation not being on the path.
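For illustration only (this isn't the actual generated file), an ssh_nodes entry in spec/fixtures/litmus_inventory.yaml with the root user set might look something like:
groups:
  - name: ssh_nodes
    targets:
      - uri: 127.0.0.1:2222
        config:
          transport: ssh
          ssh:
            user: root
            host-key-check: false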
If I can't get that resolved, I'll wrap this task in a Rake task at the Litmus level so it's easier to access.
Moving this issue to the provision module repo, as the issue stems from the vagrant provisioner task.
Setting 'user' => 'root' over 'user' => remote_config['user'] for ssh_nodes in ./tasks/vagrant.rb should solve this issue, but we should investigate the potential knock-on effects from this.
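As a rough sketch of that change (the surrounding hash shape and variable names here are illustrative, not the real tasks/vagrant.rb code):
# Hypothetical shape of the inventory node built by the vagrant task; only the
# 'user' assignment is the point here.
node = {
  'uri'    => "#{vm_ip}:#{forwarded_port}",
  'config' => {
    'transport' => 'ssh',
    'ssh'       => {
      # 'user' => remote_config['user'],  # current: connect as the vagrant user and rely on sudo
      'user' => 'root',                   # proposed: connect directly as root
    },
  },
}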
Describe the Bug
So far I've encountered this using generic/centos6, generic/debian9 & generic/debian10 Vagrant boxes.
The install_agent task never completes on these boxes, and when you Ctrl+C the Rake process it returns this sort of error:
The annoying thing is that Puppet is installed by the task. If I vagrant ssh into any of these boxes, the correct Puppet agent is installed:
If I re-run this task it just sits there sleeping until I Ctrl+C it.
If I ignore this failure and go to the install_module task, it immediately bails out saying it can't find the Puppet agent:
Expected Behavior
It shouldn't be failing, because Puppet is installed.
Steps to Reproduce
Use the following provision.yaml:
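The original file isn't reproduced here; a minimal provision.yaml along these lines (assuming the vagrant provisioner and the boxes named above) would exercise the problem:
---
vagrant:
  provisioner: vagrant
  images:
    - generic/centos6
    - generic/debian9
    - generic/debian10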
Run one/any/all of those as part of a Litmus workflow.
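For example, a typical run that hits the problem looks something like this (assuming the vagrant group name from the provision.yaml sketch above):
bundle exec rake 'litmus:provision_list[vagrant]'
bundle exec rake 'litmus:install_agent'   # never returns on the affected boxes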
Environment
Additional Context
So far, I've found generic/centos7 & generic/centos8 do work correctly.