Open j-rivero opened 5 years ago
@j-rivero what's the best way to check that the right config is loaded? I can spin up test instances with the latest state but don't have a good way to test that the started service is in the expected state.
I usually install `mesa-utils` and run `glxinfo`. If it runs normally, displaying the expected information, that tells me the right config is in place. If it displays clear error messages, bad luck.
Please note that, given the permission setup by `xhost.sh` (called by lightdm), you need to run `glxinfo` from the `jenkins-agent` account.
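A sketch of that check (the `DISPLAY=:0` value is an assumption; adjust it to the display lightdm actually starts). This only works on a provisioned GPU node with the X server already running, so it is not something a plain test box can verify:

```shell
# Run glxinfo as the jenkins-agent account (the xhost.sh permission
# setup only grants access to that user) and look for the renderer line.
sudo -u jenkins-agent env DISPLAY=:0 glxinfo | grep -i 'opengl renderer'
```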
When trying to run the code today in a new instance, the inclusion of `nvidia-xconfig` inside the xorg template is breaking the installation: the fact seems to be evaluated before the nvidia packages are installed (even though the template option requires the nvidia package):

Could not retrieve fact='gpu_device_bus_id', resolution='&lt;anonymous&gt;': Could not execute 'nvidia-xconfig --query-gpu-info | grep BusID | sed "s/.*PCI:/PCI:/g"': command not found

I found more problems using the system Puppet 3.x on Xenial. @nuclearsandwich, are we targeting Puppet 4.x or Puppet 3.x in general?
diff --git a/modules/agent_files/templates/xorg.conf.erb b/modules/agent_files/templates/xorg.conf.erb
index a53865d..8daaac3 100644
--- a/modules/agent_files/templates/xorg.conf.erb
+++ b/modules/agent_files/templates/xorg.conf.erb
@@ -37,7 +37,7 @@ Section "Device"
Driver "nvidia"
VendorName "NVIDIA Corporation"
BoardName "GRID K520"
- BusID "<%= @facts['gpu_device_bus_id'] %>"
+ BusID "<%= @gpu_device_bus_id %>"
EndSection
Section "Screen"
diff --git a/modules/facts/lib/facter/busid.rb b/modules/facts/lib/facter/busid.rb
deleted file mode 100644
index e05ca3a..0000000
--- a/modules/facts/lib/facter/busid.rb
+++ /dev/null
@@ -1,5 +0,0 @@
-Facter.add(:gpu_device_bus_id) do
- setcode do
- Facter::Core::Execution.execute('nvidia-xconfig --query-gpu-info | grep BusID | sed "s/.*PCI:/PCI:/g"')
- end
-end
diff --git a/modules/facts/lib/facter/gpu_device_bus_id.rb b/modules/facts/lib/facter/gpu_device_bus_id.rb
new file mode 100644
index 0000000..e05ca3a
--- /dev/null
+++ b/modules/facts/lib/facter/gpu_device_bus_id.rb
@@ -0,0 +1,5 @@
+Facter.add(:gpu_device_bus_id) do
+ setcode do
+ Facter::Core::Execution.execute('nvidia-xconfig --query-gpu-info | grep BusID | sed "s/.*PCI:/PCI:/g"')
+ end
+end
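For reference, the extraction pipeline inside this fact can be exercised against a sample line of output (the BusID value here is made up for illustration; real `nvidia-xconfig --query-gpu-info` output will differ):

```shell
# Feed a sample "BusID" line through the fact's pipeline; sed strips
# everything up to and including "PCI:", then re-adds the "PCI:" prefix.
echo '  BusID : PCI:0:3:0' | grep BusID | sed "s/.*PCI:/PCI:/g"
# → PCI:0:3:0
```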
> @nuclearsandwich are we targeting Puppet 4.x or Puppet 3.x in general?

Puppet 3.8, using the `--future` parser so that we're on the updated Puppet language parser.
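For context, that setting can also live in `puppet.conf` rather than being passed on the command line (a config sketch for Puppet 3.x; the deployment may use `--parser future` directly instead):

```ini
# /etc/puppet/puppet.conf (Puppet 3.x; the setting was removed in Puppet 4,
# where the future parser became the default)
[main]
parser = future
```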
> the inclusion of `nvidia-xconfig` inside the xorg template is breaking the installation: the fact seems to be evaluated before nvidia packages are installed (even if the template option requires nvidia package)
Facts are collected before Puppet runs, which means I don't think we can use a fact that comes from a command installed by Puppet.
I can think of a couple of really dirty solutions (like shelling out in the embedded Ruby template :scream:) but it may be that we need to make it a part of the hiera configuration.
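A sketch of the hiera route (key name and value are hypothetical, not the module's actual configuration): hardcode the BusID per node class in hiera and feed it to the template as a class parameter instead of computing it as a fact:

```yaml
# hieradata/agent_gpu.yaml — hypothetical key and example value;
# the real BusID would have to be read off the node (e.g. via lspci)
# ahead of time and kept in sync with the instance type.
agent_files::gpu_device_bus_id: 'PCI:0:3:0'
```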
I came back to test this PR today and everything seems to work fine, except that a `service lightdm restart` is needed after provisioning. I think the xhost.sh script is being executed (the touch command in there leaves a file in tmp) but for some reason the X server does not have the right permissions.
> service lightdm restart

This is usually a resource dependency issue, but it may be that we need to notify from the xhost.sh script directly (rather than relying solely on the `notify => Service[lightdm]` in the `/etc/lightdm/lightdm.conf` resource), especially on repeated runs, as the notification won't be triggered if the lightdm.conf file is unchanged.
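A sketch of that idea in Puppet (resource titles and file paths are assumptions based on this thread, not the module's actual code):

```puppet
# Both file resources notify the service, so a change to either one
# restarts lightdm; `require` alone orders resources but never restarts.
file { '/etc/lightdm/lightdm.conf':
  notify => Service['lightdm'],
}
file { '/etc/lightdm/xhost.sh':   # assumed path for xhost.sh
  mode   => '0755',
  notify => Service['lightdm'],
}
```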
I moved the notify clause from `lightdm.conf` to the `xhost.sh` file and made the latter depend on the former. No good result.
The only workaround I can see is to run the restart command in reconfigure.sh to restart lightdm after provisioning:
+if [[ $buildfarm_role == 'agent_gpu' ]]; then
+ service lightdm restart
+fi
The PR implements support for provisioning Nvidia GPU agents, particularly the AWS nodes. Some details:

- New role `agent_gpu` based on `agent`
- Uses lightdm's `display-setup-script` feature to run a script that sets the proper permissions
- `xorg.conf` customized to run on GRID K520 cards in headless mode (more generic configurations did not work for me during my tests)
- `nvidia-docker2`
Steps to test it on the AWS cloud:
sudo apt-get update && sudo apt-get install -y librarian-puppet git mesa-utils && git clone https://github.com/j-rivero/buildfarm_deployment -b agent_gpu && git clone https://github.com/j-rivero/buildfarm_deployment_config -b agent_gpu && sed -i '51d' buildfarm_deployment_config/reconfigure.bash && sudo mv buildfarm_* /root/ && sudo su -
cd buildfarm_deployment_config
./reconfigure.bash agent-gpu
TODO: for some reason, after running the whole Puppet code, the X server does not seem to start with the right configuration loaded. It is necessary to restart the service to get it working correctly. I've tried to define the require clause explicitly to end with the `service_lightdm_restart` resource, but no luck. Need your help here @nuclearsandwich.