ros-infrastructure / buildfarm_deployment

Apache License 2.0

Support for provisioning nvidia gpu agents #211

Open j-rivero opened 5 years ago

j-rivero commented 5 years ago

The PR implements support for provisioning Nvidia GPU agents, in particular the AWS nodes. Some details:

Steps to test it on the AWS cloud:

TODO: for some reason, after running the whole puppet code, the X server does not seem to have started with the right configuration loaded. It is necessary to restart the service to get it working correctly. I've tried defining the require clause explicitly so that the chain ends with the service_lightdm_restart, but no luck. Need your help here @nuclearsandwich
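Roughly, the ordering I tried looks like the sketch below (resource names and file paths are illustrative, not the exact ones in the module):

exec { 'service_lightdm_restart':
  # restart X/lightdm only when the configuration files actually change
  command     => '/usr/sbin/service lightdm restart',
  path        => ['/usr/sbin', '/usr/bin', '/sbin', '/bin'],
  refreshonly => true,
  subscribe   => [
    File['/etc/X11/xorg.conf'],
    File['/etc/lightdm/lightdm.conf'],
  ],
}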

j-rivero commented 5 years ago

@j-rivero what's the best way to check that the right config is loaded? I can spin up test instances with the latest state but don't have a good way to test that the started service is in the expected state.

I usually install mesa-utils and run glxinfo. If it runs normally and displays the expected information, that means to me that the right config is in place. If it displays clear error messages, bad luck.

j-rivero commented 5 years ago

I usually install mesa-utils and run glxinfo. If it runs normally and displays the expected information, that means to me that the right config is in place. If it displays clear error messages, bad luck.

Please note that, given the permission setup by xhost.sh (called by lightdm), you need to run glxinfo from the jenkins-agent account.

j-rivero commented 5 years ago

When trying to run the code today on a new instance, the nvidia-xconfig call inside the xorg template seems to be evaluated before the nvidia packages are installed (even though the resource that uses the template requires the nvidia package):

Could not retrieve fact='gpu_device_bus_id', resolution='<anonymous>': Could not execute 'nvidia-xconfig --query-gpu-info  | grep BusID | sed "s/.*PCI:/PCI:/g"': command not found

j-rivero commented 5 years ago

I found more problems using the system Puppet 3.x on Xenial. @nuclearsandwich are we targeting Puppet 4.x or Puppet 3.x in general?

diff --git a/modules/agent_files/templates/xorg.conf.erb b/modules/agent_files/templates/xorg.conf.erb
index a53865d..8daaac3 100644
--- a/modules/agent_files/templates/xorg.conf.erb
+++ b/modules/agent_files/templates/xorg.conf.erb
@@ -37,7 +37,7 @@ Section "Device"
     Driver         "nvidia"
     VendorName     "NVIDIA Corporation"
     BoardName      "GRID K520"
-    BusID          "<%= @facts['gpu_device_bus_id'] %>"
+    BusID          "<%= @gpu_device_bus_id %>"
 EndSection

 Section "Screen"
diff --git a/modules/facts/lib/facter/busid.rb b/modules/facts/lib/facter/busid.rb
deleted file mode 100644
index e05ca3a..0000000
--- a/modules/facts/lib/facter/busid.rb
+++ /dev/null
@@ -1,5 +0,0 @@
-Facter.add(:gpu_device_bus_id) do
-  setcode do
-    Facter::Core::Execution.execute('nvidia-xconfig --query-gpu-info  | grep BusID | sed "s/.*PCI:/PCI:/g"')
-  end
-end
diff --git a/modules/facts/lib/facter/gpu_device_bus_id.rb b/modules/facts/lib/facter/gpu_device_bus_id.rb
new file mode 100644
index 0000000..e05ca3a
--- /dev/null
+++ b/modules/facts/lib/facter/gpu_device_bus_id.rb
@@ -0,0 +1,5 @@
+Facter.add(:gpu_device_bus_id) do
+  setcode do
+    Facter::Core::Execution.execute('nvidia-xconfig --query-gpu-info  | grep BusID | sed "s/.*PCI:/PCI:/g"')
+  end
+end

nuclearsandwich commented 5 years ago

@nuclearsandwich are we targeting Puppet 4.x or Puppet 3.x in general?

Puppet 3.8 with the --future parser, so that we're using the updated Puppet language parser.

nuclearsandwich commented 5 years ago

the nvidia-xconfig call inside the xorg template seems to be evaluated before the nvidia packages are installed (even though the resource that uses the template requires the nvidia package):

Facts are collected before puppet runs, which means I don't think we can use a fact that comes from a command installed by puppet.

I can think of a couple of really dirty solutions (like shelling out in the embedded Ruby template :scream:) but it may be that we need to make it a part of the hiera configuration.
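Something along these lines, maybe (the hiera key, default BusID, and package name are just placeholders, not a concrete proposal for this repo):

# hiera data, set per host or per instance type:
#   agent_files::gpu_device_bus_id: 'PCI:0:3:0'
$gpu_device_bus_id = hiera('agent_files::gpu_device_bus_id', 'PCI:0:3:0')

file { '/etc/X11/xorg.conf':
  ensure  => file,
  content => template('agent_files/xorg.conf.erb'),
  require => Package['nvidia-driver'],  # placeholder for whatever driver package we install
  notify  => Service['lightdm'],
}

The template would then read @gpu_device_bus_id from the class scope instead of from a fact, so nothing has to run nvidia-xconfig at fact-collection time.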

j-rivero commented 5 years ago

I've come back to test this PR today and everything seems to work fine, except that a service lightdm restart is needed after provisioning. I think that the xhost.sh script is being executed (the touch command in there leaves a file in /tmp) but for some reason the X server does not have the right permissions.

nuclearsandwich commented 5 years ago

service lightdm restart

This is usually a resource dependency issue, but it may be that we need to notify from the xhost.sh script directly (rather than relying solely on the notify => Service[lightdm] in the /etc/lightdm/lightdm.conf resource), especially on repeated runs, as the notification won't be triggered if the lightdm.conf file is unchanged.
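For instance, something shaped like this (the path, source, and mode are guesses at the module layout, just to show where the notify would live):

file { '/etc/lightdm/xhost.sh':
  ensure => file,
  mode   => '0755',
  source => 'puppet:///modules/agent_files/xhost.sh',
  notify => Service['lightdm'],
}

That way lightdm gets restarted whenever xhost.sh itself changes, independently of whether lightdm.conf changed.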

j-rivero commented 5 years ago

service lightdm restart

This is usually a resource dependency issue, but it may be that we need to notify from the xhost.sh script directly (rather than relying solely on the notify => Service[lightdm] in the /etc/lightdm/lightdm.conf resource), especially on repeated runs, as the notification won't be triggered if the lightdm.conf file is unchanged.

I moved the notify clause from lightdm.conf to the xhost.sh file and made the latter depend on the former. No good result.

The only workaround I can see is to have reconfigure.sh restart lightdm after provisioning:

+if [[ $buildfarm_role == 'agent_gpu' ]]; then
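+  # Workaround: force an X restart after provisioning so the nvidia xorg.conf and xhost.sh permissions take effect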
+  service lightdm restart
+fi