taskcluster / community-tc-config

Configuration for Taskcluster at https://community-tc.services.mozilla.com/
Mozilla Public License 2.0

NVIDIA drivers not getting installed on proj-taskcluster/gw-windows-2022-gpu #767

Open petemoore opened 4 days ago

petemoore commented 4 days ago

See e.g. https://community-tc.services.mozilla.com/tasks/e47vn7LJQ3qMCsfk4ybnRA/runs/0/logs/public/logs/live.log

C:\Users\task_171923092126336>dir C:\ 
 Volume in drive C is Windows
 Volume Serial Number is 12BD-F374

 Directory of C:\

06/24/2024  12:08 PM    <DIR>          AzureData
06/24/2024  12:42 AM    <DIR>          cygwin
06/24/2024  12:41 AM         1,407,507 cygwin-setup-x86_64.exe
06/24/2024  12:42 AM    <DIR>          DependencyWalker
06/24/2024  12:42 AM           468,618 depends22_x64.zip
06/24/2024  12:09 PM    <DIR>          generic-worker
06/24/2024  12:40 AM        65,350,776 Git-2.44.0-64-bit.exe
06/24/2024  12:40 AM         2,152,690 git_install.log
06/24/2024  12:39 AM    <DIR>          go
06/24/2024  12:36 AM        76,244,044 go1.22.2.windows-amd64.zip
06/24/2024  12:36 AM    <DIR>          gopath
06/24/2024  12:36 AM         8,035,465 gvim80-069.exe
06/24/2024  12:34 AM             9,936 install_env.txt
06/24/2024  12:40 AM        26,554,368 NodeSetup.msi
06/24/2024  12:36 AM    <DIR>          nssm-2.24
06/24/2024  12:36 AM           351,793 nssm-2.24.zip
06/24/2024  12:34 AM    <DIR>          Packages
05/08/2021  08:20 AM    <DIR>          PerfLogs
06/24/2024  12:42 AM    <DIR>          ProcessExplorer
06/24/2024  12:42 AM         3,459,165 ProcessExplorer.zip
06/24/2024  12:42 AM    <DIR>          ProcessMonitor
06/24/2024  12:42 AM         3,013,762 ProcessMonitor.zip
06/24/2024  12:59 AM    <DIR>          Program Files
06/24/2024  12:59 AM    <DIR>          Program Files (x86)
06/24/2024  12:40 AM        26,216,840 python-3.11.9-amd64.exe
06/24/2024  12:41 AM            72,064 python-install-log.txt
06/24/2024  12:41 AM            88,422 python-install-log_000_core_AllUsers.txt
06/24/2024  12:41 AM           116,216 python-install-log_001_exe_AllUsers.txt
06/24/2024  12:41 AM           454,052 python-install-log_002_dev_AllUsers.txt
06/24/2024  12:41 AM         1,870,948 python-install-log_003_lib_AllUsers.txt
06/24/2024  12:41 AM         3,269,400 python-install-log_004_test_AllUsers.txt
06/24/2024  12:41 AM         1,229,374 python-install-log_005_doc_AllUsers.txt
06/24/2024  12:41 AM           279,836 python-install-log_006_tools_AllUsers.txt
06/24/2024  12:41 AM         3,057,268 python-install-log_007_tcltk_AllUsers.txt
06/24/2024  12:41 AM           110,000 python-install-log_008_launcher_AllUsers.txt
06/24/2024  12:41 AM           115,628 python-install-log_009_pip_AllUsers.txt
06/19/2024  01:33 AM    <DIR>          Temp
06/24/2024  12:09 PM    <DIR>          Users
06/24/2024  12:09 PM    <DIR>          Windows
06/24/2024  12:08 PM    <DIR>          WindowsAzure
06/24/2024  12:08 PM    <DIR>          worker-runner
              23 File(s)    223,928,172 bytes
              18 Dir(s)  101,699,657,728 bytes free
[taskcluster 2024-06-24T12:09:08.482Z]    Exit Code: 0
[taskcluster 2024-06-24T12:09:08.482Z]    User Time: 0s
[taskcluster 2024-06-24T12:09:08.482Z]  Kernel Time: 15.625ms
[taskcluster 2024-06-24T12:09:08.482Z]    Wall Time: 72.579ms
[taskcluster 2024-06-24T12:09:08.482Z]       Result: SUCCEEDED

The code which installs the driver is: https://github.com/taskcluster/community-tc-config/blob/f5ed4c5305e0d275cd971f441a1c046f8b9c9c35/imagesets/generic-worker-win2022/bootstrap.ps1#L220-L229

This should have downloaded the file C:\nvidia_driver.exe but that file does not appear in the directory listing of C:\ above.

petemoore commented 4 days ago

Note, this is a Standard_NV12s_v3 instance type, running in Azure, which has an NVIDIA GPU.

petemoore commented 19 hours ago

In FXCI, I believe Windows 11 is used instead of Windows Server 2022. We probably need to add some debug output to the PowerShell script linked above.
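
A minimal sketch of what such debug output could look like. It assumes the check in bootstrap.ps1 is based on Win32_VideoController (the linked lines are not reproduced here), and `Format-VideoControllerReport` is a hypothetical helper name, not something that exists in the repository. It logs what the class reports at bootstrap time and whether the driver installer was downloaded:

```powershell
# Hypothetical debug helper; not part of the existing bootstrap.ps1.
# Turns Win32_VideoController entries into one log line per controller.
function Format-VideoControllerReport {
    param([object[]]$Controllers)
    if (-not $Controllers) {
        return 'Win32_VideoController returned no entries'
    }
    $Controllers | ForEach-Object {
        "Name='{0}' PNPDeviceID='{1}' DriverVersion='{2}'" -f $_.Name, $_.PNPDeviceID, $_.DriverVersion
    }
}

# Only query CIM when actually running on Windows.
if ($env:OS -eq 'Windows_NT') {
    $controllers = @(Get-CimInstance -ClassName Win32_VideoController)
    Format-VideoControllerReport -Controllers $controllers | Write-Output
    # The installer path is the one named earlier in this issue.
    Write-Output ("C:\nvidia_driver.exe present: {0}" -f (Test-Path 'C:\nvidia_driver.exe'))
}
```

If the log shows no entry with an NVIDIA name, that would suggest the detection step is skipping the download rather than the download itself failing.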

@jwmoss Have you also tried this driver on Windows Server machines, or only on Windows Desktop editions? Maybe I'll switch our workers to Windows 11... Currently we are using MicrosoftWindowsServer:WindowsServer:2022-datacenter-azure-edition:latest as the base image...

jwmoss commented 18 hours ago

$hasNvidiaGpu will be $null because Win32_VideoController only exposes the video controller once the driver is installed. The driver is not installed on the base image, and it shouldn't be: it should only be installed on the workerPool that needs it. I suggest having a separate bootstrap for any workerPool that needs more than the standard software (including GPU drivers). In FXCI we handle this by triggering a startup script that installs the NVIDIA drivers on boot for any workerPool matching gpu.
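
As an illustration only, here is one possible way to detect the hardware before any driver is present. This is not what FXCI does and is not the separate-bootstrap approach suggested above. It assumes the GPU is listed in Win32_PnPEntity as a PCI device carrying NVIDIA's PCI vendor ID (10DE), which has not been verified on a Standard_NV12s_v3 instance. `Test-NvidiaDeviceId` is a hypothetical helper name:

```powershell
# Hypothetical helper: returns $true if any device ID carries
# NVIDIA's PCI vendor ID (10DE), regardless of whether a driver is bound.
function Test-NvidiaDeviceId {
    param([string[]]$DeviceIds)
    foreach ($id in $DeviceIds) {
        if ($id -match '^PCI\\VEN_10DE&') { return $true }
    }
    return $false
}

# Only query CIM when actually running on Windows.
if ($env:OS -eq 'Windows_NT') {
    # Win32_PnPEntity lists Plug and Play devices by their device ID.
    $ids = @(Get-CimInstance -ClassName Win32_PnPEntity | ForEach-Object { $_.PNPDeviceID })
    $hasNvidiaGpu = Test-NvidiaDeviceId -DeviceIds $ids
    Write-Output "NVIDIA PCI device present: $hasNvidiaGpu"
}
```

Note that the vendor ID match would also be true for non-display NVIDIA functions such as an HDMI audio device, so it indicates NVIDIA hardware rather than a GPU specifically.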