nebari-dev / nebari-slurm

An opinionated open source deployment of JupyterHub based on a Slurm job scheduler.
BSD 3-Clause "New" or "Revised" License

[Bug] Recent deployment with libvirt on Ubuntu2004 test failed #145

Closed viniciusdc closed 7 months ago

viniciusdc commented 8 months ago

Context

TASK [slurm : Install slurm controller packages] *******************************
fatal: [hpc01-test]: FAILED! => {
  "cache_update_time": 1709042642,
  "cache_updated": false,
  "changed": false,
  "msg": "'/usr/bin/apt-get -y -o \"Dpkg::Options::=--force-confdef\" -o \"Dpkg::Options::=--force-confold\" install 'slurmdbd=19.05.5-1'' failed: E: Sub-process /usr/bin/dpkg returned an error code (1)",
  "rc": 100,
  "stderr": "E: Sub-process /usr/bin/dpkg returned an error code (1)\n",
  "stderr_lines": [
    "E: Sub-process /usr/bin/dpkg returned an error code (1)"
  ],
  "stdout": "Reading package lists...\nBuilding dependency tree...\nReading state information...\nslurmdbd is already the newest version (19.05.5-1).\n0 upgraded, 0 newly installed, 0 to remove and 41 not upgraded.\n1 not fully installed or removed.\nAfter this operation, 0 B of additional disk space will be used.\nSetting up slurmdbd (19.05.5-1) ...\r\nJob for slurmdbd.service failed because the control process exited with error code.\r\nSee \"systemctl status slurmdbd.service\" and \"journalctl -xe\" for details.\r\ninvoke-rc.d: initscript slurmdbd, action \"start\" failed.\r\n● slurmdbd.service - Slurm DBD accounting daemon\r\n     Loaded: loaded (/etc/systemd/system/slurmdbd.service; disabled; vendor preset: enabled)\r\n     Active: failed (Result: exit-code) since Tue 2024-02-27 14:26:26 UTC; 10ms ago\r\n    Process: 32608 ExecStart=/usr/sbin/slurmdbd (code=exited, status=1/FAILURE)\r\n\nFeb 27 14:26:26 hpc01-test systemd[1]: Starting Slurm DBD accounting daemon...\r\nFeb 27 14:26:26 hpc01-test slurmdbd[32608]: No slurmdbd.conf file (/etc/slurm-llnl/slurmdbd.conf)\r\nFeb 27 14:26:26 hpc01-test slurmdbd[32608]: error: slurmdbd.conf lacks DbdHost parameter, using 'localhost'\r\nFeb 27 14:26:26 hpc01-test slurmdbd[32608]: fatal: StorageType must be specified\r\nFeb 27 14:26:26 hpc01-test systemd[1]: slurmdbd.service: Control process exited, code=exited, status=1/FAILURE\r\nFeb 27 14:26:26 hpc01-test systemd[1]: slurmdbd.service: Failed with result 'exit-code'.\r\nFeb 27 14:26:26 hpc01-test systemd[1]: Failed to start Slurm DBD accounting daemon.\r\ndpkg: error processing package slurmdbd (--configure):\r\n installed slurmdbd package post-installation script subprocess returned error exit status 1\r\nErrors were encountered while processing:\r\n slurmdbd\n",
  "stdout_lines": [
    "Reading package lists...",
    "Building dependency tree...",
    "Reading state information...",
    "slurmdbd is already the newest version (19.05.5-1).",
    "0 upgraded, 0 newly installed, 0 to remove and 41 not upgraded.",
    "1 not fully installed or removed.",
    "After this operation, 0 B of additional disk space will be used.",
    "Setting up slurmdbd (19.05.5-1) ...",
    "Job for slurmdbd.service failed because the control process exited with error code.",
    "See \"systemctl status slurmdbd.service\" and \"journalctl -xe\" for details.",
    "invoke-rc.d: initscript slurmdbd, action \"start\" failed.",
    "● slurmdbd.service - Slurm DBD accounting daemon",
    "     Loaded: loaded (/etc/systemd/system/slurmdbd.service; disabled; vendor preset: enabled)",
    "     Active: failed (Result: exit-code) since Tue 2024-02-27 14:26:26 UTC; 10ms ago",
    "    Process: 32608 ExecStart=/usr/sbin/slurmdbd (code=exited, status=1/FAILURE)",
    "",
    "Feb 27 14:26:26 hpc01-test systemd[1]: Starting Slurm DBD accounting daemon...",
    "Feb 27 14:26:26 hpc01-test slurmdbd[32608]: No slurmdbd.conf file (/etc/slurm-llnl/slurmdbd.conf)",
    "Feb 27 14:26:26 hpc01-test slurmdbd[32608]: error: slurmdbd.conf lacks DbdHost parameter, using 'localhost'",
    "Feb 27 14:26:26 hpc01-test slurmdbd[32608]: fatal: StorageType must be specified",
    "Feb 27 14:26:26 hpc01-test systemd[1]: slurmdbd.service: Control process exited, code=exited, status=1/FAILURE",
    "Feb 27 14:26:26 hpc01-test systemd[1]: slurmdbd.service: Failed with result 'exit-code'.",
    "Feb 27 14:26:26 hpc01-test systemd[1]: Failed to start Slurm DBD accounting daemon.",
    "dpkg: error processing package slurmdbd (--configure):",
    " installed slurmdbd package post-installation script subprocess returned error exit status 1",
    "Errors were encountered while processing:",
    " slurmdbd"
  ]
}

Value and/or benefit

Fixing this would be great

Anything else?

No response

aktech commented 8 months ago

I think this is fine; we're not supporting anything older than Ubuntu 22.04.

viniciusdc commented 8 months ago

I found the problem. These were the main issues I encountered while trying out the deployment:

To avoid that, we just need to add an extra -f flag or a SLURM_CONF environment variable to each service to override the .conf locations.
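For illustration, here is a rough sketch of the SLURM_CONF route, written as a small Python helper that renders a systemd drop-in for each service. This is not what the repository's Ansible role does. The helper names (`render_dropin`, `dropin_location`), the `override.conf` file name, the list of services, and the `/etc/slurm/slurm.conf` path are all assumptions picked for the example. It also assumes each daemon honors SLURM_CONF when locating its configuration.

```python
from pathlib import PurePosixPath


def render_dropin(conf_path: str) -> str:
    """Return systemd drop-in text that sets SLURM_CONF for a service.

    conf_path is a hypothetical location of slurm.conf; adjust to the
    directory the deployment actually writes its config files to.
    """
    path = PurePosixPath(conf_path)
    if not path.is_absolute():
        raise ValueError(f"SLURM_CONF must be an absolute path, got {conf_path!r}")
    return f'[Service]\nEnvironment="SLURM_CONF={path}"\n'


def dropin_location(service: str) -> str:
    """Return the standard systemd drop-in path for a unit (file name is arbitrary)."""
    return f"/etc/systemd/system/{service}.service.d/override.conf"


if __name__ == "__main__":
    # Assumed set of services; only print what would be written.
    for service in ("slurmdbd", "slurmctld", "slurmd"):
        print(f"# {dropin_location(service)}")
        print(render_dropin("/etc/slurm/slurm.conf"))
```

After writing the drop-ins, `systemctl daemon-reload` followed by a restart of each service would be needed for the override to take effect.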