lyulyul / shine-cluster

Simple High performance Infrastructure for Neural network Experiments
GNU General Public License v3.0

Add naboo to the cluster #185

Open luoyuqi-lab opened 8 months ago

luoyuqi-lab commented 8 months ago

I powered on naboo. One GPU has no power, so I disabled it on PCIe.

IP: 172.25.17.19

Slurm and this repo still need to be installed.

[screenshot]

[screenshot]

luoyuqi-lab commented 8 months ago

BTW, the hostname is now dl, not naboo as the memo on the machine says. I do not know why.

luoyuqi-lab commented 4 months ago

[screenshot] One card is unusable, so I have disabled it on PCI for now.

luoyuqi-lab commented 4 months ago

About to install Slurm and the GitHub repo, using only these seven cards.

luoyuqi-lab commented 4 months ago

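naboo is now installed, and I have tested that slurm, conda, and access to shared all work. @lyulyul, please write up the installation steps clearly, following this outline: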

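  1. Install Ubuntu.
  2. Install the NVIDIA driver, if it is missing.
  3. Install slurmd.
  4. Install munge.
  5. Install the random GPU tool.
  6. Modify the various configs, including hosts, slurm.conf, etc.
  7. Restart the services and establish the connection between aha and the compute node.
  8. Migrate users, and mount shared, conda, and the other tools.

Once that is done, this issue can be closed. Please write it in detail in case we forget in the future. Do not make it like my notes from installing hoth, which left us so confused today 😅.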

I believe LV @lyulyul and her partner have completely figured out how to add a new compute node. It took us an extra 2 hours because of my careless modification of slurm.conf, but we made it in the end. I'm very happy with the attitude of both LV and her partner. It's nice to work together and solve problems. You guys are shorter than me but smarter than me. 😊

lyulyul commented 4 months ago

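Adding naboo to the cluster

  1. The Slurm version did not match the Ubuntu version. The oldest Slurm release on the official site is slurm-20.11, and Ubuntu could not be upgraded automatically on the host, so we made a boot disk with Rufus and reinstalled the system. The installed version is ubuntu-20.04.6-live-server-amd64.
  2. After reinstalling the system, we followed https://github.com/luoyuqi-lab/shine-cluster/wiki/%E6%9C%8D%E5%8A%A1%E5%99%A8%E5%AE%89%E8%A3%85#%E6%A3%80%E6%9F%A5%E7%A1%AC%E7%9B%98%E5%88%86%E5%8C%BA and https://github.com/luoyuqi-lab/shine-cluster/issues/122 step by step.
  3. For installing munge, we referred to https://blog.csdn.net/qq_41867980/article/details/113928942

Points to note:

(1) The uid and gid of munge must be the same as on aha. If they are not: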

# usermod -u 2005 username
# groupmod -g 3000 username

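Reference: https://blog.csdn.net/train006l/article/details/79007483

(2) Once munge is installed successfully, test that aha and the new node can reach each other. If they cannot, check whether the IP in slurm.conf is correct. If they can reach each other but the new node still has not been added on aha, run sudo scontrol show partition and check whether the new node is listed. The new node's name has to be added to the partition in slurm.conf.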
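The checks in (1) and (2) can be sketched as a couple of shell helpers. This is only a minimal sketch: it assumes the account is called `munge` on both machines and that aha is reachable over ssh under the name `aha`. The helper names `local_ids` and `ids_match` are made up for illustration.

```shell
# Print "uid:gid" for a local account (defaults to the munge account).
local_ids() {
    echo "$(id -u "${1:-munge}"):$(id -g "${1:-munge}")"
}

# Succeed only when two "uid:gid" strings are identical.
ids_match() {
    [ "$1" = "$2" ]
}

# Example flow, run by hand on the new node (hostname "aha" is an assumption):
#   here=$(local_ids munge)
#   there=$(ssh aha 'echo "$(id -u munge):$(id -g munge)"')
#   ids_match "$here" "$there" || echo "munge ids differ: $here vs $there"
#   munge -n | unmunge              # credential round trip on this node
#   munge -n | ssh aha unmunge      # credential created here, decoded on aha
#   sudo scontrol show partition    # the new node should appear under Nodes=
```

If the ids differ, the usermod/groupmod step above applies. If `unmunge` on aha rejects the credential, the munge key or the clocks are the usual things to compare.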

luoyuqi-lab commented 4 months ago

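  1. Explain which configs we need to modify: slurm.conf, hosts, or more? Where are these files located?
  2. How do we install slurm, and what are the commands? Once it is installed, how do we start or restart the services?
  3. The matching uid and gid for munge presumably refers to the id of the user it runs as. Which command checks whether munge itself is running, and which command checks whether it can communicate with other servers?

I think what we need is more code, so that next time we install a new node we can copy it step by step and succeed. You can refer to what I wrote when installing hoth, https://github.com/luoyuqi-lab/shine-cluster/issues/122#issuecomment-1231416362. It has some omissions and errors, but I think adding yesterday's successful experience on top of it would be good.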
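As a starting point for the kind of copy-and-paste record asked for here, the restart step might look like the sketch below. It assumes the services run under the standard systemd unit names `munge`, `slurmd` and `slurmctld`, which should be verified on our machines. The `run` wrapper and the two function names are hypothetical; by default nothing is executed, the commands are only printed.

```shell
# Dry-run wrapper: prints each command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# On the new compute node: restart munge first, then slurmd, then check both.
restart_node_services() {
    run sudo systemctl restart munge
    run sudo systemctl restart slurmd
    run systemctl is-active munge slurmd
}

# On aha: restart slurmctld so it re-reads slurm.conf, then list the nodes.
refresh_head_node() {
    run sudo systemctl restart slurmctld
    run sinfo -N
}
```

Calling `restart_node_services` prints the three commands for review, and `DRY_RUN=0 restart_node_services` actually runs them. Keeping the dry run as the default makes it harder to restart services on the wrong machine by accident.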
lyulyul commented 4 months ago

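naboo has two Ubuntu installations, 20.04 and 18.04. We tried installing OS-Uninstaller to remove 18.04:

  1. "install OS-Uninstaller in Ubuntu": failed at this step, apparently because of insufficient permissions. [screenshot]
  2. "get a disk including OS-Uninstaller": failed, reporting no network connection.

For now Ubuntu 20.04 has the highest boot priority and is selected automatically at startup. This does not affect daily use, but 18.04 has not been removed.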

gqqnbig commented 4 months ago

[screenshot]

  1. According to 硬盘分区及挂载 (disk partitioning and mounting), a compute node is expected to have 200GB in root and 512GB in /home. Naboo doesn't have the documented partition sizes.
  2. I have colors on aha (the ${\color{aqua}aqua}$ path and the ${\color{green}green}$ arrow), but they are gone on naboo. I can't live without them.

Can you fix these?

gqqnbig commented 2 months ago

I firmly believe naboo is not production-ready. I executed the same steps, but the output differs between naboo and dagobah.

[screenshot]

gqqnbig commented 2 months ago

I can't create environments on naboo either.

$ conda create --name naboo

NotWritableError: The current user does not have write permissions to a required path.
  path: /home/qiqig/shared/.conda/envs/.conda_envs_dir_test
  uid: 1002
  gid: 1002

If you feel that permissions on this path are set incorrectly, you can manually
change them by executing

  $ sudo chown 1002:1002 /home/qiqig/shared/.conda/envs/.conda_envs_dir_test

In general, it's not advisable to use 'sudo conda'.
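One possible cause, and this is only a guess from the error text, is that my uid on naboo does not match the numeric owner of the shared directory. A minimal sketch for comparing the two nodes follows. The helper names `path_ids` and `can_write` are made up, and `stat -c` assumes GNU coreutils as shipped with Ubuntu.

```shell
# Print the numeric "owner_uid:owner_gid" of a path (GNU stat).
path_ids() {
    stat -c '%u:%g' "$1"
}

# Succeed when the current user can write to the given path.
can_write() {
    [ -w "$1" ]
}

# Example, run on both naboo and dagobah and compare the results:
#   id                                         # uid/gid of my login on this node
#   path_ids /home/qiqig/shared/.conda/envs    # numeric owner of the envs directory
#   can_write /home/qiqig/shared/.conda/envs || echo "not writable on $(hostname)"
```

If the numeric owner is the same on both nodes but `id` reports different uids, the mismatch would be in the user migration rather than in conda.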