Is there an example for multi-node distributed jobs?

PKUFlyingPig commented 2 months ago

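Nice work! Is there a multi-node example, or a script for reproducing the experiments in the paper? From the code, it looks like the machines have to be launched with Slurm; is that right? For example, if I want to reproduce the 70B+70B end2end experiment from the paper, could you share the steps and any suggestions? Thanks!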

garrett4wade commented 2 months ago

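Hi, thanks for your interest in our work!

Yesterday we added an example that launches distributed jobs with Ray. You can clone the code on main and then check the new docs.

In short, after setting up a Ray cluster by following the Ray tutorial, set mode to ray in the scripts under the examples folder and they should run.

After killing an experiment, remember to restart Ray on every node. Otherwise the Ray processes go stale and the next experiment will hang at launch.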
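A minimal sketch of that restart step, assuming the standard Ray CLI (ray stop, ray start). The head address 10.0.0.1, the port 6379, and the helper name ray_restart_commands are placeholders for illustration; the script only prints the commands to run on each node and does not execute them.

```python
import shlex
from typing import List


def ray_restart_commands(head_ip: str, port: int = 6379, is_head: bool = False) -> List[str]:
    """Return the shell commands that restart Ray on a single node.

    head_ip and port are placeholders; use the address of your own head node.
    """
    # Stop any stale Ray processes left over from the killed experiment.
    stop = "ray stop --force"
    if is_head:
        # The head node starts a fresh cluster.
        start = f"ray start --head --port={port}"
    else:
        # Every other node rejoins the cluster through the head address.
        start = f"ray start --address={shlex.quote(head_ip)}:{port}"
    return [stop, start]


if __name__ == "__main__":
    print("# on the head node")
    print("\n".join(ray_restart_commands("10.0.0.1", is_head=True)))
    print("# on each of the other nodes")
    print("\n".join(ray_restart_commands("10.0.0.1")))
```

Restart the head node first so that the other nodes have a cluster to join.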

PKUFlyingPig commented 1 month ago

Hi, I followed the PPO example in the quickstart docs and ran the following command:

python3 -m realhf.apps.quickstart ppo \
    experiment_name=quickstart-ppo \
    trial_name=release \
    exp_ctrl.total_train_epochs=1 \
    exp_ctrl.save_freq_steps=null \
    allocation_mode=heuristic \
    actor.type._class=llama \
    actor.path=meta-llama/Llama-2-7b \
    critic.type._class=llama \
    critic.type.is_critic=True \
    critic.path=meta-llama/Llama-2-7b \
    critic.gradient_checkpointing=True \
    ref.type._class=llama \
    ref.path=meta-llama/Llama-2-7b \
    rew.type._class=llama \
    rew.type.is_critic=True \
    rew.path=meta-llama/Llama-2-7b \
    dataset.path=ppo_dataset.jsonl \
    dataset.max_prompt_len=256 \
    dataset.train_bs_n_seqs=128 \
    ppo.gen.max_new_tokens=256 \
    ppo.gen.min_new_tokens=256 \
    ppo.ppo_n_minibatches=4 \
    ppo.kl_ctl=0.1 \
    ppo.value_eps_clip=0.2 \
    ppo.reward_output_scaling=10.0 \
    ppo.adv_norm=True ppo.value_norm=True

But I got the following error:

20240717-14:38:10.649 quickstart INFO: Running ppo experiment.
20240717-14:38:10.650 quickstart INFO: Logs will be dumped to /home/zhongyinmin/.cache/realhf/logs/zhongyinmin/quickstart-ppo/release
20240717-14:38:10.650 quickstart INFO: Model checkpoints will be saved to /home/zhongyinmin/.cache/realhf/checkpoints/zhongyinmin/quickstart-ppo/release
20240717-14:38:10.651 quickstart WARNING: Slurm is not available. Using local mode.
20240717-14:38:10.654 main WARNING: Environment variable CLUSTER_SPEC_PATH is not set. Files of the experiment (logs, checkpoints, cache ...) will be saved to temporary directory of the system. To change the fileroot, set the fileroot option of your choice in your CLUSTER_SPEC_PATH.
20240717-14:38:10.670 main INFO: Resetting name resolving repo...
20240717-14:38:10.670 name-resolve INFO: No such name resolve path: /home/zhongyinmin/.cache/realhf/name_resolve/zhongyinmin/quickstart-ppo/release
20240717-14:38:10.670 main INFO: Resetting name resolving repo... Done.
20240717-14:38:10.670 main INFO: Running configuration: PPOConfig
20240717-14:38:10.673 Local Scheduler INFO: Waiting for 10 local running processes, pids: 1576 1577 1578 1579 1581 1583 1586 1587 1591 1594
/usr/bin/python3: Error while finding module specification for 'realhf.apps.remote' (ModuleNotFoundError: No module named 'realhf')
/usr/bin/python3: Error while finding module specification for 'realhf.apps.remote' (ModuleNotFoundError: No module named 'realhf')
/usr/bin/python3: Error while finding module specification for 'realhf.apps.remote' (ModuleNotFoundError: No module named 'realhf')
/usr/bin/python3: Error while finding module specification for 'realhf.apps.remote' (ModuleNotFoundError: No module named 'realhf')
/usr/bin/python3: Error while finding module specification for 'realhf.apps.remote' (ModuleNotFoundError: No module named 'realhf')
/usr/bin/python3: Error while finding module specification for 'realhf.apps.remote' (ModuleNotFoundError: No module named 'realhf')
/usr/bin/python3: Error while finding module specification for 'realhf.apps.remote' (ModuleNotFoundError: No module named 'realhf')
/usr/bin/python3: Error while finding module specification for 'realhf.apps.remote' (ModuleNotFoundError: No module named 'realhf')
/usr/bin/python3: Error while finding module specification for 'realhf.apps.remote' (ModuleNotFoundError: No module named 'realhf')
/usr/bin/python3: Error while finding module specification for 'realhf.apps.remote' (ModuleNotFoundError: No module named 'realhf')
20240717-14:38:12.678 Local Scheduler INFO: Stopping local process with signal SIGTERM, pid: [1576]
20240717-14:38:12.679 Local Scheduler INFO: Stopping local process with signal SIGTERM, pid: [1577, 1578, 1579, 1581, 1583, 1586, 1587, 1591]
20240717-14:38:12.682 Local Scheduler INFO: Stopping local process with signal SIGTERM, pid: [1594]
20240717-14:38:12.683 Local Scheduler INFO: Waiting for 0 local running processes, pids: 
20240717-14:38:12.683 quickstart WARNING: Exception occurred. Stopping all workers.

I'm on a single node with 8 H100 GPUs and have installed realhf; import realhf succeeds in the Python interpreter. What could be causing this?

PKUFlyingPig commented 1 month ago

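Also, I'd like to study the architecture of your code in more detail. The codebase is currently fairly large and complex, and the backend seems to include both Megatron-LM and DeepSpeed. Distributed jobs can be launched with Ray, but I could not find any Ray actor code in the codebase; how exactly are resources allocated through Ray? Could you provide a concise, high-level description of the code architecture so that I can study your work more effectively? I saw that the docs have a section for this, but it appears to be unwritten so far. Thanks again for open-sourcing such great work.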

garrett4wade commented 1 month ago

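> I'm on a single node with 8 H100 GPUs and have installed realhf; import realhf succeeds in the Python interpreter. What could be causing this?

I think the likely cause is that the Python executable that has realhf installed is not /usr/bin/python3. You can check with: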

which python
which python3
python3 -c "import realhf"
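The same check can be scripted. This is a small sketch using only the standard library; the helper name can_import is made up for illustration. It asks each interpreter to import the module in a subprocess, which is how the worker processes in the log above were failing.

```python
import subprocess
import sys


def can_import(interpreter: str, module: str) -> bool:
    """Return True if `interpreter -c "import module"` exits with status 0."""
    try:
        result = subprocess.run(
            [interpreter, "-c", f"import {module}"],
            capture_output=True,
            timeout=60,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The interpreter is missing, not executable, or hung.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Compare the interpreter running this script with the one from the error log.
    for exe in dict.fromkeys([sys.executable, "/usr/bin/python3"]):
        status = "ok" if can_import(exe, "realhf") else "cannot import realhf"
        print(f"{exe}: {status}")
```

If the two interpreters disagree, installing realhf into the environment that /usr/bin/python3 uses, or launching with the interpreter that already has it, should resolve the ModuleNotFoundError.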

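> How exactly are resources allocated through Ray?

ReaL assumes it runs inside a Ray cluster and allocates resources to each worker according to the worker scheduling config. ReaL requests #GPU model workers and one master worker. Each worker is a Ray actor; once the resource request is submitted to Ray, Ray allocates the resources and launches the worker processes.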
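As a rough illustration of that layout (not ReaL's actual scheduling code), the sketch below builds the list of resource requests for a given GPU count. WorkerSpec, plan_workers, the worker names, the CPU count, and the assumption that the master worker needs no GPU are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WorkerSpec:
    """Hypothetical resource request for one worker (one Ray actor)."""
    name: str
    num_gpus: int
    num_cpus: int


def plan_workers(n_gpus: int, cpus_per_worker: int = 4) -> List[WorkerSpec]:
    """One master worker plus one model worker per GPU."""
    if n_gpus < 1:
        raise ValueError("at least one GPU is required")
    # Assumption: the master worker only coordinates and holds no GPU.
    workers = [WorkerSpec("master_worker/0", num_gpus=0, num_cpus=cpus_per_worker)]
    workers += [
        WorkerSpec(f"model_worker/{i}", num_gpus=1, num_cpus=cpus_per_worker)
        for i in range(n_gpus)
    ]
    return workers


if __name__ == "__main__":
    # With Ray, each spec would become the resource options of one actor,
    # e.g. SomeActor.options(num_gpus=spec.num_gpus, num_cpus=spec.num_cpus).remote()
    for spec in plan_workers(8):
        print(spec)
```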

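> Could you provide a concise, high-level description of the code architecture so that I can study your work more effectively?

First of all, thank you for your interest in our work! We will add the code architecture section in v0.3.0.

The library is currently about 40k lines of code, so I think it is hard to understand every detail. The key to understanding this code is getting a few terms and concepts straight.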

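At the system level, a model is an LLM (or a shard of one, if tensor/pipeline parallelism is used); a model interface is an implementation of a particular kind of model function call; and a model backend is the external implementation (Megatron or DeepSpeed) needed to run gradient all-reduce and the ZeRO optimizer. Each model worker runs on one GPU and hosts multiple models, each with its own interface and backend. When the master worker sends a request, the model worker looks up the corresponding model, runs the interface implementation for that request, and returns the result. That is roughly how the runtime works.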
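A toy sketch of that request loop, to make the terms concrete. The class names HostedModel and ModelWorker, the handle names, and the string-based stand-ins for real model calls are all invented for illustration and do not mirror ReaL's classes.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class HostedModel:
    """A model (shard) together with its interface and backend."""
    name: str
    # Maps a request type (e.g. "generate") to the interface implementation.
    interface: Dict[str, Callable[[Any], Any]]
    backend: str  # e.g. "megatron" or "deepspeed"


@dataclass
class ModelWorker:
    """Hosts several models on one GPU and serves requests from the master."""
    models: Dict[str, HostedModel] = field(default_factory=dict)

    def register(self, model: HostedModel) -> None:
        self.models[model.name] = model

    def handle_request(self, model_name: str, handle: str, data: Any) -> Any:
        model = self.models[model_name]   # find the target model
        fn = model.interface[handle]      # pick the interface implementation
        return fn(data)                   # run it and return the result


if __name__ == "__main__":
    worker = ModelWorker()
    worker.register(HostedModel("actor", {"generate": lambda p: p + " <response>"}, "megatron"))
    worker.register(HostedModel("critic", {"inference": lambda s: len(s)}, "deepspeed"))
    # The master worker would send requests like these over the network.
    print(worker.handle_request("actor", "generate", "prompt"))
    print(worker.handle_request("critic", "inference", "prompt <response>"))
```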

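At the algorithm level, the models are all decoder-only LLMs, so there is not much to say, and I don't think the backend details matter either, since the backends are thin wrappers over external code that mainly provide the ZeRO optimizer implementation. What matters is how the interfaces are organized to connect the algorithm's dataflow graph. Take a look at our two new algorithm examples and compare them with the built-in interface/experiment implementations. In essence, they all define the graph with the MFCDef class and input/output keys, then attach a concrete interface implementation to each node.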
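To show how input/output keys alone can define such a graph, here is a self-contained sketch. It does not use the real MFCDef class; the Node dataclass, the node names, and the key names are assumptions chosen to resemble a PPO dataflow with actor, critic, ref, and rew models.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for one model function call in the graph."""
    name: str
    input_keys: Tuple[str, ...]
    output_keys: Tuple[str, ...]


def build_dependencies(nodes: List[Node]) -> Dict[str, Set[str]]:
    """Map each node to the nodes that produce the keys it consumes."""
    producer = {key: n.name for n in nodes for key in n.output_keys}
    deps: Dict[str, Set[str]] = {n.name: set() for n in nodes}
    for n in nodes:
        for key in n.input_keys:
            # Keys with no producer (e.g. "prompts") come from the dataset.
            if key in producer and producer[key] != n.name:
                deps[n.name].add(producer[key])
    return deps


TRAIN_INPUTS = ("seq", "logprobs", "ref_logprobs", "rewards", "values")
PPO_NODES = [
    Node("actor_gen", ("prompts",), ("seq", "logprobs")),
    Node("rew_inf", ("seq",), ("rewards",)),
    Node("ref_inf", ("seq",), ("ref_logprobs",)),
    Node("critic_inf", ("seq",), ("values",)),
    Node("actor_train", TRAIN_INPUTS, ()),
    Node("critic_train", TRAIN_INPUTS, ()),
]

if __name__ == "__main__":
    order = list(TopologicalSorter(build_dependencies(PPO_NODES)).static_order())
    print(order)  # generation first, then the inference calls, then training
```

Because edges are derived from matching keys, adding a node only requires declaring what it reads and writes; the execution order follows from that.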

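Next is how the code gets launched from the command line. The command-line configuration is first converted into configurations for the model workers and the master worker, which are passed to realhf.apps.main. The main_start function then calls the scheduler to launch the controller and all workers (the controller is a monitoring process and has nothing to do with RLHF).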
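As a rough picture of the first step only, the sketch below turns dotted key=value arguments, like those in the quickstart command earlier in this thread, into a nested dictionary. It is a generic illustration; parse_overrides and parse_value are hypothetical helpers, and ReaL's real configuration parsing may behave differently.

```python
from typing import Any, Dict, List


def parse_value(raw: str) -> Any:
    """Convert a raw string into None, bool, int, or float when possible."""
    if raw == "null":
        return None
    if raw in ("True", "true"):
        return True
    if raw in ("False", "false"):
        return False
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw  # leave everything else (paths, names) as a string


def parse_overrides(args: List[str]) -> Dict[str, Any]:
    """Build a nested dict from arguments such as 'actor.type._class=llama'."""
    config: Dict[str, Any] = {}
    for arg in args:
        key, sep, raw = arg.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {arg!r}")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})  # descend, creating levels as needed
        node[leaf] = parse_value(raw)
    return config


if __name__ == "__main__":
    cfg = parse_overrides([
        "actor.type._class=llama",
        "critic.type.is_critic=True",
        "exp_ctrl.save_freq_steps=null",
        "ppo.kl_ctl=0.1",
    ])
    print(cfg)
```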

I also made a temporary diagram for reference:

[Image: code_arch]

garrett4wade commented 1 month ago

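One more thing: using the identifier meta-llama/Llama-2-7b very likely will not work, because ReaL relies on pytorch_model.bin.index.json to determine which files each LLM shard needs to load. Please download the model first and pass the local path.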
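Before launching, you can sanity-check a downloaded checkpoint directory with a sketch like the one below. It assumes the usual Hugging Face sharded-checkpoint layout, where pytorch_model.bin.index.json contains a weight_map from parameter names to shard files; list_shard_files is a hypothetical helper, not part of ReaL, and the demo builds a fake index in a temporary directory.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def list_shard_files(model_dir: str) -> List[str]:
    """Return the shard file names referenced by the checkpoint index."""
    index_path = Path(model_dir) / "pytorch_model.bin.index.json"
    if not index_path.is_file():
        raise FileNotFoundError(
            f"{index_path} not found; pass a local checkpoint directory, not a hub identifier"
        )
    with index_path.open() as f:
        index = json.load(f)
    # weight_map: parameter name -> shard file that stores it
    return sorted(set(index["weight_map"].values()))


if __name__ == "__main__":
    # Demo with a fake two-shard index written to a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        fake_index = {
            "metadata": {},
            "weight_map": {
                "model.embed_tokens.weight": "pytorch_model-00001-of-00002.bin",
                "lm_head.weight": "pytorch_model-00002-of-00002.bin",
            },
        }
        (Path(tmp) / "pytorch_model.bin.index.json").write_text(json.dumps(fake_index))
        print(list_shard_files(tmp))
```

If the function raises FileNotFoundError for your directory, the path passed as actor.path (and the other path options) is probably not a local sharded checkpoint.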

openpsi-project / ReaLHF

Is there an example for multi-node distributed jobs? #24