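First of all, thank you to the author for this excellent work and for sharing it. Below is the problem I ran into while trying to run the code.
   I want to run the semantic_segmentation part of the code. Following the relevant README, I downloaded the dataset and modified slurm_train.sh. Before seeing this code I did not know what Slurm was; after reading up on it I realized that your environment differs from mine, since I am only running the code on a single server in our lab. I also saw similar cases in other issues, so I changed slurm_train.sh to the following: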

#!/usr/bin/env bash

NOW=$(date '+%m-%d-%H:%M:%S')
OUTPUT_DIR=../outputs/seg

CONFIG_DIR=/home/katie/code/semantic_segmentation/configs/ade20k

CKPT=/home/katie/code/semantic_segmentation/biformer_base_best.pth
MODEL=upernet.biformer_base

CONFIG=${CONFIG_DIR}/${MODEL}.py
WORK_DIR=${OUTPUT_DIR}/${MODEL}/${NOW}
mkdir -p ${WORK_DIR}

python -m torch.distributed.launch --nproc_per_node=2 --master_port=25643 train.py ${CONFIG} \
--launcher="pytorch" \
--work-dir=${WORK_DIR} \
--options model.pretrained=${CKPT} \

I then ran bash slurm_train.sh in the terminal and got the following output:

(biformer) katie@a-ubuntu-16-04-lts:~/code/semantic_segmentation$ bash slurm_train.sh
/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
  warnings.warn(
/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
  warnings.warn(
usage: train.py [-h] [--config CONFIG] [--work-dir WORK_DIR] [--load-from LOAD_FROM] [--resume-from RESUME_FROM] [--no-validate] [--gpus GPUS | --gpu-ids GPU_IDS [GPU_IDS ...]] [--seed SEED] [--deterministic] [--options OPTIONS [OPTIONS ...]] [--launcher {none,pytorch,slurm,mpi}] [--local_rank LOCAL_RANK]
train.py: error: unrecognized arguments: /home/katie/code/semantic_segmentation/configs/ade20k/upernet.biformer_base.py
usage: train.py [-h] [--config CONFIG] [--work-dir WORK_DIR] [--load-from LOAD_FROM] [--resume-from RESUME_FROM] [--no-validate] [--gpus GPUS | --gpu-ids GPU_IDS [GPU_IDS ...]] [--seed SEED] [--deterministic] [--options OPTIONS [OPTIONS ...]] [--launcher {none,pytorch,slurm,mpi}] [--local_rank LOCAL_RANK]
train.py: error: unrecognized arguments: /home/katie/code/semantic_segmentation/configs/ade20k/upernet.biformer_base.py
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 12772) of binary: /home/katie/anaconda3/envs/biformer/bin/python
Traceback (most recent call last):
  File "/home/katie/anaconda3/envs/biformer/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/katie/anaconda3/envs/biformer/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
[1]:
  time      : 2023-08-15_15:54:52
  host      : a-ubuntu-16-04-lts
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 12773)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time      : 2023-08-15_15:54:52
  host      : a-ubuntu-16-04-lts
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 12772)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

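Setting aside the long warnings, I think the main error is train.py: error: unrecognized arguments: /home/katie/code/semantic_segmentation/configs/ade20k/upernet.biformer_base.py. I have checked the path many times and it looks correct to me. Do you have a better way to solve this, or can you see where I went wrong? I am very much looking forward to your reply.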

You can compare the two open-mmlab training scripts and adapt yours accordingly:

https://github.com/open-mmlab/mmsegmentation/blob/main/tools/dist_train.sh (for a local multi-GPU machine)
https://github.com/open-mmlab/mmsegmentation/blob/main/tools/slurm_train.sh (for a Slurm-managed cluster)
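
For readers hitting the same error: the usage message in the log lists the config as an option ([--config CONFIG]) rather than as a positional argument, which suggests that the path itself is fine and that it may need to be passed as --config ${CONFIG} in the launch command. The sketch below reproduces that behaviour with a small argparse parser. The parser is a hypothetical reconstruction from the usage message only, not the repository's actual train.py, and the argument lists are illustrative (torch.distributed.launch prepends --local_rank to each process's arguments, as the FutureWarning notes).

```python
import argparse


def build_parser():
    # Hypothetical reconstruction based only on the usage message in the log;
    # the real train.py may define these arguments differently.
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("--config")
    parser.add_argument("--work-dir")
    parser.add_argument("--options", nargs="+")
    parser.add_argument(
        "--launcher",
        choices=["none", "pytorch", "slurm", "mpi"],
        default="none",
    )
    parser.add_argument("--local_rank", type=int, default=0)
    return parser


def try_parse(argv):
    """Return (namespace, None) on success, or (None, exit_code) if argparse rejects argv."""
    try:
        return build_parser().parse_args(argv), None
    except SystemExit as exc:
        return None, exc.code


CONFIG = "configs/ade20k/upernet.biformer_base.py"

# Config passed positionally, as in the launch command in the question.
_, positional_exit_code = try_parse(
    ["--local_rank=0", CONFIG, "--launcher=pytorch"]
)
print("positional config, exit code:", positional_exit_code)

# Config passed through the --config option listed in the usage message.
flag_args, flag_exit_code = try_parse(
    ["--local_rank=0", "--config", CONFIG, "--launcher=pytorch"]
)
print("--config option, parsed config:", flag_args.config)
```

Because the parser defines no positional argument, argparse reports the bare path as an unrecognized argument and exits with code 2, which matches the exit code of both ranks in the log.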

Thank you for your reply. I will do my best to try it.

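Hello, after fixing the problem of the config file not being read, I hit another error while building the model: KeyError: "EncoderDecoder: 'MaxViTSTL_mm is not in the models registry'". After looking into it, I believe the custom model has not been added to the registry in the mmcv package (registry.py), so the MaxViTSTL_mm module cannot be found. The solutions I found online all run setup.py under the mmsegmentation framework to update the package, but this project does not seem to have a setup.py file. Is there a way to solve this? Thank you very much; I look forward to your reply.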

I have no way of knowing what modifications you made to cause this problem. All I can tell you is how the model is registered in my code:

https://github.com/rayleizhu/BiFormer/blob/1697bbbeafb8680524898f1dcaac10defd0604be/semantic_segmentation/models_mm/__init__.py#L2

https://github.com/rayleizhu/BiFormer/blob/1697bbbeafb8680524898f1dcaac10defd0604be/semantic_segmentation/train.py#L11

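Running in a non-Slurm environment does not require touching anything else; you only need to modify the launch script slurm_train.sh. You seem to have overcomplicated the problem.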
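
To illustrate why a "not in the models registry" error can appear without any missing setup.py: registries of this kind are typically filled at import time, so a class is only known once the module that defines it has been imported. The sketch below is a simplified stand-in, not mmcv's Registry implementation; the Registry class, the import_models_mm helper, and the depth parameter are all hypothetical, and only the name MaxViTSTL_mm comes from the error message above.

```python
class Registry:
    """Tiny name -> class registry (a simplified stand-in, not mmcv's implementation)."""

    def __init__(self, name):
        self.name = name
        self._modules = {}

    def register_module(self, cls):
        # Used as a decorator: runs when the defining module body is executed.
        self._modules[cls.__name__] = cls
        return cls

    def build(self, cfg):
        cfg = dict(cfg)
        type_name = cfg.pop("type")
        if type_name not in self._modules:
            raise KeyError(f"{type_name} is not in the {self.name} registry")
        return self._modules[type_name](**cfg)


BACKBONES = Registry("models")


def import_models_mm():
    """Simulates importing the model package: executing its body registers the class."""

    @BACKBONES.register_module
    class MaxViTSTL_mm:
        def __init__(self, depth=4):  # 'depth' is a hypothetical parameter
            self.depth = depth

    return MaxViTSTL_mm


cfg = {"type": "MaxViTSTL_mm", "depth": 2}

# Building before the defining module has been imported fails with a KeyError.
try:
    BACKBONES.build(cfg)
except KeyError as err:
    failed_before_import = True
    print("before import:", err)
else:
    failed_before_import = False

# After the import has run, the same config builds successfully.
import_models_mm()
backbone = BACKBONES.build(cfg)
print("after import:", type(backbone).__name__, backbone.depth)
```

Under this assumption, the thing to check is whether the import of the model package in train.py still runs in the modified setup, rather than whether a package needs reinstalling.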

Hello, have you managed to get training working on a single server? Did your modification of slurm_train.sh succeed?

Hello, do you have a version of the launch script slurm_train.sh modified for a non-Slurm environment? If so, could you share it? Thank you again for your work.

Did you get the code? Could you share it?

rayleizhu / BiFormer

Segmentation model implementation problem #27
