rayleizhu / BiFormer

[CVPR 2023] Official code release of our paper "BiFormer: Vision Transformer with Bi-Level Routing Attention"
https://arxiv.org/abs/2303.08810
MIT License

Problem implementing the segmentation model #27

Closed: katie1109sjuwwbx closed this issue 11 months ago

katie1109sjuwwbx commented 11 months ago
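   First of all, thank you for this excellent work and for sharing it. Below is the problem I ran into while running the code.
   I want to run the semantic_segmentation part of the code. Following the relevant README, I downloaded the dataset and modified slurm_train.sh. Before seeing this code I did not know what SLURM was. After reading up on it, I found that your environment differs from mine: I am only running the code on a single server in our lab. Other issues describe similar situations, so I changed slurm_train.sh to the following: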

#!/usr/bin/env bash

NOW=$(date '+%m-%d-%H:%M:%S')
OUTPUT_DIR=../outputs/seg

CONFIG_DIR=/home/katie/code/semantic_segmentation/configs/ade20k

CKPT=/home/katie/code/semantic_segmentation/biformer_base_best.pth
MODEL=upernet.biformer_base

CONFIG=${CONFIG_DIR}/${MODEL}.py
WORK_DIR=${OUTPUT_DIR}/${MODEL}/${NOW}
mkdir -p ${WORK_DIR}

python -m torch.distributed.launch --nproc_per_node=2 --master_port=25643 train.py ${CONFIG} \
    --launcher="pytorch" \
    --work-dir=${WORK_DIR} \
    --options model.pretrained=${CKPT}

Then I ran bash slurm_train.sh in the terminal and hit the following problem:

(biformer) katie@a-ubuntu-16-04-lts:~/code/semantic_segmentation$ bash slurm_train.sh
/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects --local_rank argument to be set, please change it to read from os.environ['LOCAL_RANK'] instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
  warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
  warnings.warn(
/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
  warnings.warn(
usage: train.py [-h] [--config CONFIG] [--work-dir WORK_DIR] [--load-from LOAD_FROM] [--resume-from RESUME_FROM] [--no-validate] [--gpus GPUS | --gpu-ids GPU_IDS [GPU_IDS ...]] [--seed SEED] [--deterministic] [--options OPTIONS [OPTIONS ...]] [--launcher {none,pytorch,slurm,mpi}] [--local_rank LOCAL_RANK]
train.py: error: unrecognized arguments: /home/katie/code/semantic_segmentation/configs/ade20k/upernet.biformer_base.py
usage: train.py [-h] [--config CONFIG] [--work-dir WORK_DIR] [--load-from LOAD_FROM] [--resume-from RESUME_FROM] [--no-validate] [--gpus GPUS | --gpu-ids GPU_IDS [GPU_IDS ...]] [--seed SEED] [--deterministic] [--options OPTIONS [OPTIONS ...]] [--launcher {none,pytorch,slurm,mpi}] [--local_rank LOCAL_RANK]
train.py: error: unrecognized arguments: /home/katie/code/semantic_segmentation/configs/ade20k/upernet.biformer_base.py
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 12772) of binary: /home/katie/anaconda3/envs/biformer/bin/python
Traceback (most recent call last):
  File "/home/katie/anaconda3/envs/biformer/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/katie/anaconda3/envs/biformer/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
[1]:
  time      : 2023-08-15_15:54:52
  host      : a-ubuntu-16-04-lts
  rank      : 1 (local_rank: 1)
  exitcode  : 2 (pid: 12773)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time      : 2023-08-15_15:54:52
  host      : a-ubuntu-16-04-lts
  rank      : 0 (local_rank: 0)
  exitcode  : 2 (pid: 12772)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

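Setting aside the lengthy warnings, I believe the main error is train.py: error: unrecognized arguments: /home/katie/code/semantic_segmentation/configs/ade20k/upernet.biformer_base.py. I have checked the path many times and it looks correct. Could you suggest a fix, or point out what I am doing wrong? I very much look forward to your reply.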

rayleizhu commented 11 months ago

You can compare the two open-mmlab training scripts and make the corresponding changes.
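One clue is already visible in the log above: the usage line lists the config as an optional flag, `[--config CONFIG]`, and shows no positional argument, so a bare path after train.py has nothing to bind to. The sketch below illustrates that behaviour with a small argparse parser reconstructed from the usage line only. It is an assumption for illustration, not the repository's actual train.py, and the `leftover_args` helper and the shortened config path are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Reconstructed from the usage line in the log; NOT the real train.py.
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("--config")
    parser.add_argument("--work-dir")
    parser.add_argument("--options", nargs="+")
    parser.add_argument(
        "--launcher", choices=["none", "pytorch", "slurm", "mpi"], default="none"
    )
    parser.add_argument("--local_rank", type=int, default=0)
    return parser


def leftover_args(argv):
    """Return the arguments the parser could not match (empty if all matched)."""
    _, unknown = build_parser().parse_known_args(argv)
    return unknown


CONFIG = "configs/ade20k/upernet.biformer_base.py"  # shortened, hypothetical path

# Config passed as a bare path, as in the modified script above.
positional = [CONFIG, "--launcher=pytorch", "--work-dir=out"]
# Config passed through the --config flag that the usage line advertises.
flagged = ["--config", CONFIG, "--launcher=pytorch", "--work-dir=out"]

print(leftover_args(positional))  # the config path is left unmatched
print(leftover_args(flagged))     # nothing is left unmatched
```

With `parse_args` instead of `parse_known_args`, the first case would exit with the same "unrecognized arguments" message seen in the log, which suggests the path itself is fine and only the way it is passed needs to change.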

katie1109sjuwwbx commented 11 months ago

Thank you for your reply. I will do my best to try it.

katie1109sjuwwbx commented 11 months ago

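Hello. After fixing the problem of the config file not being read, building the model now fails with KeyError: "EncoderDecoder: 'MaxViTSTL_mm is not in the models registry'". From what I found, the custom model has probably not been added to the registry in mmcv's registry.py, so the MaxViTSTL_mm module cannot be found. The fixes I found online all involve running setup.py under the mmsegmentation framework to update the package, but our project does not seem to have a setup.py. Is there a way to solve this? Thank you very much; I look forward to your reply.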

rayleizhu commented 11 months ago
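  1. I have no way of knowing what changes you made that caused this problem. I can only tell you that in my code the model is registered like this: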

https://github.com/rayleizhu/BiFormer/blob/1697bbbeafb8680524898f1dcaac10defd0604be/semantic_segmentation/models_mm/__init__.py#L2

https://github.com/rayleizhu/BiFormer/blob/1697bbbeafb8680524898f1dcaac10defd0604be/semantic_segmentation/train.py#L11

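  2. Running in a non-SLURM environment does not require touching anything else. You only need to modify the launch script slurm_train.sh. You seem to have overcomplicated the problem.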
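The two links point at an `__init__.py` and at an import line in train.py, which suggests (this is an assumption, since the linked lines are not quoted in the thread) that registration happens as a side effect of importing the model package rather than through any setup.py step. The sketch below illustrates that pattern with a tiny hand-written registry. `Registry`, `BACKBONES`, and `ToyBackbone` are hypothetical stand-ins, not mmcv's implementation or the repository's classes.

```python
class Registry:
    """Tiny stand-in for a decorator-based model registry (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self._modules = {}

    def register_module(self):
        # Returns a decorator that records the class under its own name.
        def decorator(cls):
            self._modules[cls.__name__] = cls
            return cls
        return decorator

    def build(self, cfg):
        cfg = dict(cfg)
        type_name = cfg.pop("type")
        if type_name not in self._modules:
            raise KeyError(f"{type_name} is not in the {self.name} registry")
        return self._modules[type_name](**cfg)


BACKBONES = Registry("models")


def try_build(cfg):
    """Return the error text if the lookup fails, else None."""
    try:
        BACKBONES.build(cfg)
    except KeyError as err:
        return str(err)
    return None


# Lookup BEFORE the defining code has run: the name is unknown.
error_message = try_build({"type": "ToyBackbone", "depth": 2})


# Executing this decorated class body plays the role of importing the
# model package: it is the step that actually adds the entry.
@BACKBONES.register_module()
class ToyBackbone:
    def __init__(self, depth):
        self.depth = depth


model = BACKBONES.build({"type": "ToyBackbone", "depth": 2})
print(error_message)
print(type(model).__name__, model.depth)
```

Under this pattern, a "not in the registry" error usually means the module that defines the class was never imported in the running process, so checking that the import in the training entry point is still present is a reasonable first step.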
taoxingwang commented 9 months ago

Hello, have you since managed to get training working on a single server? Did your modification of slurm_train.sh succeed?

taoxingwang commented 9 months ago

Hello, do you have a version of the launch script slurm_train.sh that has been modified for a non-SLURM environment? If so, could you share it? Thanks again for your work.
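No modified script was posted in the thread. As a rough sketch only, the launch command for a single machine could be assembled as below, using the launcher invocation from the original post and only the train.py flags that appear in the usage message quoted earlier in this thread. The `build_launch_command` helper and the example paths are hypothetical, and the command is printed rather than executed.

```python
import shlex
import sys


def build_launch_command(config, work_dir, checkpoint,
                         nproc_per_node=2, master_port=25643):
    """Assemble an argv list for a single-node, multi-GPU launch (sketch)."""
    return [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",
        f"--master_port={master_port}",
        "train.py",
        "--config", config,          # passed as a flag, per the usage message
        "--launcher", "pytorch",
        "--work-dir", work_dir,
        "--options", f"model.pretrained={checkpoint}",
    ]


# Hypothetical example paths; substitute your own.
cmd = build_launch_command(
    config="configs/ade20k/upernet.biformer_base.py",
    work_dir="../outputs/seg/upernet.biformer_base/run1",
    checkpoint="biformer_base_best.pth",
)
print(shlex.join(cmd))
```

Keeping `--options` last matters in this sketch because the usage message shows it accepting one or more values, so anything placed after it could be consumed as an option value.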