Closed katie1109sjuwwbx closed 1 year ago
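You can compare open-mmlab's two training scripts and make the corresponding changes.
Thank you for your reply. I will do my best to try it.
Hello author. After fixing the problem of the config file not being read, I hit another error when building the model: KeyError: "EncoderDecoder: 'MaxViTSTL_mm is not in the models registry'". After looking into it, I found that the custom model apparently has not been added to registry.py in the mmcv package, so the MaxViTSTL_mm module cannot be found. The solutions I found online all involve running setup.py under the mmsegmentation framework to update the package, but our project does not seem to have a setup.py file. Do you have any way to solve this? Thank you very much; I look forward to your reply.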
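As background to the KeyError above: in mmcv 1.x-style registries a class is usually added through a `register_module()` decorator, and a decorator only runs when the Python file that defines the class is imported. One possible cause of the error (an assumption; nothing in this thread confirms it) is that the file defining the custom model is never imported before the model is built. The sketch below uses a toy `Registry` class and a hypothetical `ToyBackbone`; it is a simplified stand-in, not mmcv's actual implementation.

```python
class Registry:
    """Toy name -> class registry (a stand-in, not mmcv's implementation)."""

    def __init__(self, name):
        self.name = name
        self._module_dict = {}

    def register_module(self):
        # Used as a decorator. It runs when the decorated class is defined,
        # i.e. when the module that contains the class is imported.
        def _register(cls):
            self._module_dict[cls.__name__] = cls
            return cls
        return _register

    def build(self, cfg):
        cfg = dict(cfg)
        obj_type = cfg.pop("type")
        if obj_type not in self._module_dict:
            raise KeyError(f"{obj_type} is not in the {self.name} registry")
        return self._module_dict[obj_type](**cfg)


BACKBONES = Registry("models")


def import_custom_backbone():
    """Simulates importing the file that defines the custom backbone."""

    @BACKBONES.register_module()
    class ToyBackbone:  # hypothetical model name, for illustration only
        def __init__(self, depth=2):
            self.depth = depth

    return ToyBackbone


def try_build(cfg):
    """Return the built object, or the KeyError if the type is unknown."""
    try:
        return BACKBONES.build(cfg)
    except KeyError as err:
        return err


before = try_build({"type": "ToyBackbone", "depth": 4})  # not imported yet
import_custom_backbone()                                  # the "import" happens
after = try_build({"type": "ToyBackbone", "depth": 4})   # now it can be built
print(type(before).__name__, type(after).__name__)
```

If this is what is happening, importing the module that defines the model (directly in the training script, or from a package `__init__.py` that the script imports) may be enough, without any setup.py.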
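First of all, thank you for this excellent work and for sharing it. Below is the problem I ran into while running the code. I want to run the semantic_segmentation part. Following the relevant README, I downloaded the dataset and modified slurm_train.sh. Before seeing this code I did not know what slurm was; after reading up on it I realized that your environment differs from mine, because I am only running the code on a single server in our lab. I also saw similar situations in other issues, so I changed slurm_train.sh to the following: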
    #!/usr/bin/env bash

    NOW=$(date '+%m-%d-%H:%M:%S')
    OUTPUT_DIR=../outputs/seg
    CONFIG_DIR=/home/katie/code/semantic_segmentation/configs/ade20k
    CKPT=/home/katie/code/semantic_segmentation/biformer_base_best.pth
    MODEL=upernet.biformer_base
    CONFIG=${CONFIG_DIR}/${MODEL}.py
    WORK_DIR=${OUTPUT_DIR}/${MODEL}/${NOW}
    mkdir -p ${WORK_DIR}

    python -m torch.distributed.launch --nproc_per_node=2 --master_port=25643 train.py ${CONFIG} \
        --launcher="pytorch" \
        --work-dir=${WORK_DIR} \
        --options model.pretrained=${CKPT}
I then ran bash slurm_train.sh in the terminal and got the following:
    (biformer) katie@a-ubuntu-16-04-lts:~/code/semantic_segmentation$ bash slurm_train.sh
    /home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated and will be removed in future. Use torchrun. Note that --use_env is set by default in torchrun. If your script expects `--local_rank` argument to be set, please change it to read from `os.environ['LOCAL_RANK']` instead. See https://pytorch.org/docs/stable/distributed.html#launch-utility for further instructions
      warnings.warn(
    WARNING:torch.distributed.run:
    Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
    /home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
      warnings.warn(
    /home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
      warnings.warn(
    usage: train.py [-h] [--config CONFIG] [--work-dir WORK_DIR] [--load-from LOAD_FROM] [--resume-from RESUME_FROM] [--no-validate] [--gpus GPUS | --gpu-ids GPU_IDS [GPU_IDS ...]] [--seed SEED] [--deterministic] [--options OPTIONS [OPTIONS ...]] [--launcher {none,pytorch,slurm,mpi}] [--local_rank LOCAL_RANK]
    train.py: error: unrecognized arguments: /home/katie/code/semantic_segmentation/configs/ade20k/upernet.biformer_base.py
    usage: train.py [-h] [--config CONFIG] [--work-dir WORK_DIR] [--load-from LOAD_FROM] [--resume-from RESUME_FROM] [--no-validate] [--gpus GPUS | --gpu-ids GPU_IDS [GPU_IDS ...]] [--seed SEED] [--deterministic] [--options OPTIONS [OPTIONS ...]] [--launcher {none,pytorch,slurm,mpi}] [--local_rank LOCAL_RANK]
    train.py: error: unrecognized arguments: /home/katie/code/semantic_segmentation/configs/ade20k/upernet.biformer_base.py
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 12772) of binary: /home/katie/anaconda3/envs/biformer/bin/python
    Traceback (most recent call last):
      File "/home/katie/anaconda3/envs/biformer/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/home/katie/anaconda3/envs/biformer/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
        main()
      File "/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
        launch(args)
      File "/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
        run(args)
      File "/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
        elastic_launch(
      File "/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/katie/anaconda3/envs/biformer/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    train.py FAILED
    Failures:
    [1]:
      time      : 2023-08-15_15:54:52
      host      : a-ubuntu-16-04-lts
      rank      : 1 (local_rank: 1)
      exitcode  : 2 (pid: 12773)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    Root Cause (first observed failure):
    [0]:
      time      : 2023-08-15_15:54:52
      host      : a-ubuntu-16-04-lts
      rank      : 0 (local_rank: 0)
      exitcode  : 2 (pid: 12772)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Setting aside the long warnings, I think the main error is probably train.py: error: unrecognized arguments: /home/katie/code/semantic_segmentation/configs/ade20k/upernet.biformer_base.py. I have checked the path many times and it looks correct. Do you have a better solution, or can you see where I went wrong? I very much look forward to your reply.
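A note on the error itself: the usage line in the log lists `[--config CONFIG]` among the options and shows no positional argument. That suggests (as a reading of the log, not something confirmed in this thread) that the parser rejects the bare path because it expects the config to be passed as `--config <path>`. The sketch below reproduces the behavior with a hypothetical `argparse` parser that mirrors only part of that usage line; it is not copied from the repository's train.py.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring part of the usage line in the log.
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("--config", help="train config file path")
    parser.add_argument("--work-dir", help="directory for logs and checkpoints")
    parser.add_argument(
        "--launcher",
        choices=["none", "pytorch", "slurm", "mpi"],
        default="none",
    )
    parser.add_argument("--local_rank", type=int, default=0)
    return parser


def parse(argv):
    """Return the parsed namespace, or None if argparse rejects argv."""
    try:
        return build_parser().parse_args(argv)
    except SystemExit:
        # argparse prints "error: unrecognized arguments: ..." to stderr
        # and exits with status 2 when it meets an unexpected argument.
        return None


cfg_path = "configs/ade20k/upernet.biformer_base.py"

# Config given as a bare positional path, as in the script above: rejected.
positional = parse(["--local_rank=0", cfg_path, "--launcher=pytorch"])

# Config given through the --config option: accepted.
as_option = parse(["--local_rank=0", "--config", cfg_path, "--launcher=pytorch"])

print(positional is None, as_option.config)
```

The exit status 2 that `argparse` uses for this kind of error matches the `exitcode : 2` reported for both ranks in the log.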
Hello, have you managed to train on a single server yet? Did your modification of slurm_train.sh work?
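- I have no way of knowing what changes you made that caused this problem. I can only tell you that this is how models are registered in my code:
- To run in a non-slurm environment you do not need to touch anything else; just modify the launch script slurm_train.sh. You seem to have overcomplicated the problem.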
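For readers looking for a starting point, here is a minimal sketch of how a single-node launch command could be assembled. It reuses the launcher flags that appear in the script posted in this thread (`--nproc_per_node`, `--master_port`, `--launcher`, `--work-dir`, `--options`). Passing the config through `--config` is an assumption based on the usage line printed in the error log, the helper `build_launch_command` is hypothetical, and the result has not been tested against the repository.

```python
import shlex


def build_launch_command(config, work_dir, checkpoint, gpus=2, port=25643):
    """Compose a single-node launch command as a list of arguments."""
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={gpus}",
        f"--master_port={port}",
        "train.py",
        "--config", config,  # assumption: config is an option, not positional
        "--launcher", "pytorch",
        "--work-dir", work_dir,
        "--options", f"model.pretrained={checkpoint}",
    ]


cmd = build_launch_command(
    config="configs/ade20k/upernet.biformer_base.py",
    work_dir="../outputs/seg/upernet.biformer_base",
    checkpoint="biformer_base_best.pth",
)

# Print the command as one shell-safe line that can be pasted into a script.
print(shlex.join(cmd))
```

Building the command as a list keeps each flag and value separate, which avoids quoting mistakes such as the curly quotes or split `- -options` that can creep in when a script is copied through a web page.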
Hello, do you have a version of the launch script slurm_train.sh modified for a non-slurm environment? If so, could you share it? Thank you again for your work.
Did you get the code? Could you share it?