open-mmlab / mmrotate

OpenMMLab Rotated Object Detection Toolbox and Benchmark
https://mmrotate.readthedocs.io/en/latest/
Apache License 2.0

[Bug] fps --task inference #1011

Open crml233 opened 8 months ago

crml233 commented 8 months ago

Prerequisite

Task

I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.

Branch

master branch https://github.com/open-mmlab/mmrotate

Environment

sys.platform: linux
Python: 3.8.16 (default, Jun 12 2023, 18:09:05) [GCC 11.2.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA TITAN X (Pascal)
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.24
GCC: gcc (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010
PyTorch: 1.12.1
PyTorch compiling details: PyTorch built with:
TorchVision: 0.13.1
OpenCV: 4.8.0
MMEngine: 0.7.4
MMRotate: 1.0.0rc1+4aae1fc

Reproduces the problem - code sample

When --task is left at its default value 'dataloader', the following command works:

CUDA_VISIBLE_DEVICES=3 python -m torch.distributed.launch --nproc_per_node=1 --master_port=29500 tools/analysis_tools/benchmark.py /home/czj/mmrotate/cfg_ship/SUBSRS/sub1/fcos_sub1_100e.py --checkpoint /home/czj/mmrotate/work_dirs/fcos_sub1_100e/epoch_100.pth --launcher pytorch

e.g.:

    .............
    03/19 10:20:35 - mmengine - INFO - ============== Done ==================
    03/19 10:20:35 - mmengine - INFO - Overall fps: 120.2 batch/s, times per batch: 8.3 ms/batch, batch size: 1, num_workers: 2
    03/19 10:20:35 - mmengine - INFO - (GB) mem_used: 9.38 | uss: 0.11 | pss: 0.36 | total_proc: 3
    .........

But when I change 'dataloader' to 'inference', an error occurs:

Reproduces the problem - command or script

CUDA_VISIBLE_DEVICES=3 python -m torch.distributed.launch --nproc_per_node=1 --master_port=29500 tools/analysis_tools/benchmark.py /home/czj/mmrotate/cfg_ship/SUBSRS/sub1/fcos_sub1_100e.py --checkpoint /home/czj/mmrotate/work_dirs/fcos_sub1_100e/epoch_100.pth --launcher pytorch

with the default value of '--task' changed to 'inference' inside tools/analysis_tools/benchmark.py,

or

CUDA_VISIBLE_DEVICES=3 python -m torch.distributed.launch --nproc_per_node=1 --master_port=29500 tools/analysis_tools/benchmark.py /home/czj/mmrotate/cfg_ship/SUBSRS/sub1/fcos_sub1_100e.py --checkpoint /home/czj/mmrotate/work_dirs/fcos_sub1_100e/epoch_100.pth --task inference --launcher pytorch
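A side note on the launcher: torch.distributed.launch is deprecated in recent PyTorch in favor of torchrun, as the FutureWarning in the log below points out. An equivalent torchrun invocation (untested here, same arguments otherwise) should be:

torchrun --nproc_per_node=1 --master_port=29500 tools/analysis_tools/benchmark.py /home/czj/mmrotate/cfg_ship/SUBSRS/sub1/fcos_sub1_100e.py --checkpoint /home/czj/mmrotate/work_dirs/fcos_sub1_100e/epoch_100.pth --task inference --launcher pytorch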

Reproduces the problem - error message


    FutureWarning: The module torch.distributed.launch is deprecated
    and will be removed in future. Use torchrun.
    Note that --use_env is set by default in torchrun.
    If your script expects `--local_rank` argument to be set, please
    change it to read from `os.environ['LOCAL_RANK']` instead. See
    https://pytorch.org/docs/stable/distributed.html#launch-utility for
    further instructions

      warnings.warn(
    /home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:26: UserWarning: Multi-processing start method `fork` is different from the previous setting `spawn`.It will be force set to `fork`. You can change this behavior by changing `mp_start_method` in your config.
      warnings.warn(
    /home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:46: UserWarning: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
      warnings.warn(
    /home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
      warnings.warn(
    03/19 10:27:59 - mmengine - INFO - before build: 
    03/19 10:27:59 - mmengine - INFO - (GB) mem_used: 9.34 | uss: 0.25 | pss: 0.30 | total_proc: 1
    Traceback (most recent call last):
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 122, in build_from_cfg
        obj = obj_cls(**args)  # type: ignore
    TypeError: __init__() got an unexpected keyword argument 'pretrained'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "tools/analysis_tools/benchmark.py", line 134, in <module>
        main()
      File "tools/analysis_tools/benchmark.py", line 129, in main
        benchmark = eval(f'{args.task}_benchmark')(args, cfg, distributed, logger)
      File "tools/analysis_tools/benchmark.py", line 72, in inference_benchmark
        benchmark = InferenceBenchmark(
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/mmdet/utils/benchmark.py", line 164, in __init__
        self.model = self._init_model(checkpoint, is_fuse_conv_bn)
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/mmdet/utils/benchmark.py", line 180, in _init_model
        model = MODELS.build(self.cfg.model)
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/mmengine/registry/registry.py", line 548, in build
        return self.build_func(cfg, *args, **kwargs, registry=self)
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 250, in build_model_from_cfg
        return build_from_cfg(cfg, registry, default_args)
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 144, in build_from_cfg
        raise type(e)(
    TypeError: class `FCOS` in mmdet/models/detectors/fcos.py: __init__() got an unexpected keyword argument 'pretrained'
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 22978) of binary: /home/czj/anaconda3/envs/mmrotate/bin/python
    Traceback (most recent call last):
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
        main()
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
        launch(args)
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
        run(args)
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
        elastic_launch(
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
    ============================================================
    tools/analysis_tools/benchmark.py FAILED
    ------------------------------------------------------------
    Failures:
      <NO_OTHER_FAILURES>
    ------------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time      : 2024-03-19_10:28:02
      host      : ubuntu-Super-Server
      rank      : 0 (local_rank: 0)
      exitcode  : 1 (pid: 22978)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    ============================================================
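The final TypeError points at the model config rather than at the benchmark script: MMDetection 3.x detectors, which MMRotate 1.x builds on, no longer accept the legacy top-level `pretrained` key, so MODELS.build(self.cfg.model) raises if the config still carries it. A minimal sketch of the usual migration, assuming the config still uses the old key (the backbone and checkpoint values below are placeholders, not taken from this thread):

# Hypothetical config fragment for illustration only; everything except
# the handling of the pretrained weights stays as in the original config.
model = dict(
    type='FCOS',
    # pretrained='torchvision://resnet50',  # legacy key: delete this line
    backbone=dict(
        type='ResNet',
        depth=50,
        # ...other backbone settings unchanged...
        init_cfg=dict(
            type='Pretrained', checkpoint='torchvision://resnet50')),
)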

Additional information

No response

qingyun259 commented 7 months ago

Hi, I just ran into the same problem. I modified these three places in benchmark.py; then just run benchmark.py directly and it works on Windows now. Hope it can help you.

import os

from mmengine.dist import init_dist

# Fake a single-process distributed environment so that init_dist
# succeeds without going through torch.distributed.launch.
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12345'
os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '1'
os.environ['PL_TORCH_DISTRIBUTED_BACKEND'] = 'gloo'

# Force the gloo backend (NCCL is not available on Windows).
init_dist(args.launcher, 'gloo')
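With those edits in place, benchmark.py can then be launched directly, without torch.distributed.launch. A sketch of the invocation, with placeholder paths:

python tools/analysis_tools/benchmark.py CONFIG_FILE --checkpoint CHECKPOINT_FILE --task inference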

crml233 commented 6 months ago

Thank you very much! It works!!