open-mmlab / mmrotate

OpenMMLab Rotated Object Detection Toolbox and Benchmark
https://mmrotate.readthedocs.io/en/latest/
Apache License 2.0

[Bug] fps --task inference #1011

Open crml233 opened 8 months ago

crml233 commented 8 months ago

Prerequisite

Task

I have modified the scripts/configs, or I'm working on my own tasks/models/datasets.

Branch

master branch https://github.com/open-mmlab/mmrotate

Environment

sys.platform: linux
Python: 3.8.16 (default, Jun 12 2023, 18:09:05) [GCC 11.2.0]
CUDA available: True
numpy_random_seed: 2147483648
GPU 0,1,2,3: NVIDIA TITAN X (Pascal)
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 10.1, V10.1.24
GCC: gcc (Ubuntu 5.5.0-12ubuntu1~16.04) 5.5.0 20171010
PyTorch: 1.12.1
PyTorch compiling details: PyTorch built with:
TorchVision: 0.13.1
OpenCV: 4.8.0
MMEngine: 0.7.4
MMRotate: 1.0.0rc1+4aae1fc

Reproduces the problem - code sample

When --task is left at its default value 'dataloader', the following command works:

CUDA_VISIBLE_DEVICES=3 python -m torch.distributed.launch --nproc_per_node=1 --master_port=29500 tools/analysis_tools/benchmark.py /home/czj/mmrotate/cfg_ship/SUBSRS/sub1/fcos_sub1_100e.py --checkpoint /home/czj/mmrotate/work_dirs/fcos_sub1_100e/epoch_100.pth --launcher pytorch

e.g.:

    .............
    03/19 10:20:35 - mmengine - INFO - ============== Done ==================
    03/19 10:20:35 - mmengine - INFO - Overall fps: 120.2 batch/s, times per batch: 8.3 ms/batch, batch size: 1, num_workers: 2
    03/19 10:20:35 - mmengine - INFO - (GB) mem_used: 9.38 | uss: 0.11 | pss: 0.36 | total_proc: 3
    .........

But when I change 'dataloader' to 'inference', an error occurs:

Reproduces the problem - command or script

CUDA_VISIBLE_DEVICES=3 python -m torch.distributed.launch --nproc_per_node=1 --master_port=29500 tools/analysis_tools/benchmark.py /home/czj/mmrotate/cfg_ship/SUBSRS/sub1/fcos_sub1_100e.py --checkpoint /home/czj/mmrotate/work_dirs/fcos_sub1_100e/epoch_100.pth --launcher pytorch

with the default value of '--task' changed to 'inference' inside tools/analysis_tools/benchmark.py,

or

CUDA_VISIBLE_DEVICES=3 python -m torch.distributed.launch --nproc_per_node=1 --master_port=29500 tools/analysis_tools/benchmark.py /home/czj/mmrotate/cfg_ship/SUBSRS/sub1/fcos_sub1_100e.py --checkpoint /home/czj/mmrotate/work_dirs/fcos_sub1_100e/epoch_100.pth --task inference --launcher pytorch
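A side note on the launcher: torch.distributed.launch is deprecated in recent PyTorch in favor of torchrun, as the FutureWarning in the log below points out. An equivalent torchrun invocation (untested here, same arguments otherwise) should be:

torchrun --nproc_per_node=1 --master_port=29500 tools/analysis_tools/benchmark.py /home/czj/mmrotate/cfg_ship/SUBSRS/sub1/fcos_sub1_100e.py --checkpoint /home/czj/mmrotate/work_dirs/fcos_sub1_100e/epoch_100.pth --task inference --launcher pytorch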

Reproduces the problem - error message


    FutureWarning: The module torch.distributed.launch is deprecated
    and will be removed in future. Use torchrun.
    Note that --use_env is set by default in torchrun.
    If your script expects `--local_rank` argument to be set, please
    change it to read from `os.environ['LOCAL_RANK']` instead. See
    https://pytorch.org/docs/stable/distributed.html#launch-utility for
    further instructions

      warnings.warn(
    /home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:26: UserWarning: Multi-processing start method `fork` is different from the previous setting `spawn`.It will be force set to `fork`. You can change this behavior by changing `mp_start_method` in your config.
      warnings.warn(
    /home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:46: UserWarning: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
      warnings.warn(
    /home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/mmengine/utils/dl_utils/setup_env.py:56: UserWarning: Setting MKL_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
      warnings.warn(
    03/19 10:27:59 - mmengine - INFO - before build: 
    03/19 10:27:59 - mmengine - INFO - (GB) mem_used: 9.34 | uss: 0.25 | pss: 0.30 | total_proc: 1
    Traceback (most recent call last):
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 122, in build_from_cfg
        obj = obj_cls(**args)  # type: ignore
    TypeError: __init__() got an unexpected keyword argument 'pretrained'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "tools/analysis_tools/benchmark.py", line 134, in <module>
        main()
      File "tools/analysis_tools/benchmark.py", line 129, in main
        benchmark = eval(f'{args.task}_benchmark')(args, cfg, distributed, logger)
      File "tools/analysis_tools/benchmark.py", line 72, in inference_benchmark
        benchmark = InferenceBenchmark(
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/mmdet/utils/benchmark.py", line 164, in __init__
        self.model = self._init_model(checkpoint, is_fuse_conv_bn)
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/mmdet/utils/benchmark.py", line 180, in _init_model
        model = MODELS.build(self.cfg.model)
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/mmengine/registry/registry.py", line 548, in build
        return self.build_func(cfg, *args, **kwargs, registry=self)
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 250, in build_model_from_cfg
        return build_from_cfg(cfg, registry, default_args)
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 144, in build_from_cfg
        raise type(e)(
    TypeError: class `FCOS` in mmdet/models/detectors/fcos.py: __init__() got an unexpected keyword argument 'pretrained'
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 22978) of binary: /home/czj/anaconda3/envs/mmrotate/bin/python
    Traceback (most recent call last):
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
        main()
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
        launch(args)
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
        run(args)
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
        elastic_launch(
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/czj/anaconda3/envs/mmrotate/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
    ============================================================
    tools/analysis_tools/benchmark.py FAILED
    ------------------------------------------------------------
    Failures:
      <NO_OTHER_FAILURES>
    ------------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time      : 2024-03-19_10:28:02
      host      : ubuntu-Super-Server
      rank      : 0 (local_rank: 0)
      exitcode  : 1 (pid: 22978)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    ============================================================
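The final TypeError points at the model config rather than at the benchmark script: MMDetection 3.x detectors, which MMRotate 1.x builds on, no longer accept the legacy top-level `pretrained` key, so MODELS.build(self.cfg.model) raises if the config still carries it. A minimal sketch of the usual migration, assuming the config still uses the old key (the backbone and checkpoint values below are placeholders, not taken from this thread):

# Hypothetical config fragment for illustration only; everything except
# the handling of the pretrained weights stays as in the original config.
model = dict(
    type='FCOS',
    # pretrained='torchvision://resnet50',  # legacy key: delete this line
    backbone=dict(
        type='ResNet',
        depth=50,
        # ...other backbone settings unchanged...
        init_cfg=dict(
            type='Pretrained', checkpoint='torchvision://resnet50')),
)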

Additional information

No response

qingyun259 commented 7 months ago

Hi, I just ran into the same problem. I modified these three places in benchmark.py; then just run benchmark.py directly and it works on Windows now. Hope it can help you.

import os

from mmengine.dist import init_dist

# Fake a single-process distributed environment so that init_dist
# succeeds without going through torch.distributed.launch.
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12345'
os.environ['RANK'] = '0'
os.environ['WORLD_SIZE'] = '1'
os.environ['PL_TORCH_DISTRIBUTED_BACKEND'] = 'gloo'

# Force the gloo backend (NCCL is not available on Windows).
init_dist(args.launcher, 'gloo')
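With those edits in place, benchmark.py can then be launched directly, without torch.distributed.launch. A sketch of the invocation, with placeholder paths:

python tools/analysis_tools/benchmark.py CONFIG_FILE --checkpoint CHECKPOINT_FILE --task inference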

crml233 commented 6 months ago

Thank you very much! It works!!