Visualization of FasterRCNN graph from Tensorboard

AwePhD commented 2 years ago

Hi,

First of all, thanks for developing mmdetection; it seems to be a great tool for focusing on the design of advanced architectures for the detection task family.

Bug description

I am trying to visualize the Faster R-CNN model graph with TensorBoard, but the line writer.add_graph, which should send the model to TensorBoard, fails.

Reproduction

from torch.utils.tensorboard import SummaryWriter
from mmdet.models import build_detector
from mmcv import Config
from tests.test_models.test_forward import _demo_mm_inputs  # test helper; script.py lives in the mmdetection repo root

cfg_model = "configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py"
cfg = Config.fromfile(cfg_model)

model = cfg.model
model['pretrained'] = None

writer = SummaryWriter("tensorboard_runs/faster_rcnn")

detector = build_detector(model)
input_shape = (1, 3, 256, 256)

# Test forward train with a non-empty truth batch
mm_inputs = _demo_mm_inputs(input_shape, num_items=[10])
imgs = mm_inputs.pop('imgs')
img_metas = mm_inputs.pop('img_metas')
gt_bboxes = mm_inputs['gt_bboxes']
gt_labels = mm_inputs['gt_labels']
gt_masks = mm_inputs['gt_masks']
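# Sanity check: a plain forward pass with losses runs without error
losses = detector.eval().forward(
    imgs,
    img_metas,
    gt_bboxes=gt_bboxes,
    gt_labels=gt_labels,
    gt_masks=gt_masks,
    return_loss=True,
    )

# Fails: add_graph traces the model with torch.jit.trace, which cannot infer
# a type for the img_metas dicts (see the traceback below)
writer.add_graph(detector.eval(),(imgs, img_metas,gt_bboxes,gt_labels,gt_masks))
writer.close()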

The code is closely based on issue #5140.

Environment

sys.platform: linux
Python: 3.8.8 (default, Feb 24 2021, 21:46:12) [GCC 7.3.0]
CUDA available: True
GPU 0: NVIDIA RTX A4000 Laptop GPU
CUDA_HOME: /usr/local/cuda
NVCC: Build cuda_11.1.TC455_06.29190527_0
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
PyTorch: 1.8.1
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

TorchVision: 0.9.1
OpenCV: 4.6.0
MMCV: 1.3.17
MMCV Compiler: GCC 7.3
MMCV CUDA Compiler: 11.1
MMDetection: 2.25.0+56e42e7

Installed from a Dockerfile closely based on the repository's Dockerfile. I made a few minor changes:

  - Changed the PyTorch/CUDA versions of the base image and of the mmcv-full install to versions compatible with my GPU.
  - Added the new public key of the NVIDIA apt repository.
  - Installed TensorBoard.
  - Added script.py, which contains the code above.

ARG PYTORCH="1.8.1"
ARG CUDA="11.1"
ARG CUDNN="8"

FROM pytorch/pytorch:${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel

ENV TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0+PTX"
ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
ENV CMAKE_PREFIX_PATH="$(dirname $(which conda))/../"

RUN rm /etc/apt/sources.list.d/cuda.list \
  && rm /etc/apt/sources.list.d/nvidia-ml.list \
  && apt-key del 7fa2af80 \
  && apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/3bf863cc.pub \
  && apt-get update && apt-get install -y ffmpeg libsm6 libxext6 git ninja-build libglib2.0-0 libxrender-dev \
  && apt-get clean \
  && rm -rf /var/lib/apt/lists/*

# Install MMCV
RUN pip install --no-cache-dir --upgrade pip wheel setuptools \
  && pip install --no-cache-dir mmcv-full==1.3.17 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.8.0/index.html

# Install MMDetection
RUN conda clean --all \
  && git clone https://github.com/open-mmlab/mmdetection.git /mmdetection
WORKDIR /mmdetection
ENV FORCE_CUDA="1"
RUN pip install --no-cache-dir -r requirements/build.txt \
  && pip install --no-cache-dir -e .

# Install tensorboard
RUN pip install --no-cache-dir tensorboard

# Add model
COPY ./script.py .

Error traceback

Basically the same output as this comment.

Error occurs, No graph saved
Traceback (most recent call last):
  File "/mmdetection/script.py", line 33, in <module>
    writer.add_graph(detector.eval(),(imgs, img_metas,gt_bboxes,gt_labels,gt_masks))
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/tensorboard/writer.py", line 723, in add_graph
    self._get_file_writer().add_graph(graph(model, input_to_model, verbose))
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/tensorboard/_pytorch_graph.py", line 292, in graph
    raise e
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/tensorboard/_pytorch_graph.py", line 286, in graph
    trace = torch.jit.trace(model, args)
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/_trace.py", line 733, in trace
    return trace_module(
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/_trace.py", line 934, in trace_module
    module._c._create_method_from_trace(
RuntimeError: Tracer cannot infer type of (tensor([[[[0.5488, 0.7152, 0.6028,  ..., 0.7487, 0.9037, 0.0834],
          [0.5522, 0.5845, 0.9619,  ..., 0.4461, 0.1046, 0.3485],
          [0.7401, 0.6805, 0.6224,  ..., 0.9450, 0.9919, 0.3767],
          ...,
          [0.5347, 0.1355, 0.3433,  ..., 0.9448, 0.5501, 0.4152],
          [0.6550, 0.8695, 0.7427,  ..., 0.7006, 0.1240, 0.1202],
          [0.9689, 0.2844, 0.2874,  ..., 0.6264, 0.0966, 0.8441]],

         [[0.1356, 0.8424, 0.9262,  ..., 0.9498, 0.7877, 0.2874],
          [0.8619, 0.4222, 0.5368,  ..., 0.5023, 0.0962, 0.6300],
          [0.8680, 0.9831, 0.6594,  ..., 0.0459, 0.6224, 0.2136],
          ...,
          [0.4898, 0.7893, 0.4492,  ..., 0.6242, 0.3400, 0.5197],
          [0.0792, 0.1700, 0.2699,  ..., 0.7916, 0.3890, 0.6352],
          [0.6384, 0.0778, 0.4447,  ..., 0.3807, 0.9440, 0.1556]],

         [[0.4450, 0.7703, 0.7811,  ..., 0.3255, 0.2211, 0.4891],
          [0.7809, 0.8071, 0.4325,  ..., 0.4420, 0.6906, 0.9553],
          [0.1343, 0.5162, 0.6405,  ..., 0.7125, 0.6389, 0.4962],
          ...,
          [0.5266, 0.0853, 0.4650,  ..., 0.7655, 0.7202, 0.9694],
          [0.1602, 0.1703, 0.0182,  ..., 0.4301, 0.1518, 0.9542],
          [0.1611, 0.1013, 0.6607,  ..., 0.0081, 0.9399, 0.8611]]]],
       requires_grad=True), [{'img_shape': (256, 256, 3), 'ori_shape': (256, 256, 3), 'pad_shape': (256, 256, 3), 'filename': '<demo>.png', 'scale_factor': array([1.1, 1.2, 1.1, 1.2]), 'flip': False, 'flip_direction': None}], [tensor([[  0.0000,   0.0000,  62.4313, 140.5942],
        [ 91.9011, 133.2158,  98.5872, 170.8792],
        [  0.0000, 161.8527,  92.9992, 207.3777],
        [  0.0000, 129.3509, 131.0759, 211.6468],
        [ 17.0086, 173.0515,  75.2071, 249.1510],
        [  0.0000,   0.0000, 132.8615,  88.4271],
        [  0.0000, 154.4528,  94.0470, 156.2830],
        [  0.0000,   0.0000, 117.2081, 208.8974],
        [  0.0000,  73.4916,  82.0480, 256.0000],
        [219.8390,  73.8144, 256.0000, 127.1684]])], [tensor([1, 7, 2, 5, 5, 1, 6, 4, 9, 4])], [BitmapMasks(num_masks=10, height=256, width=256)])
:Could not infer type of list element: Dictionary inputs to traced functions must have consistent type. Found Tuple[int, int, int] and str
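The last line names the culprit: the second input, img_metas, is a list of dicts whose values mix tuples, a string, an array, a bool and None, while the tracer requires dict inputs whose values share one type. The sketch below is a hypothetical pure-Python helper (first_inconsistent_dict is not part of PyTorch or MMDetection) that walks an input structure and reports the first dict with mixed value types. It only approximates the tracer's check, and the meta dict is a stand-in copied from the traceback above, with the NumPy scale_factor array replaced by a tuple and the image tensor replaced by None.

```python
from typing import Any, Optional, Tuple


def first_inconsistent_dict(obj: Any, path: str = "inputs") -> Optional[Tuple[str, str, str]]:
    """Return (path, first_type, other_type) for the first dict whose values
    have different Python types, or None if every dict is homogeneous.

    Hypothetical helper: it only approximates the torch.jit tracer's rule
    that dict inputs must have values of one consistent type.
    """
    if isinstance(obj, dict):
        value_types = [type(value).__name__ for value in obj.values()]
        for other in value_types[1:]:
            if other != value_types[0]:
                return (path, value_types[0], other)
        children = ((f"{path}[{key!r}]", value) for key, value in obj.items())
    elif isinstance(obj, (list, tuple)):
        children = ((f"{path}[{index}]", item) for index, item in enumerate(obj))
    else:
        return None  # leaf value: nothing to check
    # Homogeneous container: keep searching nested containers.
    for child_path, child in children:
        found = first_inconsistent_dict(child, child_path)
        if found is not None:
            return found
    return None


# Stand-in for the img_metas printed in the traceback (scale_factor was a
# NumPy array there; a tuple keeps this example dependency-free).
img_metas = [{
    "img_shape": (256, 256, 3),
    "ori_shape": (256, 256, 3),
    "pad_shape": (256, 256, 3),
    "filename": "<demo>.png",
    "scale_factor": (1.1, 1.2, 1.1, 1.2),
    "flip": False,
    "flip_direction": None,
}]

# None stands in for the image tensor, which is a leaf and passes the check.
inputs = (None, img_metas)
print(first_inconsistent_dict(inputs))  # prints ('inputs[1][0]', 'tuple', 'str')
```

The reported pair (tuple, then str) mirrors "Found Tuple[int, int, int] and str" in the error. Keeping such metadata out of the arguments passed to add_graph would avoid this particular check; whether the rest of the detector then traces cleanly is not something this sketch establishes.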

Bug fix

I read issues #6722, #5140, #5018, #5076 and #1006, but I do not think they fix this problem, unless I really missed something. Those issues were solved by upgrading mmcv or torch. My torch version is 1.8.1 and my mmcv version is 1.3.17, which are equal to or newer than the versions suggested there.

As far as I understand, to visualize metrics I need to use hooks that get scalars from the runner. For the graph, the only option I can see in the mmdet docs is exporting to ONNX, but in my case it did not work. Moreover, I aim to work on a more complex model that mixes Deformable Attention layers and a language model, so I am not sure ONNX would be usable for visualizing such complex layers.
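As a sketch of the hook route for scalars: in MMDetection 2.x configs, log_config lists MMCV logger hooks, and configs/_base_/default_runtime.py ships a TensorboardLoggerHook entry commented out. Enabling it in a user config could look like the fragment below, assuming the MMCV 1.x hook names used in this environment (interval=50 is simply the base config's value). It logs scalars such as losses, not the model graph.

```python
# Scalar logging to TensorBoard through MMCV logger hooks (MMCV 1.x names).
log_config = dict(
    interval=50,  # log every 50 iterations
    hooks=[
        dict(type='TextLoggerHook'),
        dict(type='TensorboardLoggerHook'),  # scalars only, no model graph
    ])
```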

So I am starting to think there is no way to visualize an MMDetection model in TensorBoard :( I would love to keep using the mmdet library without giving up native PyTorch visualization.

Please let me know if you need more details, or tell me if I missed something.

Regards, Mathias.

Czm369 commented 2 years ago

The TensorBoard-related functions should now be updated, so you can try again.

SmallingCar commented 2 years ago

can't work

AwePhD commented 2 years ago

can't work

What do you mean? Is my method wrong, or is it related to something else? Could you give more details, please?

SmallingCar commented 2 years ago

I mean that when I insert this code into my program, I cannot get a picture of my neural network's graph. What should I do next? Please give me some advice. Thank you sincerely.

On 2022-11-08 00:29:43, "Mathias Réus" @.***> wrote:

can't work

What do you mean? Is my method bad? Or is it related to something else? Can you detail more please?

