pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

[Caffe2] Operators of Detectron module not registered/compiled when built on windows #7912

Open · II-Matto opened this issue 6 years ago

II-Matto commented 6 years ago

Issue description

I am using Caffe2 + Detectron on Windows. After successfully building Caffe2 (with CUDA, cuDNN and OpenCV), COCOAPI and the Detectron modules, I ran the tools/train_net.py script in Detectron to train Faster R-CNN on Pascal VOC. The following error appeared, reporting that the Detectron operator AffineChannel is not registered. With other configurations, similar errors occur for other Detectron operators.

...
  File "D:/repo/github/Detectron_facebookresearch\detectron\utils\train.py", line 53, in train_model
    model, weights_file, start_iter, checkpoints, output_dir = create_model()
  File "D:/repo/github/Detectron_facebookresearch\detectron\utils\train.py", line 132, in create_model
    model = model_builder.create(cfg.MODEL.TYPE, train=True)
  File "D:/repo/github/Detectron_facebookresearch\detectron\modeling\model_builder.py", line 124, in create
    return get_func(model_type_func)(model)
  File "D:/repo/github/Detectron_facebookresearch\detectron\modeling\model_builder.py", line 89, in generalized_rcnn
    freeze_conv_body=cfg.TRAIN.FREEZE_CONV_BODY
  File "D:/repo/github/Detectron_facebookresearch\detectron\modeling\model_builder.py", line 229, in build_generic_detection_model
    optim.build_data_parallel_model(model, _single_gpu_build_func)
  File "D:/repo/github/Detectron_facebookresearch\detectron\modeling\optimizer.py", line 40, in build_data_parallel_model
    all_loss_gradients = _build_forward_graph(model, single_gpu_build_func)
  File "D:/repo/github/Detectron_facebookresearch\detectron\modeling\optimizer.py", line 63, in _build_forward_graph
    all_loss_gradients.update(single_gpu_build_func(model))
  File "D:/repo/github/Detectron_facebookresearch\detectron\modeling\model_builder.py", line 169, in _single_gpu_build_func
    blob_conv, dim_conv, spatial_scale_conv = add_conv_body_func(model)
  File "D:/repo/github/Detectron_facebookresearch\detectron\modeling\ResNet.py", line 36, in add_ResNet50_conv4_body
    return add_ResNet_convX_body(model, (3, 4, 6))
  File "D:/repo/github/Detectron_facebookresearch\detectron\modeling\ResNet.py", line 98, in add_ResNet_convX_body
    p, dim_in = globals()[cfg.RESNETS.STEM_FUNC](model, 'data')
  File "D:/repo/github/Detectron_facebookresearch\detectron\modeling\ResNet.py", line 252, in basic_bn_stem
    p = model.AffineChannel(p, 'res_conv1_bn', dim=dim, inplace=True)
  File "D:/repo/github/Detectron_facebookresearch\detectron\modeling\detector.py", line 103, in AffineChannel
    return self.net.AffineChannel([blob_in, scale, bias], blob_in)
  File "D:/repo/github/pytorch/build\caffe2\python\core.py", line 2067, in __getattr__
    ",".join(workspace.C.nearby_opnames(op_type)) + ']'
AttributeError: Method AffineChannel is not a registered operator. Did you mean: []
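The traceback ends in `__getattr__` in caffe2/python/core.py, which raises when an operator type is missing from the registry of the running process. A minimal sketch for listing which required operators are absent, assuming `caffe2.python.workspace.RegisteredOperators()` returns the registered operator type names; the helper `missing_operators` is hypothetical:

```python
from typing import Iterable, List, Optional


def missing_operators(required: Iterable[str],
                      registered: Optional[Iterable[str]] = None) -> List[str]:
    """Return the operator types in `required` that are not registered (hypothetical helper).

    With `registered=None` the live Caffe2 registry is queried, which needs the
    Caffe2 Python package on sys.path; passing a collection keeps it testable.
    """
    if registered is None:
        # Assumed API: returns the type names of all registered operators.
        from caffe2.python import workspace
        registered = workspace.RegisteredOperators()
    known = set(registered)
    # Sorted so the report is stable across runs.
    return sorted(op for op in set(required) if op not in known)
```

Called inside the training process as `missing_operators(["AffineChannel"])`, it would be expected to return `["AffineChannel"]` in the situation described here.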

I have modified import_detectron_ops() in detectron/utils/c2.py to load caffe2_detectron_ops_gpu.dll from its explicit path.
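A sketch of such a modification, assuming `dyndep.InitOpsLibrary` from `caffe2.python` (the loader Detectron's helper calls). As far as I know, `InitOpsLibrary` only prints a warning and returns when the file does not exist, so this hypothetical version checks the path first to make a wrong path fail loudly:

```python
import os


def import_detectron_ops(lib_path: str) -> str:
    """Load the Detectron ops DLL from an explicit path (hypothetical replacement).

    Returns the absolute path that was loaded.
    """
    lib_path = os.path.abspath(lib_path)
    if not os.path.isfile(lib_path):
        # Fail here instead of relying on a printed warning.
        raise FileNotFoundError("Detectron ops library not found: " + lib_path)
    # Imported lazily so the path check works even without Caffe2 installed.
    from caffe2.python import dyndep
    dyndep.InitOpsLibrary(lib_path)
    return lib_path
```

Note that a successful return only means the DLL was loaded; it does not prove that any operator was registered.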

I have added the following path with sys.path.insert(0, path) in the training script.

I have added the following path to my PATH variable.
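The two path changes above can be done idempotently from the training script. A sketch with a hypothetical helper `add_caffe2_paths`; the directories are placeholders for your own build layout:

```python
import os
import sys
from typing import List

# Handles returned by os.add_dll_directory are kept alive here.
_DLL_DIR_HANDLES: List[object] = []


def add_caffe2_paths(python_dir: str, dll_dir: str) -> None:
    """Make a Caffe2 build importable (hypothetical helper).

    python_dir: folder containing the `caffe2` package (prepended to sys.path).
    dll_dir:    folder containing the built DLLs (prepended to PATH).
    """
    if python_dir not in sys.path:
        sys.path.insert(0, python_dir)
    entries = os.environ.get("PATH", "").split(os.pathsep)
    if dll_dir not in entries:
        os.environ["PATH"] = os.pathsep.join([dll_dir] + [e for e in entries if e])
    if hasattr(os, "add_dll_directory"):  # only exists on Windows, Python 3.8+
        _DLL_DIR_HANDLES.append(os.add_dll_directory(dll_dir))
```

On the Python versions current when this issue was filed, PATH was what Windows searched for dependent DLLs; from Python 3.8 on, extension-module DLL dependencies are no longer resolved through PATH, hence the extra `os.add_dll_directory` call.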

All of the imports appear to succeed, so I assume the environment setup is OK.
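Successful imports do not by themselves show that the operators were registered. A hedged sketch of a check that compares the registry before and after loading the library; `ops_registered_by` is a hypothetical helper, and in real use `load` would wrap `dyndep.InitOpsLibrary(...)` while `list_ops` would be `workspace.RegisteredOperators`:

```python
from typing import Callable, Iterable, List


def ops_registered_by(load: Callable[[], object],
                      list_ops: Callable[[], Iterable[str]]) -> List[str]:
    """Return operator types that appear in the registry only after load() runs."""
    before = set(list_ops())
    load()
    after = set(list_ops())
    return sorted(after - before)
```

Run it in a fresh process: if the DLL was already loaded earlier, the diff is empty for that reason alone. An empty result right after a first load would suggest that the library loads without adding anything to the registry Python sees.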

I used the dumpbin tool to examine my caffe2_detectron_ops_gpu.dll, which is only about 5.5 MB in size.

With the /EXPORTS option, the output is as follows:

Microsoft (R) COFF/PE Dumper Version 14.00.24215.1
Copyright (C) Microsoft Corporation.  All rights reserved.

Dump of file D:\repo\github\pytorch\build\bin\Release\caffe2_detectron_ops_gpu.dll

File Type: DLL

  Section contains the following exports for caffe2_detectron_ops_gpu.dll

    00000000 characteristics
    5B0CBE13 time date stamp Tue May 29 10:42:27 2018
        0.00 version
           1 ordinal base
           1 number of functions
           1 number of names

    ordinal hint RVA      name

          1    0 003393E8 NvOptimusEnablementCuda

  Summary

       13000 .data
        1000 .gfids
        1000 .nvFatBi
      20C000 .nv_fatb
       23000 .pdata
       E6000 .rdata
        5000 .reloc
        1000 .rsrc
      252000 .text
        1000 .tls
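To repeat this check across rebuilds, the export table can be extracted from dumpbin's text output. A sketch with hypothetical helper names, assuming `dumpbin` is on PATH (as in a Visual Studio developer prompt). As far as I understand Caffe2's registration macros, operators register themselves through static initializers when the DLL loads rather than through exported functions, so an export table that lists only NvOptimusEnablementCuda is not by itself proof that the operators are missing:

```python
import re
import subprocess
from typing import List

# One export-table row: ordinal, hint, RVA, name.
_EXPORT_ROW = re.compile(r"^\s*(\d+)\s+([0-9A-Fa-f]+)\s+([0-9A-Fa-f]{8})\s+(\S+)")


def parse_dumpbin_exports(text: str) -> List[str]:
    """Extract exported names from `dumpbin /EXPORTS` output."""
    names: List[str] = []
    in_table = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("ordinal hint RVA"):
            in_table = True  # table header found; rows follow
        elif in_table and stripped == "Summary":
            break  # end of the export table
        elif in_table:
            match = _EXPORT_ROW.match(line)
            if match:
                names.append(match.group(4))
    return names


def dll_exports(dll_path: str) -> List[str]:
    """Run dumpbin /EXPORTS on a DLL and parse the result."""
    result = subprocess.run(["dumpbin", "/EXPORTS", dll_path],
                            capture_output=True, text=True, check=True)
    return parse_dumpbin_exports(result.stdout)
```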

With the /SYMBOLS option, the output is as follows:

Microsoft (R) COFF/PE Dumper Version 14.00.24215.1
Copyright (C) Microsoft Corporation.  All rights reserved.

Dump of file D:\repo\github\pytorch\build\bin\Release\caffe2_detectron_ops_gpu.dll

File Type: DLL

  Summary

       13000 .data
        1000 .gfids
        1000 .nvFatBi
      20C000 .nv_fatb
       23000 .pdata
       E6000 .rdata
        5000 .reloc
        1000 .rsrc
      252000 .text
        1000 .tls
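The Summary block lists section sizes in hexadecimal. Summing them gives 5,779,456 bytes (about 5.5 MiB), consistent with the file size above; roughly 2.3 MiB is `.text` and 2.0 MiB is `.nv_fatb`, which, as far as I know, is where nvcc places embedded CUDA device code. An empty /SYMBOLS listing is also common for linked images, since their symbols usually live in the PDB. A sketch of the arithmetic, with a hypothetical parser name:

```python
import re
from typing import Dict

# Summary rows look like "      252000 .text": a hex size, then a section name.
_SUMMARY_ROW = re.compile(r"^\s*([0-9A-Fa-f]+)\s+(\.\S+)\s*$")


def parse_dumpbin_summary(text: str) -> Dict[str, int]:
    """Map section name to size in bytes from the Summary block of dumpbin output."""
    sizes: Dict[str, int] = {}
    in_summary = False
    for line in text.splitlines():
        if line.strip() == "Summary":
            in_summary = True
            continue
        if in_summary:
            match = _SUMMARY_ROW.match(line)
            if match:
                sizes[match.group(2)] = int(match.group(1), 16)  # hex to bytes
    return sizes
```

`sum(parse_dumpbin_summary(output).values())` gives the total quoted above for this DLL's output.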

Does this mean the Detectron operators were not actually compiled? If so, what could be the reason, and how can I get them to compile?

Code example

  1. Build Caffe2 with CUDA, cuDNN, OpenCV.
  2. Build COCOAPI modules.
  3. Build Detectron modules.
  4. Add the paths described above.
  5. Run tools/train_net.py in Detectron with proper arguments.
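Steps 4 and 5 can be partly scripted. A sketch with a hypothetical helper that checks the configured paths exist before building the training command; it assumes Detectron's `--cfg` flag followed by optional `KEY VALUE` overrides, and the config path you pass is your own:

```python
import os
import sys
from typing import List, Sequence


def build_train_command(detectron_root: str, cfg_file: str,
                        required_paths: Sequence[str],
                        extra_opts: Sequence[str] = ()) -> List[str]:
    """Check the paths from step 4, then build the step 5 command (hypothetical helper)."""
    missing = [p for p in required_paths if not os.path.exists(p)]
    if missing:
        raise FileNotFoundError("missing paths: " + ", ".join(missing))
    script = os.path.join(detectron_root, "tools", "train_net.py")
    # Use the current interpreter so the same environment is inherited.
    return [sys.executable, script, "--cfg", cfg_file, *extra_opts]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`.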

System Info

II-Matto commented 6 years ago

FYI, there are some discussions on this problem in issue https://github.com/facebookresearch/Detectron/issues/454.

andreigh commented 6 years ago

II-Matto, did you manage to get it running somehow?

anhnp82 commented 5 years ago

By just moving all the files under modules/detectron into caffe2/operators, I can now use Detectron's operators within Caffe2 on Windows.
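A hedged sketch of that workaround with a hypothetical helper. It only moves source files (the suffix list is an assumption), leaves modules/detectron/CMakeLists.txt in place so the destination's own CMakeLists.txt is never overwritten, and refuses to run if any name collides. It defaults to a dry run. After moving, CMake has to be re-run; I have not checked how the now-empty modules/detectron target behaves, so adjust if the configure step complains:

```python
import shutil
from pathlib import Path
from typing import List

# Suffixes treated as operator sources; an assumption about the folder contents.
SOURCE_SUFFIXES = {".cc", ".cu", ".h"}


def move_detectron_ops(pytorch_root: str, dry_run: bool = True) -> List[str]:
    """Move modules/detectron sources into caffe2/operators (hypothetical helper)."""
    root = Path(pytorch_root)
    src_dir = root / "modules" / "detectron"
    dst_dir = root / "caffe2" / "operators"
    sources = [p for p in sorted(src_dir.iterdir())
               if p.is_file() and p.suffix in SOURCE_SUFFIXES]
    # Check every collision before moving anything, to avoid a half-done move.
    clashes = [p.name for p in sources if (dst_dir / p.name).exists()]
    if clashes:
        raise FileExistsError("already in caffe2/operators: " + ", ".join(clashes))
    if not dry_run:
        for src in sources:
            shutil.move(str(src), str(dst_dir / src.name))
    return [p.name for p in sources]
```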