microsoft / nni

An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License
13.97k stars, 1.81k forks

Comparison exception: The values for attribute 'shape' do not match: torch.Size([]) != torch.Size([1, 1, 40, 40, 2]). #5679

Open aidevmin opened 11 months ago

aidevmin commented 11 months ago

Describe the issue: I am pruning a yolov7 model with L1NormPruner. I followed this guide: https://github.com/microsoft/nni/blob/master/examples/compression/pruning/norm_pruning.py . I added the following code after this line: https://github.com/WongKinYiu/yolov7/blob/84932d70fb9e2932d0a70e4a1f02a1d6dd1dd6ca/train.py#L100

    from nni.compression.pruning import L1NormPruner, L2NormPruner, FPGMPruner
    from nni.compression.speedup import ModelSpeedup
    from nni.compression.utils import auto_set_denpendency_group_ids

    # Target a sparsity ratio of 0.5 for every Conv2d layer.
    config_list = [{
        # 'total_sparsity': 0.1,
        'sparse_ratio': 0.5,
        'op_types': ['Conv2d'],
    }]

    # Group layers that depend on each other; this call traces the model with dummy_input.
    dummy_input = torch.rand([1, 3, 640, 640]).to(device)
    config_list = auto_set_denpendency_group_ids(model, config_list, dummy_input)

    # Generate the masks, then remove the pruner's wrappers from the model.
    pruner = L1NormPruner(model, config_list)
    _, masks = pruner.compress()
    pruner.unwrap_model()

    # Apply the masks with ModelSpeedup, save the whole model object, and stop.
    model = ModelSpeedup(model, dummy_input, masks).speedup_model()
    torch.save(model, "pruning_nni_yolov7.pt")
    exit()

But I got this error:

        First diverging operator:
        Node diff:
                - %model : __torch__.torch.nn.modules.container.___torch_mangle_398.Sequential = prim::GetAttr[name="model"](%self.1)
                ?                                                               --
                + %model : __torch__.torch.nn.modules.container.___torch_mangle_812.Sequential = prim::GetAttr[name="model"](%self.1)
                ?                                                                ++
ERROR: Tensor-valued Constant nodes differed in value across invocations. This often indicates that the tracer has encountered untraceable code.
        Node:
                %2445 : Tensor = prim::Constant[value={2}](), scope: __module.model.105 # /opt/nvidia/deepstream/deepstream-6.2/sources/yolo_deepstream/FINAL_INVESTIGATION/yolov7_removestem/models/yolo.py:135:0
        Source Location:
                /opt/nvidia/deepstream/deepstream-6.2/sources/yolo_deepstream/FINAL_INVESTIGATION/yolov7_removestem/models/yolo.py(135): forward
                /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(1488): _slow_forward
                /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(1501): _call_impl
                /opt/nvidia/deepstream/deepstream-6.2/sources/yolo_deepstream/FINAL_INVESTIGATION/yolov7_removestem/models/yolo.py(625): forward_once
                /opt/nvidia/deepstream/deepstream-6.2/sources/yolo_deepstream/FINAL_INVESTIGATION/yolov7_removestem/models/yolo.py(599): forward
                /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(1488): _slow_forward
                /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py(1501): _call_impl
                /usr/local/lib/python3.8/dist-packages/torch/jit/_trace.py(1056): trace_module
                /usr/local/lib/python3.8/dist-packages/torch/jit/_trace.py(794): trace
                /usr/local/lib/python3.8/dist-packages/nni/common/graph_utils.py(91): _trace
                /usr/local/lib/python3.8/dist-packages/nni/common/graph_utils.py(67): __init__
                /usr/local/lib/python3.8/dist-packages/nni/common/graph_utils.py(265): __init__
                /usr/local/lib/python3.8/dist-packages/nni/compression/utils/shape_dependency.py(58): __init__
                /usr/local/lib/python3.8/dist-packages/nni/compression/utils/shape_dependency.py(135): __init__
                /usr/local/lib/python3.8/dist-packages/nni/compression/utils/dependency.py(34): auto_set_denpendency_group_ids
                train_pruning_yolov7.py(112): train
                train_pruning_yolov7.py(639): <module>
        Comparison exception:   The values for attribute 'shape' do not match: torch.Size([]) != torch.Size([1, 1, 40, 40, 2]).

The same error also occurs with L2NormPruner and FPGMPruner. I attached the log file, pruning_l1norm.log. @J-shang @ultmaster please help me.

MarkusDrange commented 11 months ago

I had a similar issue and solved it by putting the model in eval() mode and passing a dummy input through it once before pruning.
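No code was posted with this suggestion, so the following is only a minimal sketch of what it might look like. `LazyGridHead` and `warm_up` are hypothetical names made up for illustration; the toy head assumes, without confirmation from this thread, that the detection head builds and caches a grid on its first inference call, which is one plausible reason a warm-up pass could make tracing consistent.

```python
import torch
import torch.nn as nn


class LazyGridHead(nn.Module):
    """Hypothetical toy head that caches a coordinate grid on first inference."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 2, kernel_size=1)
        self.grid = torch.zeros(1)  # placeholder until the first eval-mode forward

    def forward(self, x):
        y = self.conv(x)
        if self.training:
            return y
        bs, _, ny, nx = y.shape
        # Rebuild the cached grid only when the feature-map size changes.
        if self.grid.shape[2:4] != y.shape[2:4]:
            xv = torch.arange(nx).view(1, nx).expand(ny, nx)
            yv = torch.arange(ny).view(ny, 1).expand(ny, nx)
            self.grid = torch.stack((xv, yv), 2).view(1, 1, ny, nx, 2).float()
        y = y.view(bs, 1, 2, ny, nx).permute(0, 1, 3, 4, 2)
        return y + self.grid


def warm_up(model, dummy_input):
    """Switch to eval mode and run one forward pass so lazy state is initialized."""
    model.eval()
    with torch.no_grad():
        model(dummy_input)
    return model


model = LazyGridHead()
dummy_input = torch.rand(1, 3, 40, 40)
warm_up(model, dummy_input)
print(model.grid.shape)
```

In the script from the first post, the equivalent step would be calling `model.eval()` and `model(dummy_input)` just before `auto_set_denpendency_group_ids`; whether that resolves the yolov7 case is not confirmed in this thread.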

aidevmin commented 11 months ago

@MarkusDrange Thank you so much for the suggestion. I am going to try it later. Instead of using NNI, I used torch-pruning. Could you share your experience pruning yolov7 with NNI? Is the trade-off between mAP and speed good?

aidevmin commented 11 months ago

@MarkusDrange Can you save the pruned yolov7 model using torch.save(model, <path-to-save>)?

MarkusDrange commented 11 months ago

Sorry, I am not really working on an identical case. I am working on tracing a yolov8 model and only mentioned the solution because the error message I got was very similar to yours.

A possible fix: the yolov7 model may also have a hierarchy of classes (as my yolov8 does), in which case model.model is the actual model that you want to save.
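As a rough illustration of that idea (not code from this thread), the sketch below uses a hypothetical `DetectionWrapper` class and a hypothetical `unwrap` helper to pick out the inner `model` attribute before calling `torch.save`. Whether the yolov7 object in the original script is wrapped this way is an assumption.

```python
import os
import tempfile

import torch
import torch.nn as nn


class DetectionWrapper(nn.Module):
    """Hypothetical wrapper: the real network lives in the `model` attribute."""

    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.ReLU())

    def forward(self, x):
        return self.model(x)


def unwrap(module):
    """Return module.model when it is itself an nn.Module, else the module."""
    inner = getattr(module, "model", None)
    return inner if isinstance(inner, nn.Module) else module


wrapper = DetectionWrapper()
inner = unwrap(wrapper)

# Save the inner network rather than the outer wrapper object.
save_path = os.path.join(tempfile.mkdtemp(), "pruned_inner.pt")
torch.save(inner, save_path)
```

Saving `inner.state_dict()` instead of the whole object is a common alternative, since it does not depend on pickling the wrapper class.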

aidevmin commented 11 months ago

@MarkusDrange Thanks. I have one more question. Were you able to fine-tune with multiple GPUs after pruning?

Gooddz1 commented 7 months ago

I had a similar issue and solved it by putting the model in eval() and passing dummy input through it once, before doing pruning.

Can you share the implementation code?

Gooddz1 commented 6 months ago

Sorry, I am not really working on an identical case, I am working on tracing a yolov8 model and just mentioned the solution as the error message I got was very similar to yours.

A possible fix there could be that due to the fact that the yolov7 model possibly also has a hierarchy of classes (as my yolov8 has), model.model is the actual model that you want to save.

I followed your approach but still got the same error. Can you share the implementation code?