microsoft / nni

An open source AutoML toolkit to automate the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

RuntimeError: you can only change requires_grad flags of leaf variables. #5754

Open Cindytjj opened 3 months ago

Cindytjj commented 3 months ago

Describe the bug:

```
RuntimeError: you can only change requires_grad flags of leaf variables. If you want to use a computed variable in a subgraph that doesn't require differentiation use var_no_grad = var.detach().
```

When I use NNI to prune my custom transformer, the speedup at first looks fine and logs lines such as:

```
[2024-03-08 11:11:12] Update indirect mask for call_function: truediv,
[2024-03-08 11:11:12] Update indirect mask for call_function: sqrt,
[2024-03-08 11:11:12] Update indirect mask for call_function: getitem_13,
[2024-03-08 11:11:12] Update indirect mask for call_function: getattr_3,
[2024-03-08 11:11:12] Update indirect mask for call_method: transpose_2, output mask: 0.0000
[2024-03-08 11:11:12] Update indirect mask for call_method: view_2, output mask: 0.0000
[2024-03-08 11:11:12] Update indirect mask for call_module: encoder_encoder_layers_0_attention_value_projection, weight: 0.0000 bias: 0.0000 , output mask: 0.0000
```

until it throws the following error:

```
Traceback (most recent call last):
  File "F:\研究生学习文件\研二\时序预测算法\transformer\pythonProject2\0305\4.py", line 219, in <module>
    ModelSpeedup(model, dummy_input, masks).speedup_model()
  File "E:\ANACONDA\Anaconda\envs\torch\lib\site-packages\nni\compression\speedup\model_speedup.py", line 435, in speedup_model
    self.update_indirect_sparsity()
  File "E:\ANACONDA\Anaconda\envs\torch\lib\site-packages\nni\compression\speedup\model_speedup.py", line 306, in update_indirect_sparsity
    self.node_infos[node].mask_updater.indirect_update_process(self, node)
  File "E:\ANACONDA\Anaconda\envs\torch\lib\site-packages\nni\compression\speedup\mask_updater.py", line 160, in indirect_update_process
    output = getattr(model_speedup, node.op)(node.target, args_cloned, kwargs_cloned)
  File "E:\ANACONDA\Anaconda\envs\torch\lib\site-packages\torch\fx\interpreter.py", line 289, in call_method
    return getattr(self_obj, target)(*args_tail, **kwargs)
RuntimeError: you can only change requires_grad flags of leaf variables. If you want to use a computed variable in a subgraph that doesn't require differentiation use var_no_grad = var.detach().
```
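For context on what the error itself means: `requires_grad` can only be toggled on leaf tensors, and the `detach()` call mentioned in the message is the usual workaround. A minimal PyTorch snippet (unrelated to NNI) that triggers the same RuntimeError:

```python
import torch

x = torch.randn(3, requires_grad=True)  # leaf tensor: toggling requires_grad is allowed
y = x * 2                                # computed (non-leaf) tensor

# Uncommenting the next line raises:
# RuntimeError: you can only change requires_grad flags of leaf variables. ...
# y.requires_grad = False

# The workaround the message suggests:
y_no_grad = y.detach()  # detached tensor that no longer requires grad
```

So I suspect the indirect-sparsity pass of the speedup is trying to change `requires_grad` on a tensor that is computed inside my model's forward, but I cannot tell which part of my model causes this.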

I hit this error while using NNI to prune a transformer model I defined myself. I tried both L1NormPruner and MovementPruner, and I followed the official NNI transformer pruning example (without the knowledge distillation part), but in every attempt the error above is raised during the speedup step. I cannot tell whether my NNI configuration is wrong or whether my custom transformer model does not meet NNI's requirements, so I am asking for help.
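For completeness, the L1NormPruner attempt followed the one-shot pattern from the NNI docs. A simplified sketch (the `sparse_ratio` value is illustrative and the import paths assume the NNI 3.x compression API; my real script differs only in details):

```python
# Sketch of the L1NormPruner variant I also tried (illustrative values, NNI 3.x API assumed).
from nni.compression.pruning import L1NormPruner
from nni.compression.speedup import ModelSpeedup

config_list = [{
    'op_types': ['Linear'],
    'op_names_re': ['encoder.encoder_layers.0.attention.*'],
    'sparse_ratio': 0.5,          # illustrative sparsity, not the exact value I used
}]
pruner = L1NormPruner(model, config_list)
_, masks = pruner.compress()      # one-shot pruning, returns the masks
pruner.unwrap_model()

ModelSpeedup(model, dummy_input, masks).speedup_model()  # fails with the same RuntimeError
```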

Environment:

Reproduce the problem

```python
print(model)

config_list = [{
    'op_types': ['Linear'],
    'op_names_re': ['encoder.encoder_layers.0.attention.*'],
    'sparse_threshold': 0.1,
    'granularity': [4, 4]
}]
pruner = MovementPruner(model, config_list, evaluator,
                        warmup_step=10, cooldown_begin_step=20, regular_scale=20)
pruner.compress(40, 4)
print(model)

pruner.unwrap_model()
masks = pruner.get_masks()
dummy_input = (torch.randint(0, 1, (32, 16, 1)).to(device).float(),
               torch.randint(0, 1, (32, 16, 1)).to(device).float())

replacer = TransformersAttentionReplacer(model)

ModelSpeedup(model, dummy_input, masks).speedup_model()
```
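One thing I am not sure about in the script above: `replacer` is constructed but never passed to `ModelSpeedup`. In the official transformer pruning example the replacer is handed to the speedup step, via a `customized_replacers` argument if I read the example correctly (that keyword is my assumption from the example, and I also do not know whether `TransformersAttentionReplacer` even applies to a non-HuggingFace model like mine):

```python
# Assumption: `customized_replacers` is the ModelSpeedup keyword used in the official
# BERT pruning example; unclear whether it applies to a custom (non-HuggingFace) transformer.
replacer = TransformersAttentionReplacer(model)
ModelSpeedup(model, dummy_input, masks, customized_replacers=[replacer]).speedup_model()
```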

The model structure (output of `print(model)`):

```
CustomTransformer(
  (embedding): Linear(in_features=1, out_features=64, bias=True)
  (positional_encoding): PositionalEncoding(
    (dropout): Dropout(p=0, inplace=False)
  )
  (encoder): Encoder(
    (encoder_layers): ModuleList(
      (0): Encoderlayer(
        (attention): AttentionLayer(
          (inner_attention): FullAttention(
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (query_projection): Linear(in_features=64, out_features=64, bias=True)
          (key_projection): Linear(in_features=64, out_features=64, bias=True)
          (value_projection): Linear(in_features=64, out_features=64, bias=True)
          (out_projection): Linear(in_features=64, out_features=64, bias=True)
        )
        (norm1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (linear): Linear(in_features=64, out_features=64, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (linear_layers): ModuleList(
      (0): Linear(in_features=64, out_features=64, bias=True)
    )
  )
  (decoder): Decoder(
    (decoder_layers): ModuleList(
      (0): Decoderlayer(
        (self_attention): AttentionLayer(
          (inner_attention): FullAttention(
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (query_projection): Linear(in_features=64, out_features=64, bias=True)
          (key_projection): Linear(in_features=64, out_features=64, bias=True)
          (value_projection): Linear(in_features=64, out_features=64, bias=True)
          (out_projection): Linear(in_features=64, out_features=64, bias=True)
        )
        (cross_attention): AttentionLayer(
          (inner_attention): FullAttention(
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (query_projection): Linear(in_features=64, out_features=64, bias=True)
          (key_projection): Linear(in_features=64, out_features=64, bias=True)
          (value_projection): Linear(in_features=64, out_features=64, bias=True)
          (out_projection): Linear(in_features=64, out_features=64, bias=True)
        )
        (norm1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (norm3): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
        (linear1): Linear(in_features=64, out_features=256, bias=True)
        (linear2): Linear(in_features=256, out_features=64, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (norm): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
  )
  (fc_in): Linear(in_features=64, out_features=64, bias=True)
  (relu): ReLU()
  (dropout): Dropout(p=0.1, inplace=False)
  (fc_out): Linear(in_features=64, out_features=1, bias=True)
)
```