kssteven418 / I-BERT

[ICML'21 Oral] I-BERT: Integer-only BERT Quantization
https://arxiv.org/abs/2101.01321
MIT License

Cannot run int8 inference with the quantized model on my device #15

Open deepfind opened 2 years ago

deepfind commented 2 years ago

I want to test the accuracy and time consumption of I-BERT in int8 inference, so I installed transformers and quantized the roberta-base model to generate the weights. I set `quant_mode` to `true` and `torch_dtype` to `int8`. However, the int8 inference time of I-BERT on my 1080 Ti is similar to that of the roberta-base model. Is there a problem with my config.json or my device? Here is my config.json:

```json
{
  "_name_or_path": "./outputs/checkpoint-1150/",
  "architectures": [
    "IBertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "finetuning_task": "mrpc",
  "force_dequant": "none",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": 0,
    "1": 1
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "0": 0,
    "1": 1
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "ibert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "quant_mode": true,
  "tokenizer_class": "RobertaTokenizer",
  "torch_dtype": "int8",
  "transformers_version": "4.11.0.dev0",
  "type_vocab_size": 1,
  "vocab_size": 50265
}
```
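For reference, a minimal latency harness along these lines might look like the sketch below. This is only an illustration, not the exact code used here: it assumes the fine-tuned checkpoint path from the config above and the standard Hugging Face transformers API (`IBertForSequenceClassification`); the input sentences are placeholders.

```python
import time

import torch
from transformers import AutoTokenizer, IBertForSequenceClassification

# Path taken from the config above; substitute your own checkpoint.
checkpoint = "./outputs/checkpoint-1150/"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = IBertForSequenceClassification.from_pretrained(checkpoint).to(device).eval()

# Placeholder MRPC-style sentence pair.
inputs = tokenizer(
    "A man is playing a guitar.",
    "Someone is playing an instrument.",
    return_tensors="pt",
).to(device)

with torch.no_grad():
    # Warm up before timing so one-time setup costs are excluded.
    for _ in range(10):
        model(**inputs)
    if device == "cuda":
        torch.cuda.synchronize()  # ensure pending GPU work is done
    start = time.perf_counter()
    for _ in range(100):
        logits = model(**inputs).logits
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"avg latency: {elapsed / 100 * 1000:.2f} ms")
print("predicted label:", logits.argmax(-1).item())
```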

NASUAS commented 4 months ago

Could you explain how you tested the accuracy and time consumption of I-BERT in int8 inference? Would you mind sharing your test code?