intel / intel-npu-acceleration-library

Intel® NPU Acceleration Library
Apache License 2.0

Does linear support input tensors with dtype int8? #57

Open Septend-fun opened 2 months ago

Septend-fun commented 2 months ago

Hi, experts. It seems that matmul with both the weights and the input tensor in int8 is not supported, right? I have to convert the weights to fp16 when using the matmul op.

The relevant code is in src/bindings.cpp:

intel_npu_acceleration_library_DLL_API ov::op::Op* linear(intel_npu_acceleration_library::ModelFactory* factory,
                                                          ov::op::Op* in0, size_t dim0, size_t dim1, bool bias,
                                                          char* act_dtype, char* wt_dtype) {
    ov::element::Type_t act_ov_dtype = intel_npu_acceleration_library::dtype_from_string(std::string(act_dtype));
    ov::element::Type_t wt_ov_dtype = intel_npu_acceleration_library::dtype_from_string(std::string(wt_dtype));

    // Only the weights may be quantized (i8/i4); activations keep act_dtype.
    bool quantized = wt_ov_dtype == ov::element::Type_t::i8 || wt_ov_dtype == ov::element::Type_t::i4;

    auto weights = factory->parameter({dim0, dim1}, wt_ov_dtype);
    if (quantized) {
        // Dequantize: cast the integer weights to the activation dtype before the matmul.
        weights = factory->convert_to(weights, act_ov_dtype);
    }

    auto mm = factory->matmul(in0, weights);

    if (quantized) {
        // Per-output-channel scale finishes the dequantization.
        auto scale = factory->parameter({1, dim0}, act_ov_dtype);
        mm = factory->eltwise_mul(mm, scale);
    }

    if (bias) {
        auto bias = factory->parameter({1, dim0}, act_ov_dtype);
        return factory->eltwise_add(mm, bias);
    }
    return mm;
}
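
For clarity, here is a minimal NumPy sketch of what this quantized path computes. This is not the library's API: the shapes assume the factory matmul treats the {dim0, dim1} weight parameter as {outC, inC} and multiplies by its transpose, and that the {1, dim0} scale is per output channel.

import numpy as np

batch, inC, outC = 32, 4096, 11008  # example sizes, chosen arbitrarily
X = np.random.uniform(-1, 1, (batch, inC)).astype(np.float16)
W_i8 = np.random.randint(-127, 127, (outC, inC), dtype=np.int8)
scale = np.random.uniform(0.001, 0.01, (1, outC)).astype(np.float16)

# convert_to(weights, act_dtype): cast the int8 weights to the activation dtype
W_f = W_i8.astype(np.float16)
# matmul(in0, weights): (batch, inC) x (inC, outC) -> (batch, outC)
# (accumulated in fp32 here purely for numerical safety in NumPy)
mm = X.astype(np.float32) @ W_f.astype(np.float32).T
# eltwise_mul(mm, scale): per-output-channel rescale finishes the dequantization
out = (mm * scale.astype(np.float32)).astype(np.float16)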

If I set act_dtype to int8, I get this error: Matmul op #0 must be ranked tensor of 16 bit float or 32 bit float or 32 bit int , but got tensor<1x16x16xsi8>. It is probably caused by OpenVINO, but I think the NPU supports int8 × int8 matmul, right?
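
For comparison, the quantized configuration that does work keeps the activations in fp16 and only the weights in int8, along the lines of the library's quantized-matmul README example (a sketch; the QMatMul signature and scale layout follow that example and may differ across versions):

from intel_npu_acceleration_library.backend import QMatMul
import numpy as np

inC, outC, batch = 4096, 11008, 32  # example sizes

# Activations stay fp16; only the weights are int8, with a per-channel scale.
X = np.random.uniform(-1, 1, (batch, inC)).astype(np.float16)
W = np.random.randint(-127, 127, (outC, inC), dtype=np.int8)
scale = np.random.uniform(-1, 1, (outC, 1)).astype(np.float16)

mm = QMatMul(inC, outC, batch)
result = mm.run(X, W, scale)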

Septend-fun commented 2 months ago

Hi, I have another question, about NPU latency. These are the results I got when testing the matmul op:

If batch=32, inC=4096, outC=11008, the latency is 16.58 ms;
If batch=32, inC=11008, outC=4096, the latency is 2.3 ms.

These two cases have the same FLOP count and similar I/O, so why is the latency so different?
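
For reference, a minimal timing sketch along these lines (hypothetical: MatMul(inC, outC, batch) and run(X, W) follow the project's README example and may differ across versions; the warm-up call keeps graph compilation out of the measurement):

import time
import numpy as np
from intel_npu_acceleration_library.backend import MatMul

def bench_matmul(batch, inC, outC, iters=100):
    X = np.random.uniform(-1, 1, (batch, inC)).astype(np.float16)
    W = np.random.uniform(-1, 1, (outC, inC)).astype(np.float16)
    mm = MatMul(inC, outC, batch)
    mm.run(X, W)  # warm-up so compilation is not timed
    t0 = time.perf_counter()
    for _ in range(iters):
        mm.run(X, W)
    return (time.perf_counter() - t0) / iters * 1e3  # mean latency in ms

print(bench_matmul(32, 4096, 11008))  # case 1
print(bench_matmul(32, 11008, 4096))  # case 2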

alessandropalla commented 2 months ago

> Hi, I have another question, about NPU latency. These are the results I got when testing the matmul op:
>
> If batch=32, inC=4096, outC=11008, the latency is 16.58 ms;
> If batch=32, inC=11008, outC=4096, the latency is 2.3 ms.
>
> These two cases have the same FLOP count and similar I/O, so why is the latency so different?

Sorry, I cannot reproduce this behavior.

Also, op support is ongoing, so stay tuned for new operations and dtypes to come.

Septend-fun commented 2 months ago

Thanks for your reply. So in your test you got similar latencies for both cases, right? It may be caused by my environment; I'll check it.

alessandropalla commented 1 month ago

Any update? I'm happy to help if you need it. Otherwise I'll close the issue.