intel / neural-speed

An innovative library for efficient LLM inference via low-bit quantization
https://github.com/intel/neural-speed
Apache License 2.0

[Fusion]enable bloom mha fusion #286

Closed intellinjun closed 5 months ago

intellinjun commented 5 months ago

Enable Bloom MHA fusion.

q4_j_b32 result: (image)
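For context, a minimal sketch of what "MHA fusion" refers to here, making no assumptions about the actual neural-speed kernel: Bloom's attention (QK^T scores plus the per-head ALiBi bias, softmax, then the value product) is computed in one pass rather than as separate graph ops, so the full score/probability matrices never need to be materialized between ops. All names below are hypothetical and the snippet is illustrative only.

```python
# Illustrative sketch only -- not the neural-speed kernel or its API.
import numpy as np

def naive_bloom_attention(q, k, v, alibi_bias):
    """Unfused reference: scores, softmax, and context as separate tensor ops."""
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)    # [heads, seq, seq]
    scores += alibi_bias                               # Bloom adds an ALiBi bias per head
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return probs @ v                                   # [heads, seq, head_dim]

def fused_bloom_attention(q, k, v, alibi_bias):
    """Fused sketch: the same math row by row, so no full score/prob matrix
    is ever stored as an intermediate tensor between ops."""
    heads, seq, d = q.shape
    out = np.empty_like(q)
    for h in range(heads):
        for i in range(seq):
            row = q[h, i] @ k[h].T / np.sqrt(d) + alibi_bias[h, i]
            row = np.exp(row - row.max())
            out[h, i] = (row / row.sum()) @ v[h]
    return out

# Quick check that both paths compute the same attention output.
heads, seq, d = 2, 4, 8
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((heads, seq, d)) for _ in range(3))
alibi = rng.standard_normal((heads, seq, seq))
assert np.allclose(naive_bloom_attention(q, k, v, alibi),
                   fused_bloom_attention(q, k, v, alibi), atol=1e-6)
```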

intellinjun commented 5 months ago


without mha:

model | quantization | batch | input | output | cores/instance | first_token (ms) | next_token (ms) | total (ms)
-- | -- | -- | -- | -- | -- | -- | -- | --
bloom_7B | q4_j_b128 | 1 | 32 | 32 | 56 | 125.17 | 23.13 | 842.2
bloom_7B | q4_j_b32 | 1 | 32 | 32 | 56 | 208.72 | 28.69 | 1098.11
bloom_7B | q4_j_b128 | 1 | 1024 | 32 | 56 | 14848.17 | 47.54 | 16321.91
bloom_7B | q4_j_b32 | 1 | 1024 | 32 | 56 | 15855.96 | 53.83 | 17524.69

with mha on machine 4:

model | quantization | batch | input | output | cores/instance | first_token (ms) | next_token (ms) | total (ms)
-- | -- | -- | -- | -- | -- | -- | -- | --
bloom_7B | q4_j_b128 | 1 | 32 | 32 | 56 | 90.62 | 17.47 | 632.19
bloom_7B | q4_j_b32 | 1 | 32 | 32 | 56 | 175.99 | 23.3 | 898.29
bloom_7B | q4_j_b128 | 1 | 1024 | 32 | 56 | 1076.52 | 19.48 | 1680.4
bloom_7B | q4_j_b32 | 1 | 1024 | 32 | 56 | 1916.32 | 27.24 | 2760.76
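Reading the two blocks against each other (ratios derived directly from the rows above): with MHA fusion, first_token latency for the 1024-token prompt drops from 14848.17 to 1076.52 for q4_j_b128 (about 13.8x) and from 15855.96 to 1916.32 for q4_j_b32 (about 8.3x); next_token latency improves roughly 1.2x to 1.3x for the 32-token prompt and roughly 2.0x to 2.4x for the 1024-token prompt.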