microsoft / DirectML

DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported hardware and drivers, including all DirectX 12-capable GPUs from vendors such as AMD, Intel, NVIDIA, and Qualcomm.
MIT License

Timings feedback running large language model. #434

Open elephantpanda opened 1 year ago

elephantpanda commented 1 year ago

Using ONNX Runtime with DirectML, here are my test results. Unfortunately, DirectML does not seem to be very good at running LLMs:

Notice how the times jump when you change the input size from 512 tokens to 511 tokens (even though the input is now smaller!). The test does 10 passes with 512 tokens, followed by 10 passes with 511 tokens. (The model used is Cerebras 111M.)
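The benchmark procedure above can be sketched as a small timing harness. This is a sketch in Python (the original tests were apparently run from C#); `run_fn` is a stand-in for the real model call, e.g. an ONNX Runtime `InferenceSession.run` on the DirectML execution provider:

```python
import time

def time_passes(run_fn, token_counts, passes=10):
    """Time repeated inference passes for each input length.

    run_fn(n) stands in for running the model on an input of n tokens.
    Returns a list of (token_count, seconds) tuples, in run order.
    """
    results = []
    for n in token_counts:
        for _ in range(passes):
            start = time.perf_counter()
            run_fn(n)
            results.append((n, time.perf_counter() - start))
    return results

# 10 passes at 512 tokens, then 10 passes at 511, as in the tables below.
timings = time_passes(lambda n: None, [512, 511])
```

Averaging the two halves of `timings` gives the per-size numbers reported below.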

DIRECTML RESULTS

float32 DirectML 
0.19ms 512
0.18ms 512
0.18ms 512
0.19ms 512
0.15ms 512
0.15ms 512
0.17ms 512
0.18ms 512
0.18ms 512
0.18ms 512
-------   (15% slowdown)
0.18ms 511
0.18ms 511
0.20ms 511
0.21ms 511
0.22ms 511
0.18ms 511
0.19ms 511
0.20ms 511
0.21ms 511
0.22ms 511

float16 DirectML
0.11ms 512
0.11ms 512
0.11ms 512
0.12ms 512
0.12ms 512
0.12ms 512
0.12ms 512
0.12ms 512
0.12ms 512
0.11ms 512
---------- (36% slowdown)
0.15ms 511
0.15ms 511
0.15ms 511
0.15ms 511
0.16ms 511
0.14ms 511
0.14ms 511
0.14ms 511
0.19ms 511
0.16ms 511

int8 static quantization
0.22ms 512
0.25ms 512
0.24ms 512
0.24ms 512
0.23ms 512
0.23ms 512
0.24ms 512
0.24ms 512
0.24ms 512
0.22ms 512
-------------- (33% slowdown)
0.33ms 511
0.32ms 511
0.30ms 511
0.30ms 511
0.32ms 511
0.30ms 511
0.30ms 511
0.32ms 511
0.32ms 511
0.31ms 511

int8 dynamic quantization
0.22ms 512
0.24ms 512
0.24ms 512
0.23ms 512
0.23ms 512
0.24ms 512
0.24ms 512
0.24ms 512
0.23ms 512
0.23ms 512
------------    (20% slowdown)
0.29ms 511
0.29ms 511
0.29ms 511
0.28ms 511
0.28ms 511
0.25ms 511
0.27ms 511
0.28ms 511
0.28ms 511
0.26ms 511

As you can see, in DirectML changing the input size leads to a significant slowdown that it never recovers from. When changing the input size several times, I have seen up to a 2x slowdown. Is there a way around this, for example by allocating sufficient memory on the GPU up front? I think this might be a DirectML-only problem.

Padding the input is not a great solution, because we want to take advantage of smaller inputs to reduce times.
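If the slowdown really is caused by buffer reallocation when the input shape changes, one possible mitigation (an assumption on my part, not a documented DirectML feature) is a single warm-up pass at the largest input length you expect, so that later, smaller inputs reuse the already-allocated buffers:

```python
# Hypothetical mitigation sketch: pay any allocation cost once, up front.
# `run_model(n)` stands in for the real inference call on n tokens.
def warm_up(run_model, max_tokens=512):
    run_model(max_tokens)
```

Whether this actually helps depends on whether DirectML reuses allocations across shape changes, which is exactly what this issue is asking about.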

CUDA RESULTS

float32-cuda
0.11ms 512
0.11ms 512
0.11ms 512
0.12ms 512
0.10ms 512
0.10ms 512
0.11ms 512
0.12ms 512
0.11ms 512
0.11ms 512
------------  0% slowdown
0.11ms 511
0.12ms 511
0.11ms 511
0.10ms 511
0.10ms 511
0.11ms 511
0.11ms 511
0.11ms 511
0.11ms 511
0.12ms 511

float16-cuda
0.07ms 512
0.06ms 512
0.07ms 512
0.07ms 512
0.07ms 512
0.07ms 512
0.07ms 512
0.07ms 512
0.07ms 512
0.06ms 512
------------- 0% slowdown
0.06ms 511
0.06ms 511
0.06ms 511
0.06ms 511
0.07ms 511
0.07ms 511
0.07ms 511
0.06ms 511
0.07ms 511
0.06ms 511

As you can see, for CUDA there is no slowdown when changing the input size, i.e. the number of tokens.

Any way to make DirectML run language models better?

elephantpanda commented 1 year ago

I have made a new discovery! 😯 (I am using ONNX Runtime with a dynamically quantized model.)

When you increase the input size one token at a time, the inference time slowly increases until the size reaches a power of 2, starting at 32:

32, 64, 128,....

Then there is a dramatic slowdown just past each of these input sizes.

So it seems like it is reasoning: "the user is using a dynamic input size, so I had better grow the input memory to the next power of 2."

Thus DirectML is allocating memory in terms of powers of 2.

This seems to slow down the inference as if the input is padded to the next power of 2 in size.

Whenever you increase the input size beyond the next power of two, you take roughly a 50% performance hit from then on.

This behaviour is not documented. Perhaps it is designed this way because it is based internally on power-of-2 texture sizes.

It would be nice if this could be turned off.

This is bad because if you have an input of size 129 it is as slow as an input of size 256 or even 512.
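The bucketing I am hypothesizing (again, not confirmed DirectML behaviour) can be sketched as: an input of n tokens costs roughly as much as the next power of 2 at or above n, with a floor of 32.

```python
# Sketch of the hypothesized power-of-2 bucketing (not confirmed
# DirectML behaviour): return the smallest power-of-2 bucket,
# starting from 32, that holds an input of n tokens.
def hypothesized_bucket(n, floor=32):
    size = floor
    while size < n:
        size *= 2
    return size
```

Under this hypothesis, an input of 129 tokens lands in the 256 bucket, which would explain the jumps just past 32 and 64 in the timings below.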

To replicate: run a loop of inferences, increasing the input size each time, and time each inference.

(I don't know if this is just a C# thing or a DirectML thing.)
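A sketch of that replication loop in Python (`run_model` stands in for the real quantized-model call via ONNX Runtime):

```python
import time

# Increase the input length by one token per inference and time each
# pass; timing jumps just past powers of 2 would support the
# bucketing hypothesis above.
def sweep(run_model, max_tokens=70):
    timings = []
    for n in range(1, max_tokens + 1):
        start = time.perf_counter()
        run_model(n)
        timings.append((n, time.perf_counter() - start))
    return timings
```

Running this against the model produced the timings below.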

Cerebras 1.3B, dynamically quantized to int8:

0.09ms 1 
0.18ms 2 🔥
0.17ms 3
0.17ms 4
0.17ms 5
0.18ms 6
0.17ms 7
0.18ms 8
0.20ms 9
0.17ms 10
0.18ms 11
0.18ms 12
0.18ms 13
0.18ms 14
0.19ms 15
0.19ms 16
0.22ms 17
0.20ms 18
0.21ms 19
0.21ms 20
0.21ms 21
0.20ms 22
0.20ms 23
0.21ms 24
0.21ms 25
0.21ms 26
0.20ms 27
0.20ms 28
0.21ms 29
0.22ms 30
0.21ms 31
0.21ms 32
0.29ms 33 🔥
0.26ms 34
0.27ms 35
0.26ms 36
0.26ms 37
0.27ms 38
0.26ms 39
0.27ms 40
0.28ms 41
0.27ms 42
0.27ms 43
0.27ms 44
0.27ms 45
0.28ms 46
0.27ms 47
0.27ms 48
0.28ms 49
0.27ms 50
0.28ms 51
0.29ms 52
0.28ms 53
0.28ms 54
0.28ms 55
0.28ms 56
0.29ms 57
0.29ms 58
0.29ms 59
0.29ms 60
0.28ms 61
0.30ms 62
0.30ms 63
0.28ms 64
0.40ms 65 🔥
0.39ms 66
0.39ms 67
0.39ms 68
0.39ms 69
0.39ms 70