migraphx-benchmark / AMDMIGraphX

AMD's graph optimization engine.
https://rocmsoftwareplatform.github.io/AMDMIGraphX/doc/html/
MIT License

End-to-End examples with MIGraphX #154

Open · attila-dusnoki-htec opened this issue 7 months ago

attila-dusnoki-htec commented 7 months ago

Tracking issue for creating complex end-to-end (E2E) example apps using MIGraphX.

attila-dusnoki-htec commented 7 months ago

Llama-2

As mentioned here

To test it with MIGraphX, we can update these two apps:

Currently, this version of Llama-2 does not compile.

attila-dusnoki-htec commented 7 months ago

Whisper

ORT version

The original repo uses PyTorch, but there is a repo here with an ONNX Runtime version. To use ONNX Runtime with MIGraphX (without modifying the code), ONNX Runtime can be built with MIGraphX enabled as an execution provider.
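
A minimal sketch of running an exported model through ONNX Runtime with the MIGraphX execution provider selected, assuming an ORT build that enables it; the model path, input shape, and 80-channel log-mel layout below are placeholders for illustration, not the actual Whisper export from this issue.

```python
# Minimal sketch (assumptions: ORT built with the MIGraphX EP; placeholder
# model path and input shape, not the actual Whisper export used here).
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "whisper_encoder.onnx",
    providers=["MIGraphXExecutionProvider", "CPUExecutionProvider"],
)

# Whisper encoders typically take 80-channel log-mel features; the shape here
# is only illustrative.
mel = np.zeros((1, 80, 3000), dtype=np.float32)
input_name = session.get_inputs()[0].name
outputs = session.run(None, {input_name: mel})
print(outputs[0].shape)
```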

Currently, only the encoder compiles. The decoder fails with the following error:

/code/AMDMIGraphX/src/onnx/parse_matmul.cpp:78: parse: PARSE_MATMUL: dynamic shape broadcasting not supported

Hugging Face

Model used

Exporting it with optimum requires modifications: the attention_mask input is not exposed by default.

A WIP example script to use the model.
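
For context, this is what the unmodified optimum export looks like; it does not include the attention_mask change mentioned above, and the openai/whisper-base checkpoint is only an assumption for illustration.

```python
# Default optimum ONNX export for Whisper (no attention_mask modification).
# The checkpoint name is an assumption; other Whisper checkpoints export the same way.
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

model = ORTModelForSpeechSeq2Seq.from_pretrained("openai/whisper-base", export=True)
model.save_pretrained("whisper-onnx")  # writes the encoder/decoder ONNX files
```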

attila-dusnoki-htec commented 7 months ago

Stable Diffusion

Hugging Face model

Python

PyTorch example in Python
MIGraphX example in Python

C++

GGML example in C++
MIGraphX example in C++
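
Independent of which example app is used, a compiled submodel can be exercised directly with the MIGraphX Python API. A minimal sketch, assuming the SD 2.1 UNet export path used later in this thread; the inputs are just generated data to check that the model compiles and runs.

```python
# Minimal sketch: compile and run one exported Stable Diffusion submodel with
# the MIGraphX Python API (path assumed from the verify command in this thread).
import numpy as np
import migraphx

prog = migraphx.parse_onnx("models/sd21-onnx/unet/model.onnx")
prog.compile(migraphx.get_target("gpu"))

# Fill every input with generated data just to exercise the compiled program;
# a real pipeline would pass latents, text embeddings, and the timestep here.
params = {}
for name, shape in prog.get_parameter_shapes().items():
    params[name] = migraphx.generate_argument(shape, 0)

outputs = prog.run(params)
print(np.array(outputs[0]).shape)  # arguments expose the buffer protocol
```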

Docker

This Dockerfile can be used for testing these examples.

attila-dusnoki-htec commented 7 months ago

FP16

Currently not everything works with half precision.
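
In the Python API, the half-precision path is migraphx.quantize_fp16 applied before compiling, which corresponds to the driver's --fp16 flag used below. A brief sketch, with the same assumed UNet path as in the earlier Stable Diffusion sketch:

```python
# Sketch of the fp16 path: quantize the parsed program to half precision before
# compiling (the Python-API counterpart of the driver's --fp16 flag).
import migraphx

prog = migraphx.parse_onnx("models/sd21-onnx/unet/model.onnx")
migraphx.quantize_fp16(prog)              # converts eligible ops to half_type
prog.compile(migraphx.get_target("gpu"))
```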

Stable Diffusion

TextEncoder and VAE-Decoder work

UNet produces NaNs

The problem occurs here:

MIGRAPHX_TRACE_EVAL=2 /code/AMDMIGraphX/build/bin/migraphx-driver verify models/sd21-onnx/unet/model.onnx --input-dim @sample 1 4 64 64 @encoder_hidden_states 1 77 1024 @timestep 1 --fp16 --fill1 timestep

Run instruction: @2979 = slice[axes={0},starts={0},ends={5}](@2976) -> half_type, {5, 4096, 64}, {262144, 64, 1}, target_id=0
Time: 0.01197ms, 0.01274ms
Output has normal
Output: -193.875, 156.375, 140.5, 96.4375, 141, ..., 92.8125, 43.6562, 40.5938, 103.25, -76.4375
Min value: -399.5, Max value: 466.5, Mean: 5.80163, StdDev: 146.348
Run instruction: @2980 = load[offset=190006400,end=357778560](@1) -> half_type, {5, 4096, 4096}, {16777216, 4096, 1}, target_id=0
Time: 0.00414ms, 0.00445ms
Run instruction: @2981 = gpu::gemm[alpha=0.125,beta=0,compute_fp32=1,trans_batch=0,solution_idx=0](@2979,@2978,@2980) -> half_type, {5, 4096, 4096}, {16777216, 4096, 1}, target_id=0
Time: 0.112007ms, 0.399038ms
Output has inf, normal
Output: -55616, -54784, -53408, -52640, -53888, ..., -23792, -23520, -22976, -22784, -22592
Min value: -inf, Max value: 28624, Mean: -inf, StdDev: -nan
Run instruction: @2982 = load[offset=22234240,end=190006400](@1) -> half_type, {5, 4096, 4096}, {16777216, 4096, 1}, target_id=0
Time: 0.00726ms, 0.00815ms
Run instruction: @2983 = gpu::code_object[code_object=10496,symbol_name=softmax_kernel,global=5242880,local=256,](@2981,@2982) -> half_type, {5, 4096, 4096}, {16777216, 4096, 1}, target_id=0
Time: 0.046599ms, 0.32677ms
Output has normal, nan, zero
Output: 0, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0
Min value: 0, Max value: 1, Mean: -nan, StdDev: -nan

@2968 = load[offset=9127040,end=11748480](@1) -> half_type, {1, 4096, 320}, {1310720, 320, 1}, target_id=0
@2969 = gpu::code_object[code_object=10768,symbol_name=add_reduce_mean_sub_pow_reduce_mean_add_sqrt_div_kernel,global=524288,local=128,](@2967,@2963,@2968) -> half_type, {1, 4096, 320}, {1310720, 320, 1}, target_id=0
@2970 = load[offset=22234240,end=30098560](@1) -> half_type, {1, 4096, 960}, {3932160, 960, 1}, target_id=0
@2971 = gpu::gemm[alpha=1,beta=1,compute_fp32=1,trans_batch=0,solution_idx=0](@2969,@2964,@2965,@2970) -> half_type, {1, 4096, 960}, {3932160, 960, 1}, target_id=0
@2972 = reshape_lazy[dims={1, 4096, 15, 64}](@2971) -> half_type, {1, 4096, 15, 64}, {3932160, 960, 64, 1}, target_id=0
@2973 = transpose[permutation={0, 2, 1, 3}](@2972) -> half_type, {1, 15, 4096, 64}, {3932160, 64, 960, 1}, target_id=0
@2974 = load[offset=14369920,end=22234240](@1) -> half_type, {1, 15, 4096, 64}, {3932160, 262144, 64, 1}, target_id=0
@2975 = gpu::code_object[code_object=8712,symbol_name=contiguous_kernel,global=983040,local=1024,](@2973,@2974) -> half_type, {1, 15, 4096, 64}, {3932160, 262144, 64, 1}, target_id=0
@2976 = reshape_lazy[dims={15, 4096, 64}](@2975) -> half_type, {15, 4096, 64}, {262144, 64, 1}, target_id=0
@2977 = slice[axes={0},starts={5},ends={10}](@2976) -> half_type, {5, 4096, 64}, {262144, 64, 1}, target_id=0
@2978 = transpose[permutation={0, 2, 1}](@2977) -> half_type, {5, 64, 4096}, {262144, 1, 64}, target_id=0
@2979 = slice[axes={0},starts={0},ends={5}](@2976) -> half_type, {5, 4096, 64}, {262144, 64, 1}, target_id=0
@2980 = load[offset=190006400,end=357778560](@1) -> half_type, {5, 4096, 4096}, {16777216, 4096, 1}, target_id=0
@2981 = gpu::gemm[alpha=0.125,beta=0,compute_fp32=1,trans_batch=0,solution_idx=0](@2979,@2978,@2980) -> half_type, {5, 4096, 4096}, {16777216, 4096, 1}, target_id=0
@2982 = load[offset=22234240,end=190006400](@1) -> half_type, {5, 4096, 4096}, {16777216, 4096, 1}, target_id=0
@2983 = gpu::code_object[code_object=10496,symbol_name=softmax_kernel,global=5242880,local=256,](@2981,@2982) -> half_type, {5, 4096, 4096}, {16777216, 4096, 1}, target_id=0
@2984 = slice[axes={0},starts={10},ends={15}](@2976) -> half_type, {5, 4096, 64}, {262144, 64, 1}, target_id=0
@2985 = load[offset=9127040,end=11748480](@1) -> half_type, {5, 4096, 64}, {262144, 64, 1}, target_id=0

Ref

The reference implementation shows the same issue:

fp32

Run instruction: @6525 = ref::dot(@6521,@6524) -> float_type, {5, 4096, 4096}, {16777216, 4096, 1}, target_id=0
Time: 1561.65ms, 1561.65ms
Output has normal
Output: -453921, -447240, -438036, -430549, -439314, ..., -192728, -190610, -185909, -184388, -183079
Min value: -742234, Max value: 230521, Mean: -368671, StdDev: 153924

fp16

Run instruction: @4528 = ref::dot(@4517,@4527) -> half_type, {5, 4096, 4096}, {16777216, 4096, 1}, target_id=0
Time: 1508.01ms, 1508.01ms
Output has normal, inf
Output: -inf, -inf, -inf, -inf, -inf, ..., -inf, -inf, -inf, -inf, -inf
Min value: -inf, Max value: inf, Mean: -nan, StdDev: -nan

Llama-2

Looking at the results, the fp32 run has inf and nan values in the output, but these come from the attention masking and are expected.

Run instruction: @640 = ref::multibroadcast[out_lens={1, 32, 256, 256},out_dyn_dims={}](@562) -> float_type, {1, 32, 256, 256}, {65536, 0, 256, 1}, target_id=0
Time: 0.00487ms, 0.00491ms
Output has inf, normal, zero
Output: 0, -3.40282e+38, -3.40282e+38, -3.40282e+38, -3.40282e+38, ..., -3.40282e+38, -3.40282e+38, -3.40282e+38, -3.40282e+38, -3.40282e+38
Min value: -inf, Max value: 0, Mean: -inf, StdDev: -nan
Run instruction: @641 = ref::contiguous(@640) -> float_type, {1, 32, 256, 256}, {2097152, 65536, 256, 1}, target_id=0
Time: 47.8585ms, 47.8587ms
Output has inf, normal, zero
Output: 0, -3.40282e+38, -3.40282e+38, -3.40282e+38, -3.40282e+38, ..., -3.40282e+38, -3.40282e+38, -3.40282e+38, -3.40282e+38, -3.40282e+38
Min value: -inf, Max value: 0, Mean: -inf, StdDev: -nan
Run instruction: @642 = ref::add(@639,@641) -> float_type, {1, 32, 256, 256}, {2097152, 65536, 256, 1}, target_id=0
Time: 8.60302ms, 8.60312ms
Output has inf, normal
Output: 2.53309, -3.40282e+38, -3.40282e+38, -3.40282e+38, -3.40282e+38, ..., -3.40282e+38, -3.40282e+38, -3.40282e+38, -3.40282e+38, -3.40282e+38
Min value: -inf, Max value: 9.41974, Mean: -inf, StdDev: -nan
Run instruction: @643 = ref::softmax[axis=3](@642) -> float_type, {1, 32, 256, 256}, {2097152, 65536, 256, 1}, target_id=0
Time: 7.95904ms, 7.95916ms
Output has zero, normal
Output: 1, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0
Min value: 0, Max value: 1, Mean: 0.00390625, StdDev: 0.0240073

The fp16 version shows the same behavior:

Run instruction: @355 = ref::multibroadcast[out_lens={1, 32, 256, 256},out_dyn_dims={}](@287) -> half_type, {1, 32, 256, 256}, {65536, 0, 256, 1}, target_id=0
Time: 0.0025ms, 0.00254ms
Output has inf, normal, zero
Output: 0, -65504, -65504, -65504, -65504, ..., -65504, -65504, -65504, -65504, -65504
Min value: -inf, Max value: 0, Mean: -inf, StdDev: -nan
Run instruction: @356 = ref::contiguous(@355) -> half_type, {1, 32, 256, 256}, {2097152, 65536, 256, 1}, target_id=0
Time: 43.9691ms, 43.9692ms
Output has inf, normal, zero
Output: 0, -65504, -65504, -65504, -65504, ..., -65504, -65504, -65504, -65504, -65504
Min value: -inf, Max value: 0, Mean: -inf, StdDev: -nan
Run instruction: @357 = ref::add(@354,@356) -> half_type, {1, 32, 256, 256}, {2097152, 65536, 256, 1}, target_id=0
Time: 2.98311ms, 2.98321ms
Output has zero, inf, normal
Output: 2.53516, -65472, -65472, -65472, -65472, ..., -65504, -65504, -65504, -65504, -65504
Min value: -inf, Max value: 9.42188, Mean: -inf, StdDev: -nan
Run instruction: @358 = ref::softmax[axis=3](@357) -> half_type, {1, 32, 256, 256}, {2097152, 65536, 256, 1}, target_id=0
Time: 5.48778ms, 5.48788ms
Output has zero, normal
Output: 1, 0, 0, 0, 0, ..., 0, 0, 0, 0, 0
Min value: 0, Max value: 1, Mean: 0.00390506, StdDev: 0.0239995
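
To illustrate why the masking output is expected: the mask fills non-attended positions with the most negative finite value of the type, the add then overflows to -inf, and softmax maps those positions to exactly 0 (the 1, 0, 0, ... rows above). A small NumPy check of the same arithmetic, using values from the traces:

```python
# Reproduces the masked-softmax pattern from the traces on a single row.
import numpy as np

# fp16: the mask value is the most negative finite half (-65504); adding any
# further negative attention score overflows to -inf, as reported in the trace.
print(np.float16(-65504) + np.float16(-100.0))   # -inf

# softmax turns -inf positions into exact zeros, so rows like [1, 0, 0, ...]
# are expected rather than a numerical bug.
row = np.array([2.5, -np.inf, -np.inf, -np.inf], dtype=np.float32)
probs = np.exp(row - row.max())
probs /= probs.sum()
print(probs)                                     # [1. 0. 0. 0.]
```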

But there are pow operations whose results get out of range at the lower precision.

The first 6 of 65 pows with fp32:

Run instruction: @566 = ref::pow(@563,@565) -> float_type, {1, 256, 4096}, {1048576, 4096, 1}, target_id=0
Time: 4.5869ms, 4.58699ms
Output has zero, normal
Output: 3.38076e-06, 1.45519e-05, 9.24105e-07, 8.58307e-06, 1.59824e-05, ..., 2.50679e-11, 5.13012e-12, 6.96332e-13, 4.14389e-11, 7.99361e-13
Min value: 0, Max value: 0.0178995, Mean: 4.2016e-06, StdDev: 6.058e-05

Run instruction: @657 = ref::pow(@654,@656) -> float_type, {1, 256, 4096}, {1048576, 4096, 1}, target_id=0
Time: 3.02639ms, 3.0265ms
Output has normal
Output: 9.25549e-05, 0.00138359, 5.57778e-05, 0.000400907, 0.00149003, ..., 5.14963e-05, 0.000458215, 1.88229e-06, 1.4133e-05, 0.000107119
Min value: 1.5283e-12, Max value: 0.564558, Mean: 0.000190222, StdDev: 0.0029691

Run instruction: @689 = ref::pow(@686,@688) -> float_type, {1, 256, 4096}, {1048576, 4096, 1}, target_id=0
Time: 3.2122ms, 3.21228ms
Output has normal
Output: 0.000135018, 0.000127227, 0.00313293, 0.0010952, 0.0218154, ..., 0.00161709, 0.20428, 6.98841e-05, 0.00672539, 0.00149917
Min value: 3.36059e-12, Max value: 22.6929, Mean: 0.0111376, StdDev: 0.255466

Run instruction: @780 = ref::pow(@777,@779) -> float_type, {1, 256, 4096}, {1048576, 4096, 1}, target_id=0
Time: 2.8067ms, 2.80677ms
Output has normal
Output: 0.000343903, 0.00054023, 0.00646422, 0.00035014, 0.0129346, ..., 6.03671e-05, 0.142149, 0.00201874, 0.00681912, 0.00107211
Min value: 2.72005e-15, Max value: 13.7998, Mean: 0.00849891, StdDev: 0.200751

Run instruction: @812 = ref::pow(@809,@811) -> float_type, {1, 256, 4096}, {1048576, 4096, 1}, target_id=0
Time: 3.50274ms, 3.50282ms
Output has normal
Output: 0.00488131, 0.261523, 0.00412025, 0.00380722, 0.00340024, ..., 1.33656e-05, 0.0975391, 0.00262303, 0.0078206, 0.00041745
Min value: 1.38778e-15, Max value: 569485, Mean: 2.79251, StdDev: 629.777

Run instruction: @903 = ref::pow(@900,@902) -> float_type, {1, 256, 4096}, {1048576, 4096, 1}, target_id=0
Time: 2.96483ms, 2.96493ms
Output has normal
Output: 0.00556073, 0.258198, 0.00398735, 0.00294377, 0.00266523, ..., 0.0058447, 0.124247, 1.0684e-05, 0.00872876, 0.000976011
Min value: 4.996e-16, Max value: 569493, Mean: 2.80136, StdDev: 629.854

And the same ones with fp16:

Run instruction: @290 = ref::pow(@271,@289) -> half_type, {1, 256, 4096}, {1048576, 4096, 1}, target_id=0
Time: 8.88927ms, 8.88934ms
Output has zero, normal
Output: 3.33786e-06, 1.45435e-05, 8.9407e-07, 8.58307e-06, 1.5974e-05, ..., 0, 0, 0, 0, 0
Min value: 0, Max value: 0.0178986, Mean: 4.20052e-06, StdDev: 6.05707e-05

Run instruction: @367 = ref::pow(@366,@289) -> half_type, {1, 256, 4096}, {1048576, 4096, 1}, target_id=0
Time: 1.82259ms, 1.82267ms
Output has zero, normal
Output: 9.22084e-05, 0.00138378, 5.56707e-05, 0.000400782, 0.00148964, ..., 5.126e-05, 0.00045681, 1.84774e-06, 1.40667e-05, 0.000106812
Min value: 0, Max value: 0.5625, Mean: 0.000189523, StdDev: 0.00295725

Run instruction: @392 = ref::pow(@391,@289) -> half_type, {1, 256, 4096}, {1048576, 4096, 1}, target_id=0
Time: 1.93096ms, 1.93115ms
Output has zero, normal
Output: 0.000134945, 0.000124693, 0.0031147, 0.00109196, 0.0217438, ..., 0.00160027, 0.203247, 6.96182e-05, 0.00668716, 0.00148964
Min value: 0, Max value: 22.5938, Mean: 0.0110913, StdDev: 0.254561

Run instruction: @453 = ref::pow(@452,@289) -> half_type, {1, 256, 4096}, {1048576, 4096, 1}, target_id=0
Time: 1.87675ms, 1.87683ms
Output has zero, normal
Output: 0.000343561, 0.000535011, 0.00643921, 0.000348806, 0.0128708, ..., 6.19888e-05, 0.141357, 0.00200653, 0.00676727, 0.00106621
Min value: 0, Max value: 13.75, Mean: 0.0084596, StdDev: 0.199949

Run instruction: @478 = ref::pow(@477,@289) -> half_type, {1, 256, 4096}, {1048576, 4096, 1}, target_id=0
Time: 1.87855ms, 1.87876ms
Output has zero, inf, normal
Output: 0.00484085, 0.259766, 0.00411987, 0.00378418, 0.00338936, ..., 1.39475e-05, 0.0970459, 0.00259972, 0.00775528, 0.000410557
Min value: 0, Max value: inf, Mean: inf, StdDev: -nan

Run instruction: @539 = ref::pow(@538,@289) -> half_type, {1, 256, 4096}, {1048576, 4096, 1}, target_id=0
Time: 1.90855ms, 1.90885ms
Output has zero, inf, normal
Output: 0.00484085, 0.259766, 0.00411987, 0.00378418, 0.00338936, ..., 0.00645065, 0.125488, 1.10269e-05, 0.00844574, 0.00102425
Min value: 0, Max value: inf, Mean: inf, StdDev: -nan
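
The inf values in the last two fp16 pows are consistent with fp16's range rather than a kernel bug: the corresponding fp32 runs report maxima around 5.7e5, well above the largest finite half-precision value (65504), so the same pow saturates to inf in half precision. A quick arithmetic check:

```python
# Why the later pows overflow in fp16: their fp32 maxima (~5.7e5 in the traces
# above) exceed the largest finite half-precision value.
import numpy as np

print(np.finfo(np.float16).max)               # 65504.0
print(np.float16(569485.0))                   # inf: not representable in fp16
print(np.float16(754.0) * np.float16(754.0))  # squaring a value near 754 overflows
```
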
attila-dusnoki-htec commented 5 months ago

These two issues are reported here: https://github.com/ROCm/AMDMIGraphX/issues/2555 and https://github.com/ROCm/AMDMIGraphX/issues/2556

The fix will be postponed until the float propagation issue is resolved.