microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

run_with_iobinding is not outputting the expected result for batched input data for T5 model running on ort CUDA EP #13463

Open Quangnguyengiabku opened 1 year ago

Quangnguyengiabku commented 1 year ago

Describe the issue

Hey, I've implemented code for running T5 models on ONNX Runtime with the CUDA EP. To avoid copying data between devices I'm using io_binding: I bind the inputs with io_binding.bind_input(), pre-calculate the output shapes, create empty tensors (buffers) to hold the output values, and bind them with io_binding.bind_output(). Then I run the inference session with run_with_iobinding(io_binding).

With this setup I was able to get a ~2x speedup over the PyTorch model on GPU (T4) for standard T5 models. The output is as expected for a single sequence, i.e. batch_size=1, but for batch_size>1 I'm getting wrong outputs. For example (just for demonstration; we use T5 for a different task):

    batch_text_input = [
        "translate English to French: Even when you remove the bright stars, the glowing dust, and other nearby points ",
        "translate English to French: of light from the inky, dark sky, a background glow remains.",
    ]

Output from the ONNX model running on ORT (CUDA EP) with io_binding:

    ["Même lorsque vous enlevez les étoiles brillantes, la poussière éclairante et d'autres points situés à proximité", ' ']

Output of InferenceSession.run():

["Même lorsque vous enlevez les étoiles brillantes, la poussière éclairante et d'autres points situés à proximité", 'd’une lumière du ciel foncé et enkyle, un éclat de fond demeure.']

As you can see from the example above, run_with_iobinding gives the right output only for the first index of the input batch and wrong results from the second index onwards. I tried binding all the input and output tensors as contiguous, as @Ki6an did in https://github.com/microsoft/onnxruntime/issues/10992, but that didn't fix it. I also checked the outputs of the decoder and the decoder_init: the decoder_init outputs (logits and past_kv) are right, but the decoder outputs are wrong. How do I fix this issue? Any help would be appreciated, thank you.
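For narrowing this down, it can help to compare, at a single decoding step, a plain InferenceSession.run() against a run_with_iobinding call in which ORT allocates the outputs itself. A minimal sketch, assuming a hypothetical decoder_session (an InferenceSession) and feeds (a dict of CPU NumPy inputs for one step) — neither name comes from the code below:

    import numpy as np

    # Reference: plain run(); ORT handles all device transfers itself.
    reference = decoder_session.run(None, feeds)

    # Same step through io_binding, but with ORT-allocated output buffers.
    io_binding = decoder_session.io_binding()
    for name, arr in feeds.items():
        io_binding.bind_cpu_input(name, arr)      # ORT copies inputs to the device
    for out in decoder_session.get_outputs():
        io_binding.bind_output(out.name, 'cuda')  # ORT allocates the device output
    decoder_session.run_with_iobinding(io_binding)
    bound = io_binding.copy_outputs_to_cpu()

    # Any mismatch here points at the binding setup rather than the model.
    for out, ref, got in zip(decoder_session.get_outputs(), reference, bound):
        np.testing.assert_allclose(ref, got, rtol=1e-3, atol=1e-3, err_msg=out.name)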

To reproduce

    # Requires: import numpy as np; import torch
    # self.trt_context is the ONNX Runtime InferenceSession for the decoder;
    # input_ids, encoder_attention_mask, encoder_hidden_states, and
    # past_key_values are torch CUDA tensors from the previous step.
    DEVICE_NAME = 'cuda'
    DEVICE_INDEX = 0
    binding = self.trt_context.io_binding()

    # Make sure every tensor is contiguous before taking its data pointer.
    input_ids_tensor = input_ids.contiguous()
    encoder_attention_mask_tensor = encoder_attention_mask.contiguous()
    encoder_hidden_states_tensor = encoder_hidden_states.contiguous()
    assert input_ids_tensor.is_contiguous()
    binding.bind_input(
        name='input_ids',
        device_type=DEVICE_NAME,
        device_id=DEVICE_INDEX,
        element_type=np.int32,
        shape=tuple(input_ids.shape),
        buffer_ptr=input_ids_tensor.data_ptr(),
        )
    assert encoder_hidden_states_tensor.is_contiguous()
    binding.bind_input(
        name='encoder_hidden_states',
        device_type=DEVICE_NAME,
        device_id=DEVICE_INDEX,
        element_type=np.float32,
        shape=tuple(encoder_hidden_states.shape),
        buffer_ptr=encoder_hidden_states_tensor.data_ptr(),
        )
    assert encoder_attention_mask_tensor.is_contiguous()
    binding.bind_input(
        name='encoder_attention_mask',
        device_type=DEVICE_NAME,
        device_id=DEVICE_INDEX,
        element_type=np.int32,
        shape=tuple(encoder_attention_mask.shape),
        buffer_ptr=encoder_attention_mask_tensor.data_ptr(),
        )
    # Bind the 24 past key/value tensors: 6 decoder layers x
    # (self-attn K, self-attn V, cross-attn K, cross-attn V).
    dict_pkv_in = {}
    dict_pkv = {}
    for i in range(0, 24):
        name_pkv_in = "pkv_{}.1".format(i)
        name_pkv = "pkv_{}".format(i)
        dict_pkv_in[name_pkv_in] = past_key_values[i]
        assert dict_pkv_in[name_pkv_in].is_contiguous()
        binding.bind_input(
            name=name_pkv_in,
            device_type=DEVICE_NAME,
            device_id=DEVICE_INDEX,
            element_type=np.float32,
            shape=tuple(past_key_values[i].shape),
            buffer_ptr=dict_pkv_in[name_pkv_in].data_ptr(),
        )
        # i % 4 in {2, 3} selects the cross-attention K/V, whose sequence
        # length stays fixed at the encoder length; i % 4 in {0, 1} selects
        # the self-attention K/V, which grow by one token per decoding step.
        if i % 4 > 1:
            shape = (past_key_values[i].shape[0], 8, past_key_values[i].shape[2], 64)
        else:
            shape = (past_key_values[i].shape[0], 8, past_key_values[i].shape[2] + 1, 64)
        dict_pkv[name_pkv] = torch.empty(shape, dtype=torch.float32, device=DEVICE_NAME).contiguous()

        binding.bind_output(
            name=name_pkv,
            device_type=DEVICE_NAME,
            device_id=DEVICE_INDEX,
            element_type=np.float32,
            shape=shape,
            buffer_ptr=dict_pkv[name_pkv].data_ptr(),
        )
    # Output buffer for the logits: (batch_size, 1, vocab_size);
    # vocab is the T5 vocabulary size (32100).
    logits = torch.empty((encoder_attention_mask.shape[0], 1, vocab), dtype=torch.float32, device=DEVICE_NAME).contiguous()

    binding.bind_output(
        name='logits',
        device_type=DEVICE_NAME,
        device_id=DEVICE_INDEX,
        element_type=np.float32,
        shape=(encoder_attention_mask.shape[0], 1, vocab),
        buffer_ptr=logits.data_ptr(),
    )
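
The snippet stops before the actual run. For completeness, a minimal sketch of the missing step, assuming (as above) that self.trt_context is the decoder InferenceSession:

    # Run with the bindings set up above; synchronizing before touching the
    # bound torch buffers from the PyTorch side is a cheap safety measure,
    # since ORT runs on its own CUDA stream.
    self.trt_context.run_with_iobinding(binding)
    torch.cuda.synchronize()
    next_token = logits.argmax(dim=-1)  # results are already in the bound buffer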

Urgency

No response

Platform

Linux

OS Version

Ubuntu 20.04

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.9.0

ONNX Runtime API

Python

Architecture

X86

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.4

yuslepukhin commented 1 year ago

Are you running on 32-bit (x86 architecture)? It would be nice to see if things run correctly without binding. In that case, ORT would automatically copy the outputs from the device to the CPU, and one could inspect them. You can also try not binding the output and see what shapes and data ORT produces.
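
(A short sketch of that last suggestion, with session standing in for the decoder InferenceSession: give bind_output only a name and a device, so ORT allocates the buffer and you can inspect the shape it actually produced.)

    # Bind inputs as before, but let ORT allocate every output buffer.
    for out in session.get_outputs():
        binding.bind_output(out.name, 'cuda', 0)   # no shape or buffer_ptr
    session.run_with_iobinding(binding)
    for out, arr in zip(session.get_outputs(), binding.copy_outputs_to_cpu()):
        print(out.name, arr.shape)                 # shapes and data ORT produced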

Quangnguyengiabku commented 1 year ago

> Are you running on 32-bit (x86 architecture)? It would be nice to see if things run correctly without binding. In that case, ORT would automatically copy the outputs from the device to the CPU, and one could inspect them.

Thank you for your comments, @yuslepukhin. I have been running it on 64-bit (x86_64). I also ran it without binding on CUDAExecutionProvider with InferenceSession.run(), and it gives the right outputs, but the performance is too low because the data must be transferred between the CPU and GPU. So I need to run it with IOBinding.

yuslepukhin commented 1 year ago

Okay; however, to verify the outputs you still need to copy them to the CPU. How did you come to the conclusion that the outputs are not correct?

Quangnguyengiabku commented 1 year ago

@yuslepukhin I check the output of each decoder pass, as well as the final results (I gave an example above), with a simple print. What is the right way to check the output?

yuslepukhin commented 1 year ago

> @yuslepukhin I check the output of each decoder pass, as well as the final results (I gave an example above), with a simple print. What is the right way to check the output?

To check the output, you would need to copy it to the host memory, unless you have written a CUDA kernel that does it for you directly in the device memory.

Let me know if I understand this correctly:

1. When you run without binding the output, and onnxruntime performs the copy from device memory to host memory, the output is correct.
2. When you copy it yourself and check, the output is not correct.

Is this a correct understanding?
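
(From the Python side, that copy can be as simple as the following sketch, assuming the bound output buffers are the torch tensors from the snippet above.)

    torch.cuda.synchronize()               # make sure the ORT run has finished
    print(logits.detach().cpu().numpy())   # copy the bound device buffer to host
    # For ORT-allocated outputs, binding.copy_outputs_to_cpu() does the same.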

Ki6an commented 1 year ago

@Quangnguyengiabku did you bind the output logits (it has a different shape than pkv) to io_binding for the decoder? I don't see it in your code.

Quangnguyengiabku commented 1 year ago

> To check the output, you would need to copy it to the host memory, unless you have written a CUDA kernel that does it for you directly in the device memory.
>
> Let me know if I understand this correctly: 1) When you run without binding the output, and onnxruntime performs the copy from device memory to host memory, the output is correct. 2) When you copy it yourself and check, the output is not correct.
>
> Is this a correct understanding?

When I run without binding, the output is automatically copied to host memory (as a dict of NumPy arrays), and when I check it, the output is correct. When I run with IOBinding, I check by copying the output from the device to host memory, and the output is not correct.

Quangnguyengiabku commented 1 year ago

> @Quangnguyengiabku did you bind the output logits (it has a different shape than pkv) to io_binding for the decoder? I don't see it in your code.

Hi @Ki6an, the snippet I uploaded was incomplete (I've just updated it; see above). I also bind the output logits with shape (batch_size, 1, vocab_size), but the output is still incorrect.

Ki6an commented 1 year ago

@Quangnguyengiabku I've got this working on onnxruntime-gpu==1.11.0; maybe you can switch to that. Also, if possible, can you share the incorrect decoder output you are getting?

Quangnguyengiabku commented 1 year ago

> @Quangnguyengiabku I've got this working on onnxruntime-gpu==1.11.0; maybe you can switch to that. Also, if possible, can you share the incorrect decoder output you are getting?

Dear @Ki6an, I printed the outputs when running with batch_size=4. Here are the output logits from some decoder passes.

First decoder forward:

    tensor([[[ 2.2034,  5.2962,  2.0083,  ..., -2.4754, -2.3684, -3.8110]],

            [[15.8188, 33.4472,  5.6914,  ..., -1.3325, -1.4620,  3.2235]],

            [[ 2.2034,  5.2962,  2.0083,  ..., -2.4754, -2.3684, -3.8110]],

            [[15.8188, 33.4472,  5.6914,  ..., -1.3325, -1.4620,  3.2235]]], device='cuda:0')

Second decoder forward:

    tensor([[[ 0.0679,  1.2573,  9.2988,  ..., -4.0065, -3.8124, -5.6398]],

            [[15.8188, 33.4472,  5.6914,  ..., -1.3325, -1.4620,  3.2235]],

            [[ 2.0916,  4.9050,  3.0106,  ..., -2.7321, -2.5865, -3.9142]],

            [[15.8188, 33.4472,  5.6914,  ..., -1.3325, -1.4620,  3.2235]]], device='cuda:0')

Third decoder forward:

    tensor([[[ 8.8772e-02,  3.2717e-02,  8.5838e+00,  ..., -3.5838e+00,
              -2.9585e+00, -5.1454e+00]],

            [[ 1.5819e+01,  3.3447e+01,  5.6914e+00,  ..., -1.3325e+00,
              -1.4620e+00,  3.2235e+00]],

            [[ 2.1206e+00,  4.7799e+00,  3.9673e+00,  ..., -3.0001e+00,
              -2.8041e+00, -4.0208e+00]],

            [[ 1.5819e+01,  3.3447e+01,  5.6914e+00,  ..., -1.3325e+00,
              -1.4620e+00,  3.2235e+00]]], device='cuda:0')

And the final output (input_ids) is:

    tensor([[    0,     1,  2951,   333,   585,   358,   326,   864,  9049,   263,     2,     0,     0,     0,     0,     0,     0,     0,     0,     0],
            [    0,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1],
            [    0,     1,  2951,  2951,  2951, 15670, 15670, 15670, 15670, 15670, 15670,  2951,  2951,  2951,  2951,  2951,  2951,  2951,  2951,  2951],
            [    0,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1,     1]], device='cuda:0')

I also updated onnxruntime-gpu to 1.11.0, but it still gives me the wrong output.
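
(When swapping wheels, it is worth double-checking that the new version and the CUDA EP are actually in use; a small sketch, with session standing in for your InferenceSession:)

    import onnxruntime as ort
    print(ort.__version__)          # should now report 1.11.0
    print(session.get_providers())  # 'CUDAExecutionProvider' should be listed first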