onnx / onnx-tensorrt

ONNX-TensorRT: TensorRT backend for ONNX

Einsum_110: einsum input tensor 0 has invalid type Int32 #837

Open oborchers opened 2 years ago

oborchers commented 2 years ago

Description

This issue is a follow-up to #818, which I also created. I am working on the transformer-deploy repository and created a PR that enables support for exporting larger transformer models to TensorRT.

This works well with gpt2-medium, gpt-neo-1.3b, and gpt-neo-2.7b. However, for GPT-J I am running into the following issue:

[04/29/2022-14:36:29] [TRT] [W] onnx2trt_utils.cpp:392: One or more weights outside the range of INT32 was clamped
[04/29/2022-14:36:29] [TRT] [W] onnx2trt_utils.cpp:392: One or more weights outside the range of INT32 was clamped
[04/29/2022-14:36:29] [TRT] [W] onnx2trt_utils.cpp:392: One or more weights outside the range of INT32 was clamped
[04/29/2022-14:36:29] [TRT] [E] [layers.cpp::validate::5677] Error Code 3: Internal Error (Einsum_110: einsum input tensor 0 has invalid type Int32)
[04/29/2022-14:36:29] [TRT] [W] building engine. depending on model size this may take a while
[04/29/2022-14:36:29] [TRT] [E] 4: [network.cpp::validate::2633] Error Code 4: Internal Error (Network must have at least one output)
[04/29/2022-14:36:29] [TRT] [E] 2: [builder.cpp::buildSerializedNetwork::609] Error Code 2: Internal Error (Assertion enginePtr != nullptr failed. )
Traceback (most recent call last):
  File "/usr/local/bin/convert_model", line 8, in <module>
    sys.exit(entrypoint())
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 386, in entrypoint
    main(commands=args)
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/convert.py", line 288, in main
    engine: ICudaEngine = build_engine(
  File "/usr/local/lib/python3.8/dist-packages/transformer_deploy/backends/trt_utils.py", line 132, in build_engine
    engine: ICudaEngine = runtime.deserialize_cuda_engine(trt_engine)
TypeError: deserialize_cuda_engine(): incompatible function arguments. The following argument types are supported:
    1. (self: tensorrt.tensorrt.Runtime, serialized_engine: buffer) -> tensorrt.tensorrt.ICudaEngine

Invoked with: <tensorrt.tensorrt.Runtime object at 0x7fc959cd7c70>, None

As all other models work well, I assume this might be directly related to TRT? (The final TypeError looks like a downstream symptom: after the Einsum validation error, buildSerializedNetwork returns None, which is then passed to deserialize_cuda_engine.)

Environment

TensorRT Version: 8.2.2-1
ONNX-TensorRT Version / Branch: 8.2.2.1
GPU Type: V100
Nvidia Driver Version: 495.29.05
CUDA Version: 11.5
CUDNN Version:
Operating System + Version: Ubuntu 20.04.3 LTS
Python Version (if applicable): 3.8.10
TensorFlow + TF2ONNX Version (if applicable): NA
PyTorch Version (if applicable): 1.10.2+cu113
Baremetal or Container (if container which image + tag): See below

Steps To Reproduce

  1. Clone this PR: https://github.com/ELS-RD/transformer-deploy/pull/67
  2. cd into main folder
  3. docker build -t tfdeploy .
  4. Run:
docker run -it --rm --shm-size=24g --ulimit memlock=-1 --ulimit stack=67108864 --gpus device=0 \
  -v $PWD:/project tfdeploy \
  bash -c "cd /project && \
    convert_model -m 'EleutherAI/gpt-j-6B' \
    --backend tensorrt \
    --seq-len 1 128 128 \
    --fast \
    --task text-generation"
oborchers commented 2 years ago

I am tagging @kevinch-nv and @yuanyao-nv because of their excellent help last time 🚀

kevinch-nv commented 2 years ago

This is unfortunately a known limitation of the Einsum layer in TRT - we only support floating-point types for Einsum equations.

Do you know which operation this Einsum equation implies? Perhaps we can substitute the Einsum with an equivalent op.

oborchers commented 2 years ago

@kevinch-nv: Thanks for the reply! I've been looking at the source code and, as far as I can tell, the only einsum op in the model is actually a float one:

import torch

def fixed_pos_embedding(x, seq_dim=1, seq_len=None):
    dim = x.shape[-1]
    if seq_len is None:
        seq_len = x.shape[seq_dim]
    # Rotary embedding frequencies; both einsum operands are float here.
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2) / dim))
    sinusoid_inp = (
        torch.einsum("i , j -> i j", torch.arange(seq_len, dtype=torch.float), inv_freq).to(x.device).float()
    )
    return torch.sin(sinusoid_inp), torch.cos(sinusoid_inp)

Am I missing something obvious?
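
For what it's worth, "i , j -> i j" is just an outer product, so an einsum-free rewrite of that line should be possible (a sketch, not tested against the export):

# Equivalent to torch.einsum("i , j -> i j", a, b): an outer product
# via broadcasting, which avoids emitting an Einsum node at export time.
sinusoid_inp = (
    torch.arange(seq_len, dtype=torch.float)[:, None] * inv_freq[None, :]
).to(x.device).float()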

kevinch-nv commented 2 years ago

It's possible that one of the inputs is being incorrectly interpreted as INT32. Can you provide the converted .onnx model?
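
In the meantime, one way to check which tensors feed the failing Einsum is to run ONNX type inference over the exported graph. A minimal sketch (the model path is a placeholder):

import onnx
from onnx import shape_inference

# Load the exported model and run shape/type inference so that
# intermediate tensors carry dtype information.
model = onnx.load("model.onnx")  # placeholder path
inferred = shape_inference.infer_shapes(model)

# Map tensor name -> ONNX elem_type from graph inputs, outputs,
# inferred intermediates, and initializers.
dtypes = {}
for vi in list(inferred.graph.input) + list(inferred.graph.output) + list(inferred.graph.value_info):
    dtypes[vi.name] = vi.type.tensor_type.elem_type
for init in inferred.graph.initializer:
    dtypes[init.name] = init.data_type

# Print the input dtypes of every Einsum node; INT32 is elem_type 6.
for node in inferred.graph.node:
    if node.op_type == "Einsum":
        print(node.name, [(name, dtypes.get(name)) for name in node.input])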

moyix commented 2 years ago

@oborchers As a (slightly hacky) workaround: since fixed_pos_embedding does not depend on the input, you can precompute it for each dim you need up to a maximum sequence length, and then use something like this:

def fixed_pos_embedding(x, seq_dim=1, seq_len=None):
    dim = x.shape[-1]
    if seq_len is None:
        seq_len = x.shape[seq_dim]
    s = torch.load(f'sin_pos_{dim}.pt').to(x.device)
    c = torch.load(f'cos_pos_{dim}.pt').to(x.device)
    # Truncate to seq_len
    s = s[:seq_len]
    c = c[:seq_len]
    return s, c
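
The precompute step itself can be a one-off script along these lines (a sketch: the function name and the max_seq_len of 2048 are my own choices; the math and file names match the snippets above):

import torch

def precompute_pos_embedding(dim, max_seq_len=2048):
    # Same computation as fixed_pos_embedding, evaluated once offline.
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2) / dim))
    sinusoid_inp = torch.einsum(
        "i , j -> i j", torch.arange(max_seq_len, dtype=torch.float), inv_freq
    ).float()
    torch.save(torch.sin(sinusoid_inp), f"sin_pos_{dim}.pt")
    torch.save(torch.cos(sinusoid_inp), f"cos_pos_{dim}.pt")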
PoodleWang commented 1 year ago

@oborchers Did you solve this issue? I got the same problem with Salesforce/codegen-16b, which also uses fixed_pos_embedding. However, my colleagues could run the TensorRT engine successfully for codegen-350m.