microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Inference on Onnx with external data not working since PR 11320 (location planning logic) #11511

Closed pommedeterresautee closed 2 years ago

pommedeterresautee commented 2 years ago

Describe the bug

A recent PR, #11320, fixed a bug that made models with an If node slower when an input is consumed only by subgraphs of the If node. However, it seems to have introduced a bug that makes ONNX Runtime crash when the CUDA provider is used on a model with external data (> 2 GB models). The CPU provider is fine.

It may be related to PR #11320; the code below reproduces the bug.

As there are very few commits between 1.11.1 and PR #11320, that PR may have introduced the new behavior.

Error message:

2022-05-12 22:44:03.142806435 [E:onnxruntime:Default, cuda_call.cc:118 CudaCall] CUBLAS failure 14: CUBLAS_STATUS_INTERNAL_ERROR ; GPU=0 ; hostname=geantvert ; expr=cublasGemmHelper( Base::CublasHandle(), transB, transA, static_cast<int>(helper.N()), static_cast<int>(helper.M()), static_cast<int>(helper.K()), &alpha, reinterpret_cast<const CudaT*>(right_X->template Data<T>()), ldb, reinterpret_cast<const CudaT*>(left_X->template Data<T>()), lda, &zero, reinterpret_cast<CudaT*>(Y->template MutableData<T>()), ldc, device_prop); 
2022-05-12 22:44:03.142837979 [E:onnxruntime:, sequential_executor.cc:368 Execute] Non-zero status code returned while running MatMul node. Name:'MatMul_161' Status Message: CUBLAS error executing cublasGemmHelper( Base::CublasHandle(), transB, transA, static_cast<int>(helper.N()), static_cast<int>(helper.M()), static_cast<int>(helper.K()), &alpha, reinterpret_cast<const CudaT*>(right_X->template Data<T>()), ldb, reinterpret_cast<const CudaT*>(left_X->template Data<T>()), lda, &zero, reinterpret_cast<CudaT*>(Y->template MutableData<T>()), ldc, device_prop)
2022-05-12 22:44:03.142899439 [E:onnxruntime:Default, cuda_call.cc:118 CudaCall] CUDA failure 700: an illegal memory access was encountered ; GPU=0 ; hostname=geantvert ; expr=cudaEventRecord(current_deferred_release_event, static_cast<cudaStream_t>(GetComputeStream())); 

Urgency: if the bug is reproduced by ONNX Runtime maintainers, it should probably be fixed before the next ONNX Runtime release.

System information

To Reproduce: The code below will generate the ONNX file and raise the error message.

from pathlib import Path
from typing import Tuple
import numpy as np
import torch
from torch.nn import Linear
from transformers import AutoModelForSeq2SeqLM, T5ForConditionalGeneration
from transformers.models.t5.modeling_t5 import T5Stack
from onnxruntime import SessionOptions, InferenceSession

model_name = "t5-3b"
model: T5ForConditionalGeneration = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model = model.eval()
num_layers = model.config.num_layers

class ExportT5(torch.nn.Module):
    def __init__(self, decoder: T5Stack, lm_head: Linear):
        super(ExportT5, self).__init__()
        self.decoder = decoder
        self.lm_head = lm_head

    def forward(self, input_ids: torch.Tensor, encoder_hidden_states: torch.Tensor, past_key_values: Tuple = None):
        out_dec = self.decoder.forward(
            input_ids=input_ids, encoder_hidden_states=encoder_hidden_states, past_key_values=past_key_values
        )
        # Rescale output before projecting on vocab
        out_dec["last_hidden_state"] = out_dec["last_hidden_state"] * (model.model_dim**-0.5)
        out_dec["last_hidden_state"] = self.lm_head(out_dec["last_hidden_state"])
        return out_dec

model_decoder = ExportT5(decoder=model.decoder, lm_head=model.lm_head).eval()

def prepare_folder(path: str):
    Path(path).mkdir(parents=True, exist_ok=True)
    # remove any leftover model/external-data files from a previous export
    for item in Path(path).glob("*"):
        if item.is_file():
            item.unlink()

dec_no_cache_folder = "./test-dec-no-cache"
dec_no_cache_model_path = dec_no_cache_folder + "/model.onnx"
prepare_folder(path=dec_no_cache_folder)

model_inputs = {
    "input_ids": torch.ones([1, 10], dtype=torch.int32, device="cpu"),
    "encoder_hidden_states": torch.ones([1, 2, 1024], dtype=torch.float32, device="cpu")
}

output_names = ["logits"]

for i in range(num_layers):
    output_names.append(f"present.{i}.decoder.key")
    output_names.append(f"present.{i}.decoder.value")
    output_names.append(f"present.{i}.encoder.key")
    output_names.append(f"present.{i}.encoder.value")

dynamic_axis = {
    "input_ids": {0: "batch", 1: "encoder_sequence"},
    "encoder_hidden_states": {0: "batch", 1: "encoder_sequence"},
    "logits": {0: "batch", 1: "decoder_sequence"},
}

for i in range(num_layers):
    dynamic_axis[f"present.{i}.decoder.key"] = {0: "batch", 2: "decoder_sequence"}
    dynamic_axis[f"present.{i}.decoder.value"] = {0: "batch", 2: "decoder_sequence"}
    dynamic_axis[f"present.{i}.encoder.key"] = {0: "batch", 2: "encoder_sequence_length"}
    dynamic_axis[f"present.{i}.encoder.value"] = {0: "batch", 2: "encoder_sequence_length"}

with torch.no_grad():
    model.config.return_dict = True
    model.eval()
    # PyTorch 1.11 automatically stores the weights as external data when the model exceeds the 2 GB protobuf limit
    torch.onnx.export(
        model_decoder,
        (model_inputs,),
        f=dec_no_cache_model_path,
        input_names=list(model_inputs.keys()),
        output_names=output_names,
        dynamic_axes=dynamic_axis,
        do_constant_folding=True,
        opset_version=13,
    )
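
# Not part of the original report: a quick sanity check (assuming the standalone
# onnx package is installed) that the exported model really stores its weights
# as external data, since that is what triggers the bug.
import onnx
from onnx.external_data_helper import uses_external_data

exported = onnx.load(dec_no_cache_model_path, load_external_data=False)  # keep the large tensors on disk
n_external = sum(uses_external_data(init) for init in exported.graph.initializer)
print(f"{n_external} initializers are stored as external data")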

# CPU provider: always works
options = SessionOptions()
model = InferenceSession(dec_no_cache_model_path, options, providers=["CPUExecutionProvider"])
print(model.run(None, {"input_ids": np.ones([1, 10], dtype=np.int32), "encoder_hidden_states": np.ones([1, 1, 1024], dtype=np.float32)}))

# CUDA provider: fails with the CUBLAS/CUDA errors shown above
model = InferenceSession(dec_no_cache_model_path, options, providers=["CUDAExecutionProvider"])
print(model.run(None, {"input_ids": np.ones([1, 10], dtype=np.int32), "encoder_hidden_states": np.ones([1, 1, 1024], dtype=np.float32)}))

Expected behavior: no crash.

Screenshots: N/A

Additional context: tagging @hariharans29, as he seems to know a lot about this topic.

hariharans29 commented 2 years ago

That is strange. Can you try reverting the change, or building off a commit before mine?

"As there is very little commits between 1.11.1 and PR https://github.com/microsoft/onnxruntime/pull/11320...." -- This isn't true. There are quite a few commits that didn't make it into 1.11.1 that could have caused this. Please try with a commit before #11320.

pommedeterresautee commented 2 years ago

Thank you @hariharans29 for your fast answer. You are right, and I am sorry for my misleading statement; there have been plenty of commits in between. I didn't realize until recently that, for released versions, very recent PRs are cherry-picked...

So I just rebuilt from fdce4fa6af437b0b822958ab47b3b8f77f9e14ae, which is the last commit in master before the merge of #11320.

Command line used to build:

# git clone ...
git checkout -b before_11320 fdce4fa6af437b0b822958ab47b3b8f77f9e14ae
CUDACXX=/usr/local/cuda-11.4/bin/nvcc ./build.sh \
    --config Release \
    --build_wheel \
    --parallel \
    --use_cuda \
    --cuda_home /usr/local/cuda-11.4 \
    --cudnn_home /usr/lib/x86_64-linux-gnu/ \
    --skip_test
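
As a side note (not from the original thread), once the resulting wheel is pip-installed, a quick Python check can confirm that the self-built package is the one actually being imported and that the CUDA provider was compiled in:

import onnxruntime as ort

print(ort.__version__)                # a source build typically reports a dev version string
print(ort.get_device())               # expected to be "GPU" for a --use_cuda build
print(ort.get_available_providers())  # expected to include "CUDAExecutionProvider"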

The same error occurs, so it is not related to PR #11320; again, sorry for my misleading statement.

To be sure it's not a compilation issue on my side, I recompiled version 1.11.1 a second time and there is no bug, so it is one of the many commits in between. Do you have an idea which PR I should test?

pommedeterresautee commented 2 years ago

After a bunch of compilations, it seems that the issue appears with PR #11127 (which is related to external data). Linked issue: https://github.com/microsoft/onnxruntime/issues/10977. Tagging to notify: @IkerAriz (PR author), @snnn (reviewer).

To reach this conclusion, I performed the following compilations (master branch):

Please let me know if you can reproduce the error message.

snnn commented 2 years ago

Thanks. I will take a look.

pommedeterresautee commented 2 years ago

@snnn, in case you have had time to work on this issue, have you been able to reproduce it? If so, did you find a way to revert the commit and still have the code compile and work as expected with external data?

snnn commented 2 years ago

I tried a model: tf_inception_v1. It works fine on CPU.

pommedeterresautee commented 2 years ago

Hi, thank you for your test. It's the same for me: the crash only occurs with the CUDA provider.

snnn commented 2 years ago

Looking.

snnn commented 2 years ago

Thank you, I saw it. The weight buffers were not filled.

IkerAriz commented 2 years ago

Hi Changming. Which weight buffers were unfilled?

snnn commented 2 years ago

You can get a model from https://github.com/tensorflow/models/tree/master/research/slim, for example resnet50. Convert it to ONNX using the TF-to-ONNX converter, then use the ONNX API to split the weights out as external data, and run it with CUDA. I think the problem is then 100% reproducible.
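
For reference, the weight-splitting step mentioned above can be done with the onnx Python API. A minimal sketch, assuming an ONNX file already produced with the TF-to-ONNX converter (file names here are illustrative):

import onnx

# Rewrite the model so every initializer is stored in a separate external-data file.
model = onnx.load("resnet50.onnx")
onnx.save_model(
    model,
    "resnet50_external.onnx",
    save_as_external_data=True,
    all_tensors_to_one_file=True,
    location="resnet50_external.onnx.data",
    size_threshold=0,  # externalize all initializers, not only the large ones
)

# Loading "resnet50_external.onnx" with the CUDA provider should then hit the bug.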

pommedeterresautee commented 2 years ago

Thank you very much @snnn for the revert. I can confirm that the current master branch, compiled with the PR reverted, also works on my side with several > 2 GB NLP models.

kiennguyen94 commented 2 years ago

Hi @snnn, #11789 has been submitted to address this issue and reinstate the mmap copy bypass. Please take a look if you have a chance. Thank you 👍