microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Memcopy (Host->Device) very slow on TX2 with Jetpack 4.5 #6783

Open Fred-Erik opened 3 years ago

Fred-Erik commented 3 years ago

Describe the bug
A memcopy from CPU to GPU that apparently has to be done takes 1.44 seconds, while the total inference time for the computation is 1.60 seconds. This issue occurs on a Jetson TX2 with Jetpack 4.5 (CUDA 10.2, cuDNN 8.0), but not on a Jetson TX2 with Jetpack 4.2.1 (CUDA 10.0, cuDNN 7.5.0): there the same memcopy only takes 50 ms. The duration grows proportionally to the batch size of the input.

This seems to be a bug in ONNXRuntime with newer CUDA/cuDNN versions.

I'm not sure what is being copied to the GPU, by the way, or how to figure this out. I converted the model from Tensorflow 2.3 using tf2onnx.convert, and first copied the two inputs of the model to the GPU using io_binding(), but this makes no difference in inference time, so I guess it's not the input that's being copied from CPU to GPU. If someone could explain how I can figure this out (and prevent it from happening if possible), that would also be much appreciated.
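(A minimal sketch of how such a profiling trace could be inspected, assuming the Chrome-tracing JSON that SessionOptions.enable_profiling writes, with durations in microseconds, and assuming the inserted copy nodes carry "Memcpy" in their event name:)

# Rough sketch, not part of the original script: sum the time the ORT profiler
# attributes to memcpy nodes versus all node events.
# Assumptions: the file name is whatever InferenceSession.end_profiling() returned,
# and the copy events contain "Memcpy" in their "name" field.
import json

with open("onnxruntime_profile.json") as f:
    events = json.load(f)

node_events = [e for e in events if e.get("cat") == "Node"]
memcpy_us = sum(e.get("dur", 0) for e in node_events if "Memcpy" in e.get("name", ""))
total_us = sum(e.get("dur", 0) for e in node_events)
print(f"memcpy: {memcpy_us / 1e3:.1f} ms of {total_us / 1e3:.1f} ms total node time")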

Urgency None

System information

To Reproduce
Run the attached onnx.py on Jetpack 4.5 to reproduce the issue.

onnxruntime_perf_test -e cuda -I ~/onnx/model.onnx -r 10 -t 10 also reproduces it, but because the batch size there is set to 1, the issue is less clear.

Expected behavior
The memcopy should be no slower than on older Jetpack (CUDA/cuDNN) versions.

Additional attachments
Attached are a trace from both Jetpack 4.2.1 and Jetpack 4.5, the verbose log from Jetpack 4.5, the tf2onnx conversion log, and the files to reproduce those traces: github issue.zip

Additional context
The Tensorflow model is from https://github.com/alleveenstra/attentionocr, where I removed the positional embedding generation using tf.eye because tf2onnx does not support this.

jywu-msft commented 3 years ago

Sorry, this issue slipped through the cracks. Do you know if this is an issue specific to JetPack, or can it be reproduced on x64 as well? Just to confirm: you're using the exact same version of ORT (built from source with the CUDA execution provider), and the only difference is the JetPack version, thus CUDA 10.0 vs 10.2? We need to isolate whether it's an issue with ORT, with the hardware platform (TX2/JetPack), or with CUDA.

Fred-Erik commented 3 years ago

I will try reproducing it on x64 in the coming days; I do not currently know whether it also occurs there.

And yes, that is exactly right! CUDA 10.0 vs 10.2 (or Jetpack 4.2.1 vs Jetpack 4.5), and apart from that the exact same setup, without the TRT execution provider.

Fred-Erik commented 3 years ago

I have trouble compiling ONNX Runtime from source on my Ubuntu 18.04 x64 system. The unit tests all fail; it seems like something goes wrong with linking CUDA, because it cannot find my GPU (Titan Xp). Is there an easier way to test this than reinstalling my Linux setup, globally installing the right CUDA version, compiling, and then switching to the other CUDA version to be tested and compiling again?

Are there pre-compiled versions I can try that have CUDA 10.0 and 10.2 support, or could you compile them for me? I can easily run precompiled versions of ONNX Runtime in Anaconda environments with different CUDA versions, but these environments do not include the CUDA toolkit needed to compile ONNX Runtime.

jywu-msft commented 3 years ago

Unfortunately the prebuilt binaries for onnxruntime 1.7.x use CUDA 11, but you could try onnxruntime 1.6.0, which was built with CUDA 10.2: https://github.com/microsoft/onnxruntime/releases/tag/v1.6.0
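(As a quick, optional sanity check, not part of the original exchange: a small sketch that confirms from Python which onnxruntime build is active and whether the CUDA execution provider is available.)

# Sanity-check sketch: confirm the installed onnxruntime build and whether the
# CUDA execution provider is available in the current environment.
import onnxruntime as rt

print(rt.__version__)                # e.g. 1.6.0
print(rt.get_device())               # "GPU" for a CUDA-enabled build
print(rt.get_available_providers())  # should include "CUDAExecutionProvider"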

For building from source, if you can use Docker, maybe you could use this dockerfile, which uses the nvidia CUDA 10.2 image as its base: https://github.com/microsoft/onnxruntime/blob/master/dockerfiles/Dockerfile.cuda

Fred-Erik commented 3 years ago

Thank you! I did not have prior experience with Docker, so this took a while. Useful to know, though.

It turns out that on x64, CUDA 10.2 with cuDNN 8.0 is slightly faster than CUDA 10.0 with cuDNN 7.6.5 (cuDNN 7.5.0 is not available as a CUDA Docker image, so I could not test that). See the attachments for the traces. Significantly, the memcpy times are only 3% and 5%, respectively, of the total runtime of a single batch-64 inference, compared to 90% of the runtime on the Jetson with CUDA 10.2.

attentionocr_onnxruntime_x64_traces.zip

System information of this test

So the problem seems to be hardware-specific.

jywu-msft commented 3 years ago

Sorry for the slow response. The data you provided is very useful. It's good to know it's not an issue on x64. Unfortunately we don't have a TX2; we will try to reproduce on a Jetson NX to see if the behavior shows up there.

Fred-Erik commented 3 years ago

I managed to get ahold of a Jetson Xavier NX myself, and attached are my results. Jetson_Xavier_NX_Jetpack_4_5_1.zip

As you can see, the issue also occurs there, though it is less pronounced: one inference of batch 64 takes 173 ms, of which 73 ms is spent copying the data to the GPU. Compare this with only 10 ms spent on memcopy out of a total inference time of 328 ms on the TX2 with Jetpack 4.2.1.

A Jetson Xavier NX needs a newer Jetpack, so I cannot check whether network inference would also be faster with older CUDA/cuDNN versions on this system, but that seems likely.

So if you have a Jetson Xavier NX you should be able to reproduce the issue. Did you manage to make any progress yet?

Fred-Erik commented 3 years ago

Any news on this issue? It is currently a blocking issue for us, so we're staying on an old Tensorflow version on our Jetson products. But it would be nice to be able to switch to ONNXRuntime at some point.

jywu-msft commented 3 years ago

Can you confirm the exact input shapes for your repro (for the 2 inputs)? I tried some values, but the Memcpy time isn't as obviously high as in your case.

jywu-msft commented 3 years ago

Btw, in the past we experienced some strange I/O issues with some versions of JetPack (I/O hangs on certain models, which turned out to be due to a driver issue), so this may not be a CUDA issue at all.

Fred-Erik commented 3 years ago

> Can you confirm the exact input shapes for your repro (for the 2 inputs)? I tried some values, but the Memcpy time isn't as obviously high as in your case.

I am running the same script as I supplied in the opening post to measure the inference time, so the shapes of the inputs are (64, 64, 128, 1) and (64, 1, 55):

import numpy as np
from time import monotonic
import onnxruntime as rt

# load model with options
so = rt.SessionOptions()
so.enable_profiling = True
rt.set_default_logger_severity(0)
so.graph_optimization_level = rt.GraphOptimizationLevel.ORT_ENABLE_ALL

sess = rt.InferenceSession("model.onnx", sess_options=so)
io_binding = sess.io_binding()

# print shapes
for inp in sess.get_inputs():
    print("input name", inp.name)
    print("input shape", inp.shape)
    print("input type", inp.type)

for outp in sess.get_outputs():
    print("output name", outp.name)
    print("output name", outp.shape)
    print("output name", outp.type)

# create inputs
encoder_input = np.random.rand(64, 64, 128, 1).astype("float32")
decoder_input = np.zeros((64, 1, 55)).astype("float32")

# copy inputs to GPU
encoder_input_gpu = rt.OrtValue.ortvalue_from_numpy(encoder_input, "cuda", 0)
decoder_input_gpu = rt.OrtValue.ortvalue_from_numpy(decoder_input, "cuda", 0)
io_binding.bind_input(
    name="encoder_input:0",
    device_type=encoder_input_gpu.device_name(),
    device_id=0,
    element_type=np.float32,
    shape=encoder_input_gpu.shape(),
    buffer_ptr=encoder_input_gpu.data_ptr(),
)
io_binding.bind_input(
    name="decoder_input:0",
    device_type=decoder_input_gpu.device_name(),
    device_id=0,
    element_type=np.float32,
    shape=decoder_input_gpu.shape(),
    buffer_ptr=decoder_input_gpu.data_ptr(),
)
io_binding.bind_output("Identity:0")
print("Inputs are on device:", encoder_input_gpu.device_name(), decoder_input_gpu.device_name())

# run first time (may take longer than next times)
sess.run_with_iobinding(io_binding)
res = io_binding.copy_outputs_to_cpu()[0]
print(res)

# keep running
for _ in range(5):
    t0 = monotonic()
    sess.run_with_iobinding(io_binding)
    res = io_binding.copy_outputs_to_cpu()[0]
    print(f"Did inference in {monotonic() - t0}") 

> Btw, in the past we experienced some strange I/O issues with some versions of JetPack (I/O hangs on certain models, which turned out to be due to a driver issue), so this may not be a CUDA issue at all.

But then you would have the same issue when you try to reproduce it on your Jetson Xavier NX, right? What memcopy times do you see in your traces when you run my code on a Jetson with Jetpack 4.5?

Fred-Erik commented 3 years ago

Did you manage to replicate my results? If not, how can I help you replicate the issue? It seems to me that if I have three Jetson modules on which I consistently get these results, it is not something specific to my hardware. And if there were something wrong with my drivers, that would imply that everyone has the same issue, because we only use the default drivers supplied by Nvidia.