GavinJiacheng opened this issue 3 years ago
@jywu-msft @HectorSVC do any of you have access to the Jetson device for this investigation?
I don't see the timing code here, so not sure if you included the time needed to create the ORT session. You should exclude that time. Also, it looks like x1 and x2 are first allocated on the gpu, then copied to cpu (to_numpy) and then created on the gpu again (ortvalue_from_numpy). Hope this is not included in the timing?
@pranavsharma Actually, I did use timing code; I just removed it before posting the code here.
I used time.time() to time the line ort_outs = ort_session.run([], ort_inputs)
in ONNX Runtime and torch_out = model(left, right)
in PyTorch. That's how I got the 2.5 s and 14 s values.
And yes, the conversion is not included in the timing; the timing covers only that one line. Sorry, the code is a bit messy since we ran many tests to figure out why the speed is so slow.
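For reference, a minimal timing sketch of the pattern described above (the helper name `time_call` and the warm-up/repeat counts are my own illustration, not from the original code). It times only the call itself, excluding session creation and input conversion, and averages over several runs since the first GPU invocation pays one-off initialization costs:

```python
import time

def time_call(fn, *args, warmup=1, repeats=5, **kwargs):
    """Time only fn(*args), excluding any setup done before this call.

    Runs warm-up iterations first (the first CUDA launch pays one-off
    initialization costs), then averages over `repeats` runs.
    """
    for _ in range(warmup):
        result = fn(*args, **kwargs)
    start = time.perf_counter()
    for _ in range(repeats):
        result = fn(*args, **kwargs)
    elapsed = (time.perf_counter() - start) / repeats
    return result, elapsed

# In the real benchmark this would wrap the two calls being compared:
#   ort_outs, secs = time_call(ort_session.run, None, ort_inputs)
#   torch_out, secs = time_call(model, left, right)
res, secs = time_call(sum, [1, 2, 3])
print(res, secs)
```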
I don't have a Jetson TX2 to debug with. One thing to check, though, is whether any nodes in the graph were assigned to the CPU during the graph-partitioning phase. Turn on verbose logging and look through the logs for this info.
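As a sketch, verbose logging can be enabled through the session options ("model.onnx" is a placeholder path for illustration; the exact log wording may differ between ONNX Runtime versions):

```python
import onnxruntime as ort

# Enable verbose logging so the graph-partitioning phase reports which
# execution provider each node was assigned to.
so = ort.SessionOptions()
so.log_severity_level = 0  # 0 = VERBOSE

# "model.onnx" is a placeholder path for illustration.
sess = ort.InferenceSession(
    "model.onnx",
    sess_options=so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
# Scan the stderr output for node-placement lines that mention
# CPUExecutionProvider instead of CUDAExecutionProvider.
```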
@pranavsharma We checked the log. There is no "fall back to CPU" message. Only a few lines contain the keyword "CPU":
2021-04-05 13:07:50.370189803 [I:onnxruntime:Default, bfc_arena.cc:23 BFCArena] Creating BFCArena for CUDA_CPU with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 memory limit: 18446744073709551615 arena_extend_strategy 0
2021-04-05 13:07:50.370303887 [I:onnxruntime:Default, bfc_arena.cc:23 BFCArena] Creating BFCArena for Cpu with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 memory limit: 18446744073709551615 arena_extend_strategy 0
2021-04-05 13:07:52.420892341 [I:onnxruntime:Default, bfc_arena.cc:280 AllocateRawInternal] Extending BFCArena for Cpu. bin_num:17 rounded_bytes:35389440
Describe the bug
Running the ONNX model is about 6x slower than running it in PyTorch.
Urgency
April 20, 2020
System information
To Reproduce
The code I used to run the ONNX model is:
The model I used is GwcNet:
The code we used to convert the PyTorch model to ONNX is:
The code we used to run the model on PyTorch is:
The download link for our ONNX file:
Expected behavior
Running this model in PyTorch takes around 2.5 s per frame, but around 14 s with the ONNX model, roughly 6 times slower.
We checked the log and did not find the "fall back to CPU" warning.
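One additional check, as a sketch: ask the session which execution providers were actually registered ("model.onnx" is again a placeholder path). If the CUDA provider fails to load, ONNX Runtime can fall back to CPU without the exact wording being searched for appearing in the logs:

```python
import onnxruntime as ort

# "model.onnx" is a placeholder path for illustration.
sess = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
# If CUDA initialized correctly, CUDAExecutionProvider appears in this
# list; if only CPUExecutionProvider is listed, everything ran on CPU.
print(sess.get_providers())
```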