Potential improvements to jpeg decoding on GPU

NicolasHug commented 3 years ago

A minimal version of jpeg decoding on GPUs was implemented in https://github.com/pytorch/vision/pull/3792. Here's a list of potential future improvements:

- Support for A100 devices
- Support for batch decoding (I didn't see any speed improvement in my experiments in https://github.com/pytorch/vision/pull/2786#issuecomment-832148710, but perhaps I missed something)
- Use a finer-grained API for the decoding phases, and potentially change the decoding backend depending on the image size, taking inspiration from https://github.com/NVIDIA/CUDALibrarySamples/tree/master/nvJPEG/nvJPEG-Decoder-MultipleInstances
- As per https://github.com/pytorch/vision/pull/3792#discussion_r629290933, we could:
  - Avoid creating tensor views and use some pointer arithmetic
  - Investigate whether the layout (CHW vs HWC) has an impact on performance
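
To make the last two items concrete: the two layouts differ only in how a (channel, row, column) index maps to an offset in a flat buffer. The following is a minimal pure-Python sketch of that index arithmetic, not torchvision's or nvjpeg's code. The function names are hypothetical, and the image size is borrowed from the 517x606 test image used later in this thread.

```python
def chw_offset(c, h, w, C, H, W):
    """Flat offset of element (c, h, w) in a planar (CHW) buffer."""
    return (c * H + h) * W + w


def hwc_offset(c, h, w, C, H, W):
    """Flat offset of element (c, h, w) in an interleaved (HWC) buffer."""
    return (h * W + w) * C + c


C, H, W = 3, 606, 517  # channels, height, width

# In CHW, the channels of one pixel are a whole plane (H * W elements) apart.
plane_stride = chw_offset(1, 0, 0, C, H, W) - chw_offset(0, 0, 0, C, H, W)

# In HWC, the channels of one pixel are adjacent in memory.
pixel_stride = hwc_offset(1, 0, 0, C, H, W) - hwc_offset(0, 0, 0, C, H, W)

print(plane_stride, pixel_stride)
```

Whether one layout decodes faster than the other is exactly the open question in the list above; the sketch only shows what would change in the addressing.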

cceyda commented 3 years ago

I have just tested the new v0.10.0 release with beta support for nvjpeg, but I found GPU decoding to be about 2x slower than CPU decoding.

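import os
import numpy as np
import torch
from torchvision.io import decode_jpeg

# folder: path to a directory containing the .jpg files to decode
images_bytes = [
    np.frombuffer(open(os.path.join(folder, a), 'rb').read(), dtype=np.uint8)
    for a in os.listdir(folder) if a.endswith('jpg')
]

#%%timeit -n 1 -r 100
for img_bytes in images_bytes:
    z = torch.from_numpy(img_bytes)
    z = decode_jpeg(z, device='cuda')  # CPU variant: z = decode_jpeg(z)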

Using:

Titan RTX
Cuda 10.2
python 3.6.9

benchmarking code: https://github.com/cceyda/image-checker/blob/master/examples/benchmark_jpeg_decode_extended.ipynb

I also kept getting the error below with CUDA 11.1:

~/.local/lib/python3.6/site-packages/torchvision/io/image.py in decode_jpeg(input, mode, device)
     174     device = torch.device(device)
     175     if device.type == 'cuda':
 --> 176         output = torch.ops.image.decode_jpeg_cuda(input, mode.value, device)
     177     else:
     178         output = torch.ops.image.decode_jpeg(input, mode.value)

 RuntimeError: nvjpegDecode failed: 5

NicolasHug commented 3 years ago

Hi @cceyda, the GPU benchmarks you're reporting should call something like torch.cuda.synchronize between runs to get accurate results. For more comparable results, would you mind using something like the code in https://github.com/pytorch/vision/pull/2786#issuecomment-832148710 ? You can find it by expanding the "Benchmark code for ref" section.
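
CUDA work is launched asynchronously, so a wall-clock timer that stops before the device has finished can measure launch overhead rather than decode time. Below is a minimal sketch of the timing pattern, assuming a hypothetical helper `median_ms` (not part of torch or torchvision) that takes the synchronization step as an optional callable. For GPU runs, `torch.cuda.synchronize` would be passed as `sync`; the demonstration here is CPU-only so that it runs without a GPU.

```python
import statistics
import time


def median_ms(fn, num_runs=100, sync=None):
    """Median wall-clock time of fn() in milliseconds.

    sync, if given, is called before starting and before stopping the
    clock, so that queued asynchronous work is included in the measurement.
    """
    timings = []
    for _ in range(num_runs):
        if sync is not None:
            sync()  # drain pending work left over from the previous run
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait until the work launched by fn() has finished
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)


# CPU-only demonstration with a stand-in workload; no sync needed here.
print(f"{median_ms(lambda: sum(range(10_000)), num_runs=20):.3f} ms")
```

For the GPU case the call would look something like `median_ms(lambda: decode_jpeg(data, device='cuda'), sync=torch.cuda.synchronize)`.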

Also please note that this issue is for tracking potential improvements to the GPU decoding. Could you please submit the bug failure as a separate issue? It would be easier to keep track of it.

cceyda commented 3 years ago

Even with the benchmark code I adapted from nvjpeg_bench.py used in #2786 (comment), I always get slower results with CUDA decoding. I have tried many different versions of the benchmark.

nvjpeg_bench.py below:

import torch
from torch.utils.benchmark import Timer
from torchvision.io.image import decode_jpeg, read_file, ImageReadMode, write_jpeg, encode_jpeg
from torchvision import transforms as T

img_path = './grace_hopper_517x606.jpg'
data = read_file(img_path)
img = decode_jpeg(data)

def sumup(name, mean, median, throughput, fps):
    print(
        f"{name:<10} mean: {mean:.3f} ms, median: {median:.3f} ms, "
        f"Throughput = {throughput:.3f} Megapixel / sec, "
        f"{fps:.3f} fps"
    )

print(f"img.shape = {img.shape}")
print(f"data.shape = {data.shape}")
height, width = img.shape[-2:]

num_pixels = height * width
num_runs = 100

stmt = "a=decode_jpeg(data, device='{}')\na=a.to(device='cuda:0')" # added .to(device) to account for moving to gpu time
setup = 'from torchvision.io.image import decode_jpeg'
globals = {'data': data}

for device in ('cpu', 'cuda'):
    t = Timer(stmt=stmt.format(device), setup=setup, globals=globals).timeit(num_runs)
    sumup(device, t.mean * 1000, t.median * 1000, num_pixels / 1e6 / t.median, 1 / t.median)

Server 1 ENV:

Collecting environment information...
PyTorch version: 1.9.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.4 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.14.4
Libc version: glibc-2.25

Python version: 3.6 (64-bit runtime)
Python platform: Linux-4.15.0-108-generic-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration: 
GPU 0: TITAN RTX

Nvidia driver version: 460.67
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.1
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.18.1
[pip3] pytorch-lightning==1.4.0.dev0
[pip3] torch==1.9.0
[pip3] torch-model-archiver==0.2.0
[pip3] torchaudio==0.9.0
[pip3] torchfile==0.1.0
[pip3] torchgeometry==0.1.2
[pip3] torchmetrics==0.3.2
[pip3] torchserve==0.4.0
[pip3] torchserve-dashboard==0.3.2
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.5.0
[pip3] torchvision==0.10.0
[conda] blas                      1.0                         mkl  
[conda] mkl                       2019.4                      243  
[conda] mkl-service               2.3.0            py37he904b0f_0  
[conda] mkl_fft                   1.0.14           py37ha843d7b_0  
[conda] mkl_random                1.1.0            py37hd6b4f25_0  
[conda] numpy                     1.17.2           py37haad9e8e_0  
[conda] numpy-base                1.17.2           py37hde5b4d6_0  
[conda] numpydoc                  0.9.1                      py_0

Server 1 results (CUDA about 2x slower):

#run 1 python3 nvjpeg_bench.py 
cpu        mean: 2.071 ms, median: 2.071 ms, Throughput = 151.248 Megapixel / sec, 482.753 fps
cuda       mean: 4.988 ms, median: 4.988 ms, Throughput = 62.816 Megapixel / sec, 200.497 fps
#run 2
cpu        mean: 2.157 ms, median: 2.157 ms, Throughput = 145.254 Megapixel / sec, 463.624 fps
cuda       mean: 4.417 ms, median: 4.417 ms, Throughput = 70.937 Megapixel / sec, 226.417 fps
#run 3
cpu        mean: 2.182 ms, median: 2.182 ms, Throughput = 143.612 Megapixel / sec, 458.381 fps
cuda       mean: 3.836 ms, median: 3.836 ms, Throughput = 81.682 Megapixel / sec, 260.712 fps
#run 4
cpu        mean: 2.178 ms, median: 2.178 ms, Throughput = 143.874 Megapixel / sec, 459.217 fps
cuda       mean: 3.725 ms, median: 3.725 ms, Throughput = 84.108 Megapixel / sec, 268.455 fps
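
The throughput and fps columns are derived from the median alone, via the `num_pixels / 1e6 / t.median` and `1 / t.median` expressions in the script. As a worked check, assuming the file name `grace_hopper_517x606.jpg` means a 517x606 (width x height) image, the run 1 CPU line can be approximately reproduced from its printed median. The helper name `derive` is made up for this sketch.

```python
def derive(median_ms, width, height):
    """Return (megapixels per second, frames per second) for one image."""
    median_s = median_ms / 1000
    megapixels = width * height / 1e6
    return megapixels / median_s, 1 / median_s


# Run 1, CPU: printed median of 2.071 ms.
mp_per_s, fps = derive(2.071, width=517, height=606)
print(f"{mp_per_s:.1f} Megapixel / sec, {fps:.1f} fps")
```

The results agree with the printed 151.248 Megapixel / sec and 482.753 fps up to the rounding of the median to three decimals.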

The CUDA 11.1 bug disappeared mysteriously; ipykernel must have been reconnecting to an old kernel despite restarts 🤷 I'll open a separate issue if I re-encounter and isolate it.

So I also ran benchmarks on an A100 with CUDA 11.1.

Server 2 ENV:

Collecting environment information...
PyTorch version: 1.9.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.1 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3
Libc version: glibc-2.31

Python version: 3.8 (64-bit runtime)
Python platform: Linux-5.4.0-70-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.1.105
GPU models and configuration: 
GPU 0: A100-PCIE-40GB

Nvidia driver version: 460.73.01
cuDNN version: Probably one of the following:
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.3
[pip3] torch==1.9.0+cu111
[pip3] torch-model-archiver==0.3.0b20210517
[pip3] torchaudio==0.9.0
[pip3] torchfile==0.1.0
[pip3] torchgeometry==0.1.2
[pip3] torchserve==0.3.0b20210517
[pip3] torchserve-dashboard==0.4.0
[pip3] torchtext==0.8.1
[pip3] torchvision==0.10.0+cu111
[conda] Could not collect

Server 2 results (CUDA about 5x slower):

#run 1 python3 nvjpeg_bench.py 
cpu        mean: 1.703 ms, median: 1.703 ms, Throughput = 184.022 Megapixel / sec, 587.362 fps
cuda       mean: 8.776 ms, median: 8.776 ms, Throughput = 35.701 Megapixel / sec, 113.949 fps
#run 2
cpu        mean: 1.765 ms, median: 1.765 ms, Throughput = 177.545 Megapixel / sec, 566.688 fps
cuda       mean: 8.709 ms, median: 8.709 ms, Throughput = 35.975 Megapixel / sec, 114.825 fps
#run 3
cpu        mean: 1.741 ms, median: 1.741 ms, Throughput = 179.986 Megapixel / sec, 574.481 fps
cuda       mean: 8.586 ms, median: 8.586 ms, Throughput = 36.492 Megapixel / sec, 116.474 fps
#run 4
cpu        mean: 1.735 ms, median: 1.735 ms, Throughput = 180.537 Megapixel / sec, 576.239 fps
cuda       mean: 8.950 ms, median: 8.950 ms, Throughput = 35.005 Megapixel / sec, 111.728 fps

(Nothing else was running on the gpu during benchmarks)
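
The rough 2x and 5x slowdown figures can be checked directly from the medians printed above. A small sketch that computes the CUDA/CPU ratio per run; the numbers are copied from the two result listings, and the variable and function names are only for illustration.

```python
# (cpu_median_ms, cuda_median_ms) per run, copied from the listings above.
server1 = [(2.071, 4.988), (2.157, 4.417), (2.182, 3.836), (2.178, 3.725)]
server2 = [(1.703, 8.776), (1.765, 8.709), (1.741, 8.586), (1.735, 8.950)]


def slowdowns(runs):
    """Ratio of CUDA median to CPU median for each run."""
    return [cuda / cpu for cpu, cuda in runs]


for name, runs in (("Server 1", server1), ("Server 2", server2)):
    ratios = slowdowns(runs)
    print(f"{name}: " + ", ".join(f"{r:.2f}x" for r in ratios))
```

On these numbers the Titan RTX runs fall between roughly 1.7x and 2.4x slower, and the A100 runs between roughly 4.9x and 5.2x slower.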

NicolasHug commented 3 years ago

Thanks for the details, we will keep that in mind

> cuda 11.1 bug disappeared mysteriously, ipykernel must have been reconnecting to an old one despite restarts 🤷 I'll open a separate issue if I re-encounter & isolate it.

Sounds good!

> So I ran benchmarks also on an A100 with cuda 11.1

Just note that while the code runs on A100, we haven't implemented full A100 support yet, so we can't take advantage of the dedicated hardware instructions that the A100 has. We'll look into it in the future (it is one of the items in this issue), but I don't have access to an A100 at the moment.

cceyda commented 3 years ago

I just ran it on Colab and CUDA is slightly faster there... I don't know what is wrong with my local setup :/

NicolasHug commented 1 month ago

I think most of these items have been addressed in https://github.com/pytorch/vision/pull/8496, so I'll close this issue. Feel free to open follow-up issues for any feedback on the jpeg GPU decoder.

pytorch / vision

Potential improvements to jpeg decoding on GPU #3848