pytorch / data

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.

Weird behaviour of `InMemoryCacheHolder` not really speeding things up #938

Closed FrancescoSaverioZuppichini closed 1 year ago

FrancescoSaverioZuppichini commented 1 year ago

🐛 Describe the bug

Weird behaviour of `InMemoryCacheHolder` not really speeding things up

First iteration took 9s, all the others 4s. Why? Shouldn't it be cached?

# download camvid and place it here
import torchdata.datapipes.iter as pipes
from pathlib import Path
from torchvision.io import read_image
from torch.utils.data import DataLoader
from time import perf_counter
from PIL import Image

dataset_dir = Path('./camvid')

pipe = pipes.Zipper(
    pipes.FileLister([dataset_dir / "images"], masks='*png'),
).map(lambda x: (read_image(x[0])))

pipe = pipes.InMemoryCacheHolder(pipe, size=32000).sharding_filter() # 8GB
dl = DataLoader(pipe, batch_size=32, num_workers=8, persistent_workers=True, prefetch_factor=2)

for i in range(10):
    start = perf_counter()
    for data in dl:
        # print(image.shape)
        continue

    print(f"[{i}]Elapsed {perf_counter() - start: .2f}")

Output

[0]Elapsed  18.8
[1]Elapsed  4.41
[2]Elapsed  4.47
[3]Elapsed  4.75
[4]Elapsed  4.53
[5]Elapsed  4.41
[6]Elapsed  4.38
[7]Elapsed  4.41
[8]Elapsed  4.41
[9]Elapsed  4.41

If I set num_workers=1, the first iteration is faster, and then all the others take the same time.

If I use .batch(32) and iterate the datapipe directly (useless in RL since, to my understanding, I need more workers to prepare the next batches), I see a speed-up:

...
pipe = pipes.Zipper(
    pipes.FileLister([dataset_dir / "images"], masks='*png'),
).map(lambda x: (read_image(x[0])))

pipe = pipes.InMemoryCacheHolder(pipe, size=32000).batch(32) # 8GB

for i in range(10):
    start = perf_counter()
    for data in pipe:
        # print(image.shape)
        continue

    print(f"[{i}]Elapsed {perf_counter() - start: .2f}")
[0]Elapsed  15.99
[1]Elapsed  0.03
[2]Elapsed  0.03
[3]Elapsed  0.03
[4]Elapsed  0.03
[5]Elapsed  0.03
[6]Elapsed  0.03
[7]Elapsed  0.03
[8]Elapsed  0.03
[9]Elapsed  0.03

Thanks!

Versions

Collecting environment information...
PyTorch version: 1.13.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 14 2022, 12:59:47)  [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-1026-aws-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 10.1.243
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla V100-SXM2-16GB
Nvidia driver version: 470.161.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.24.1
[pip3] torch==1.13.1
[pip3] torchdata==0.5.1
[pip3] torchvision==0.14.1
[conda] Could not collect
ejguan commented 1 year ago

I think the result already shows the perf boost from caching.

First iteration took 9s, all the others 4s. Why? Shouldn't it be cached?

Your pipeline is relatively simple, so I think the major overhead is the data being passed between the main process and the worker processes. So, you won't observe the perf gain from caching as clearly as you would when running everything in the main process.

FrancescoSaverioZuppichini commented 1 year ago

@ejguan is there any way for me to check that? Using ffcv resulted in a similar speed to the second snippet.

ejguan commented 1 year ago

@FrancescoSaverioZuppichini Sorry, what do you mean? I feel your pipeline would be better served by DataLoader with 0 workers, to get rid of the multiprocessing pieces.
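
For reference, a minimal sketch of that comparison, assuming the `pipe` built in the first snippet above: with num_workers=0 everything stays in the main process, so after the first epoch the cached tensors are reused directly instead of being sent back from worker processes.

from time import perf_counter
from torch.utils.data import DataLoader

# num_workers=0 keeps everything in the main process: no worker start-up and
# no inter-process transfer, so the in-memory cache is hit directly after the
# first epoch.
dl = DataLoader(pipe, batch_size=32, num_workers=0)

for i in range(3):
    start = perf_counter()
    for data in dl:
        continue

    print(f"[{i}]Elapsed {perf_counter() - start: .2f}")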

FrancescoSaverioZuppichini commented 1 year ago

Don't you need multiple workers to speed things up and preload batches to GPU?

ejguan commented 1 year ago

It's all about tradeoffs, right? Your current pipeline suffers more from multiprocessing than it benefits from it. And, multi-worker won't help to preload batches to GPU. For DataLoader, setting pin_memory=True moves batches into pinned (page-locked) host memory first; then the cost of moving the Tensor from that memory to the GPU is minimal.
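
As an illustration, a minimal sketch of that pattern, reusing the `pipe` from the first snippet (the training step itself is omitted): pin_memory=True puts the collated batch in page-locked host memory, and the copy to the GPU is then done explicitly, with non_blocking=True so it can overlap with compute.

import torch
from torch.utils.data import DataLoader

device = torch.device("cuda")

# pin_memory=True: collated batches land in pinned (page-locked) host memory,
# which makes the host-to-device copy cheap and lets it run asynchronously.
dl = DataLoader(pipe, batch_size=32, num_workers=0, pin_memory=True)

for batch in dl:
    # DataLoader itself never moves data to the GPU; the transfer happens here.
    batch = batch.to(device, non_blocking=True)
    # ... forward / backward pass ...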

FrancescoSaverioZuppichini commented 1 year ago

@ejguan sure, I don't have any opinion about best practices here; I am just wondering what the best way to do things is. The PyTorch docs suggest DataLoader + multiple workers is the way to go, so I'd like to know if I should apply the same approach with torchdata.

I was wondering if you could give me a little more context about "And, multi-worker won't help to preload batches to GPU."

ejguan commented 1 year ago

multi-worker won't help to preload batches to GPU.

DataLoader won't move data to the GPU, whether you use multiple workers or not. And, for TorchData, we do have a plan to support GPU operations in the future, but it's still under discussion.

FrancescoSaverioZuppichini commented 1 year ago

Thanks a lot for the highlights!

ejguan commented 1 year ago

Closing this for now. For GPU operations, we already have an open issue: https://github.com/pytorch/data/issues/761

Please feel free to reopen it if you have further issues on the same topic.