rapidsai / cucim

cuCIM - RAPIDS GPU-accelerated image processing library
https://docs.rapids.ai/api/cucim/stable/
Apache License 2.0

[Design] Supporting Multithreading and Batch Processing #149

Closed gigony closed 2 years ago

gigony commented 3 years ago

Summary

Issue

cuCIM's read_region doesn't use multithreading, and whether to use it is delegated to the user. In addition, batch processing is needed to get better performance with nvJPEG/GDS.

The following requirements need to be met:

Note that batch processing would be supported only for a single (large) image with a list of patch start locations, not for a series of images.

Supporting this feature would help AI model training/inference workflows and, in the future, integration with cuNumeric (e.g., loading a whole slide image into a single NumPy array on the GPU).

Related issues:

Decision

Example API Usages

The following parameters would be added to the read_region method:

Loading an entire image using multiple threads

from cucim import CuImage

img = CuImage("input.tif")

region = img.read_region(level=1, num_workers=8)  # read whole image at level 1 using 8 workers
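
As a usage sketch (assuming the returned region exposes the NumPy array interface, as existing cuCIM regions do), the whole level-1 image could then be turned into a single NumPy array, which is the kind of workflow mentioned above for cuNumeric integration:

import numpy as np
from cucim import CuImage

img = CuImage("input.tif")

# Read the whole image at level 1 using 8 worker threads.
region = img.read_region(level=1, num_workers=8)

# Assuming the region supports the NumPy array interface (as existing cuCIM
# regions do), convert it to a single host-memory array.
arr = np.asarray(region)
print(arr.shape)  # e.g., (height, width, 3) for an RGB image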

Loading batched images using multiple threads

You can feed the locations of the regions as a list/tuple of locations or as a NumPy array of locations (e.g., ((<x for loc 1>, <y for loc 1>), (<x for loc 2>, <y for loc 2>), ...)). Each element of a location should be an int type (int64), and the dimension of each location should equal the dimension of the size. You can feed any iterable of locations; the dimensions of the input don't matter, because each item in the iterable is flattened once if that item is itself an iterable.

For example, you can feed the following list of locations:

import numpy as np
from cucim import CuImage

cache = CuImage.cache("per_process", memory_capacity=1024)

img = CuImage("image.tif")

locations = [[0,   0], [100,   0], [200,   0], [300,   0],
             [0, 200], [100, 200], [200, 200], [300, 200]]
# locations = np.array(locations)

region = img.read_region(locations, (224, 224), batch_size=4, num_workers=8)

for batch in region:
    img = np.asarray(batch)
    print(img.shape)
    for item in img:
        print(item.shape)

# (4, 224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)
# (4, 224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)
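
Since any iterable of locations is accepted (per the note above), a generator could be fed as well. The sketch below is illustrative only; the grid dimensions and the commented-out read_region call are assumptions, not part of the proposed API:

def grid_locations(width, height, patch_size, stride):
    # Yield (x, y) start locations covering a regular grid over the image.
    for y in range(0, height - patch_size + 1, stride):
        for x in range(0, width - patch_size + 1, stride):
            yield (x, y)

# region = img.read_region(grid_locations(4096, 4096, 224, 224),
#                          (224, 224), batch_size=4, num_workers=8)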

Loading images using nvJPEG and cuFile (GDS, GPUDirect Storage)

If "cuda" is specified for the device parameter of the read_region() method, it uses nvJPEG with GPUDirect Storage to load images.

Use CuPy instead of NumPy in this case; the image cache (CuImage.cache) would not be used.

import cupy as cp
from cucim import CuImage

img = CuImage("image.tif")

locations = [[0,   0], [100,   0], [200,   0], [300,   0],
             [0, 200], [100, 200], [200, 200], [300, 200]]
# locations = np.array(locations)

region = img.read_region(locations, (224, 224), batch_size=4, device="cuda")

for batch in region:
    img = cp.asarray(batch)
    print(img.shape)
    for item in img:
        print(item.shape)

# (4, 224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)
# (4, 224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)

Implementation Changes

Status

Done

Details

3rd party dependencies

TaskFlow

I tried using concurrentqueue to implement a thread pool, but its performance was worse than TaskFlow's (using concurrentqueue took about 3.3 to 3.8, whereas using TaskFlow consistently took about 3.2 in my simple test).

CC: @drbeh @JHancox @Jlefman @rahul-imaging

JHancox commented 3 years ago

A couple of points to consider: I guess patch/batch shuffling would be done by the consumer rather than by cuCIM itself? Would the loading be done asynchronously/lazily with respect to the read_region call?

gigony commented 3 years ago

Thanks @JHancox for the feedback!

I got the idea for the added parameters from PyTorch DataLoader's parameters (https://pytorch.org/docs/stable/_modules/torch/utils/data/dataloader.html#DataLoader). read_region() would accept an iterator of patch locations and construct an array of locations at the C++ level in advance (or keep a pointer to the location list), instead of calling next() to get the 'next location', because getting the next location from Python requires acquiring the GIL, which would reduce performance.
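
In other words, the intended behavior could look roughly like the sketch below (an illustration of the idea only; the helper name is made up and this is not the actual C++ implementation):

import numpy as np

def materialize_locations(loc_iter):
    # Consume the Python iterator once, up front, flattening each item one
    # level if it is itself an iterable, and hand a contiguous int64 array
    # to the native side. After this point, no per-patch next() call (and
    # hence no GIL acquisition) is needed while worker threads read patches.
    flat = []
    for item in loc_iter:
        try:
            flat.extend(int(v) for v in item)
        except TypeError:
            flat.append(int(item))
    return np.asarray(flat, dtype=np.int64)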

I guess patch/batch shuffling would be done by the consumer rather than by cucim itself

Yes, for now, because I thought patch/batch shuffling is a matter of providing a shuffled list of locations from the user side. If providing a shuffle parameter would be meaningful, let us know and I can add the feature later.
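
As an illustration of doing this on the user side (a sketch; the read_region call mirrors the batch example above and is not a new API):

import numpy as np
from cucim import CuImage

img = CuImage("image.tif")

locations = np.array([[0, 0], [100, 0], [200, 0], [300, 0],
                      [0, 200], [100, 200], [200, 200], [300, 200]],
                     dtype=np.int64)

# Shuffle the patch locations before passing them to read_region (e.g., once
# per epoch) so that batches, and any leftover patches, differ between runs.
rng = np.random.default_rng()
shuffled = locations[rng.permutation(len(locations))]

region = img.read_region(shuffled, (224, 224), batch_size=4, num_workers=8)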

Would the loading be done asynchronously/lazily with respect to the read_region call?

If a list (or iterator) of locations is provided, read_region() would return an iterator object, and the next batch(es) are loaded asynchronously.

region = img.read_region(locations, (224, 224), batch_size=4, num_workers=8)
for batch in region:
    # While the user can access the `batch` image here, the next batch(es)
    # are loaded in the background for prefetching and will be available in
    # the next iteration of the `for` loop.

JHancox commented 3 years ago

One other point regarding shuffling: it could be a good way of not always discarding the same patches when the batch size does not divide evenly into the image. I guess you could implement another mechanism to randomize which patches are left out.

gigony commented 2 years ago

This feature is available in v22.02.00: https://github.com/rapidsai/cucim/wiki/release_notes_v22.02.00