rapidsai / cucim

cuCIM - RAPIDS GPU-accelerated image processing library
https://docs.rapids.ai/api/cucim/stable/
Apache License 2.0

[Design] Supporting Multithreading and Batch Processing #149

Closed gigony closed 2 years ago

gigony commented 3 years ago

Summary

Issue

cuCIM's read_region doesn't use multithreading, and whether to use it is delegated to the user. In addition, batch processing is needed to get better performance with nvJPEG/GDS.

The following requirements need to be met:

Note that batch processing would be supported only for a single (large) image with a list of patch start locations, not for a series of images.

Supporting this feature would help AI model training/inference workflows and, in the future, integration with cuNumeric (e.g., loading a whole slide image into a single NumPy array on the GPU).

Related issues:

Decision

Example API Usages

The following parameters would be added to the read_region method:

Loading an entire image using multiple threads

from cucim import CuImage

img = CuImage("input.tif")

region = img.read_region(level=1, num_workers=8)  # read whole image at level 1 using 8 workers
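
As a usage sketch (assuming the returned region exposes the NumPy array interface, as existing cuCIM regions do), the whole level-1 image could then be turned into a single NumPy array, which is the kind of workflow mentioned above for cuNumeric integration:

import numpy as np
from cucim import CuImage

img = CuImage("input.tif")

# Read the whole image at level 1 using 8 worker threads.
region = img.read_region(level=1, num_workers=8)

# Assuming the region supports the NumPy array interface (as existing cuCIM
# regions do), convert it to a single host-memory array.
arr = np.asarray(region)
print(arr.shape)  # e.g., (height, width, 3) for an RGB image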

Loading batched images using multiple threads

You can feed the locations of the regions as a list/tuple of locations or as a NumPy array of locations (e.g., ((<x for loc 1>, <y for loc 1>), (<x for loc 2>, <y for loc 2>), ...)). Each element of a location should be an int type (int64), and the dimension of each location should equal the dimension of the size. You can feed any iterable of locations; the dimensions of the input don't matter, because each item in the iterable is flattened once if that item is itself an iterable.

For example, you can feed the following list of locations:

import numpy as np
from cucim import CuImage

cache = CuImage.cache("per_process", memory_capacity=1024)

img = CuImage("image.tif")

locations = [[0,   0], [100,   0], [200,   0], [300,   0],
             [0, 200], [100, 200], [200, 200], [300, 200]]
# locations = np.array(locations)

region = img.read_region(locations, (224, 224), batch_size=4, num_workers=8)

for batch in region:
    img = np.asarray(batch)
    print(img.shape)
    for item in img:
        print(item.shape)

# (4, 224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)
# (4, 224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)
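
Since any iterable of locations is accepted (per the note above), a generator could be fed as well. The sketch below is illustrative only; the grid dimensions and the commented-out read_region call are assumptions, not part of the proposed API:

def grid_locations(width, height, patch_size, stride):
    # Yield (x, y) start locations covering a regular grid over the image.
    for y in range(0, height - patch_size + 1, stride):
        for x in range(0, width - patch_size + 1, stride):
            yield (x, y)

# region = img.read_region(grid_locations(4096, 4096, 224, 224),
#                          (224, 224), batch_size=4, num_workers=8)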

Loading images using nvJPEG and cuFile (GDS, GPUDirect Storage)

If "cuda" is specified for the device parameter of the read_region() method, it uses nvJPEG with GPUDirect Storage to load images.

Use CuPy instead of NumPy in this case; the image cache (CuImage.cache) would not be used.

import cupy as cp
from cucim import CuImage

img = CuImage("image.tif")

locations = [[0,   0], [100,   0], [200,   0], [300,   0],
             [0, 200], [100, 200], [200, 200], [300, 200]]
# locations = np.array(locations)

region = img.read_region(locations, (224, 224), batch_size=4, device="cuda")

for batch in region:
    img = cp.asarray(batch)
    print(img.shape)
    for item in img:
        print(item.shape)

# (4, 224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)
# (4, 224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)
# (224, 224, 3)

Implementation Changes

Status

Done

Details

3rd party dependencies

TaskFlow

I tried using concurrentqueue to implement a thread pool, but its performance was worse than TaskFlow's (using concurrentqueue took about 3.3 to 3.8, whereas using TaskFlow consistently took about 3.2 in my simple test).

CC: @drbeh @JHancox @Jlefman @rahul-imaging

JHancox commented 3 years ago

A couple of points to consider: I guess patch/batch shuffling would be done by the consumer rather than by cuCIM itself? Would the loading be done asynchronously/lazily with respect to the read_region call?

gigony commented 3 years ago

Thanks @JHancox for the feedback!

I got the idea for the added parameters from PyTorch DataLoader's parameters (https://pytorch.org/docs/stable/_modules/torch/utils/data/dataloader.html#DataLoader). read_region() would accept an iterator of patch locations and construct an array of locations at the C++ level in advance (or keep a pointer to the location list), instead of calling next() to get the 'next location', because getting the next location from Python requires acquiring the GIL, which would reduce performance.
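
In other words, the intended behavior could look roughly like the sketch below (an illustration of the idea only; the helper name is made up and this is not the actual C++ implementation):

import numpy as np

def materialize_locations(loc_iter):
    # Consume the Python iterator once, up front, flattening each item one
    # level if it is itself an iterable, and hand a contiguous int64 array
    # to the native side. After this point, no per-patch next() call (and
    # hence no GIL acquisition) is needed while worker threads read patches.
    flat = []
    for item in loc_iter:
        try:
            flat.extend(int(v) for v in item)
        except TypeError:
            flat.append(int(item))
    return np.asarray(flat, dtype=np.int64)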

I guess patch/batch shuffling would be done by the consumer rather than by cucim itself

Yes, for now, because I thought patch/batch shuffling is a matter of providing a shuffled list of locations from the user side. If providing a shuffle parameter would be meaningful, let us know and I can add the feature later.
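
As an illustration of doing this on the user side (a sketch; the read_region call mirrors the batch example above and is not a new API):

import numpy as np
from cucim import CuImage

img = CuImage("image.tif")

locations = np.array([[0, 0], [100, 0], [200, 0], [300, 0],
                      [0, 200], [100, 200], [200, 200], [300, 200]],
                     dtype=np.int64)

# Shuffle the patch locations before passing them to read_region (e.g., once
# per epoch) so that batches, and any leftover patches, differ between runs.
rng = np.random.default_rng()
shuffled = locations[rng.permutation(len(locations))]

region = img.read_region(shuffled, (224, 224), batch_size=4, num_workers=8)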

Would the loading be done asynchronously/lazily with respect to the read_region call?

If a list (or iterator) of locations is provided, read_region() would return an iterator object, and the next batch(es) are loaded asynchronously.

region = img.read_region(locations, (224, 224), batch_size=4, num_workers=8)
for batch in region:
    # While the user can access the `batch` image here, the next batch(es)
    # are loaded in the background for prefetching and will be available in
    # the next iteration of the `for` loop.

JHancox commented 3 years ago

One other point regarding shuffling: it could be a good way of not always discarding the same patches when the batch size does not divide evenly into the image. I guess you could implement another mechanism to randomize which patches are left out.

gigony commented 2 years ago

This feature is available in v22.02.00: https://github.com/rapidsai/cucim/wiki/release_notes_v22.02.00