Closed gigony closed 2 years ago
A couple of points to consider: I guess patch/batch shuffling would be done by the consumer rather than by cucim itself? Would the loading be done asynchronously/lazily with respect to the read_region call?
Thanks @JHancox for the feedback!
I got the idea for the added parameters from PyTorch DataLoader's parameters (https://pytorch.org/docs/stable/_modules/torch/utils/data/dataloader.html#DataLoader)
read_region() would accept an iterator of patch locations and construct an array of locations at the C++ level in advance (or keep a pointer to the location list), instead of calling next() to get each 'next location'. Getting the next location from Python requires acquiring the GIL, which would reduce performance.
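To illustrate the idea of consuming the location iterator once up front, here is a minimal pure-Python sketch. It is not cuCIM's implementation (which does this at the C++ level); `flatten_once`, `materialize_locations`, and `as_locations` are hypothetical helper names, and a standard-library `array` of signed 64-bit integers stands in for the native location buffer.

```python
from array import array
from collections.abc import Iterable


def flatten_once(locations):
    """Yield coordinates, flattening one level when an item is itself iterable."""
    for item in locations:
        if isinstance(item, Iterable):
            yield from item
        else:
            yield item


def materialize_locations(locations, ndim=2):
    """Consume the iterator once and pack all coordinates into a flat int64 buffer.

    After this call no further Python-level next() calls are needed, which is
    the point of building the array in advance.
    """
    flat = array("q", (int(v) for v in flatten_once(locations)))
    if len(flat) % ndim != 0:
        raise ValueError("number of coordinates must be a multiple of ndim")
    return flat


def as_locations(flat, ndim=2):
    """View the flat buffer as a list of ndim-sized location tuples."""
    return [tuple(flat[i:i + ndim]) for i in range(0, len(flat), ndim)]


flat = materialize_locations(iter([(0, 0), (224, 0), (448, 0)]))
pairs = as_locations(flat)
```

The flat buffer can be handed to native code in one step, so worker threads never need to call back into Python for the next location.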
I guess patch/batch shuffling would be done by the consumer rather than by cucim itself
Yes, for now, because I thought patch/batch shuffling is a matter of providing a shuffled list of locations from the user side.
If providing a shuffle parameter is meaningful, let us know, and I can add the feature later.
Would the loading be done asynchronously/lazily with respect to the read_region call?
If a list (or iterator) of locations is provided, read_region() would return an iterator object, and the next batches would be loaded asynchronously.
region = img.read_region(locations, (224, 224), batch_size=4, num_workers=8)
for batch in region:
    ...
(While the user accesses the `batch` image here, the next batch(es) are loaded in the background as a prefetch and become available in the next iteration of the `for` loop.)
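The prefetching behaviour described above can be sketched in plain Python with a bounded queue and a background thread. This is only an illustration of the pattern, not how cuCIM implements it; `prefetch` and `load_batch` are hypothetical names, and `load_batch` is a stand-in for the actual image decoding.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetch(batches, prefetch_factor=2):
    """Yield items from `batches` while a background thread loads ahead.

    At most `prefetch_factor` batches are buffered at any time.
    """
    buf = queue.Queue(maxsize=prefetch_factor)

    def worker():
        try:
            for batch in batches:
                buf.put(batch)  # blocks while the buffer is full
        finally:
            buf.put(_DONE)  # always signal completion to the consumer

    # Daemon thread: it will not keep the process alive if the consumer stops early.
    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buf.get()
        if item is _DONE:
            return
        yield item


def load_batch(locations):
    """Hypothetical stand-in for decoding the patches at `locations`."""
    return [f"patch@{x},{y}" for x, y in locations]


location_batches = [[(0, 0), (224, 0)], [(448, 0)]]
lazy_batches = (load_batch(b) for b in location_batches)
result = list(prefetch(lazy_batches))
```

While the consumer works on one batch, the worker thread is already evaluating the next item of the generator, so batch order is preserved and loading overlaps with consumption.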
One other point regarding shuffling - it could be a good way of not always discarding the same patches when the batch size does not divide perfectly into the image. I guess you could implement another mechanism to randomize which patches are left out.
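As a rough sketch of this point, the snippet below batches a location list on the user side with optional seeded shuffling and `drop_last`. `make_batches` is a hypothetical helper, not a cuCIM function; it only mirrors the parameter names discussed in this issue.

```python
import random


def make_batches(locations, batch_size, drop_last=False, shuffle=False, seed=0):
    """Split locations into batches, optionally shuffling with a fixed seed first."""
    locs = list(locations)
    if shuffle:
        # A dedicated Random instance keeps the shuffle reproducible per seed.
        random.Random(seed).shuffle(locs)
    batches = [locs[i:i + batch_size] for i in range(0, len(locs), batch_size)]
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()  # discard the incomplete final batch
    return batches


locations = [(x, 0) for x in range(0, 2240, 224)]  # 10 patch locations

# Without shuffling, the same trailing patches are dropped every time.
ordered = make_batches(locations, batch_size=4, drop_last=True)

# With shuffling, which patches end up in the dropped batch depends on the seed.
shuffled = make_batches(locations, batch_size=4, drop_last=True, shuffle=True, seed=1)
```

Changing the seed per epoch would vary which patches fall into the incomplete batch, so the same patches are not always the ones left out.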
This feature is available in v22.02.00: https://github.com/rapidsai/cucim/wiki/release_notes_v22.02.00
Summary
Issue
cuCIM's read_region doesn't use multithreading; whether to use threads is left to the user. In addition, batch processing is needed to get better performance with nvJPEG/GDS.
The following requirements are needed:
Note that batch processing would be supported only for a single (large) image with the list of start locations of the patch, not for a series of images.
Supporting this feature would help AI model training/inferencing workflows and integration with cuNumeric (e.g., loading a whole slide image into a single NumPy Array on GPU) in the future.
Related issues:
Decision
Example API Usages
The following parameters would be added to the read_region method:
num_workers: number of workers (threads) to use for loading the image. (default: 1)
batch_size: number of images to load at once. (default: 1)
drop_last: whether to drop the last batch if the number of images is not divisible by the batch size. (default: False)
prefetch_factor: number of samples loaded in advance by each worker. (default: 2)
shuffle: whether to shuffle the input locations. (default: False)
seed: seed value for random value generation. (default: 0)
Loading entire image by using multithreads
Loading batched image using multithreads
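As a hedged sketch of preparing input for batched loading, the snippet below builds a grid of patch start locations and computes how many batches to expect. `patch_grid` and `num_batches` are hypothetical helpers, the image dimensions are made-up values, and the `read_region` call is shown only as a comment, copied from the example earlier in this thread.

```python
import math


def patch_grid(width, height, patch_size):
    """Start locations (x, y) of patches tiling the image, row by row."""
    return [
        (sx, sy)
        for sy in range(0, height, patch_size)
        for sx in range(0, width, patch_size)
    ]


def num_batches(num_locations, batch_size, drop_last=False):
    """Number of batches an iterator over the locations would yield."""
    if drop_last:
        return num_locations // batch_size
    return math.ceil(num_locations / batch_size)


# Assumed example dimensions; a real image would report its own size.
locations = patch_grid(width=1000, height=500, patch_size=224)

total = num_batches(len(locations), batch_size=4)
total_drop = num_batches(len(locations), batch_size=4, drop_last=True)

# The call from the discussion above would then look like:
# region = img.read_region(locations, (224, 224), batch_size=4, num_workers=8)
# for batch in region:
#     ...
```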
You can feed locations of the region through a list/tuple of locations or through a NumPy array of locations (e.g., ((<x for loc 1>, <y for loc 1>), (<x for loc 2>, <y for loc 2>))). Each element in a location should be of int type (int64), and the dimension of the location should be equal to the dimension of the size. You can feed any iterator of locations (the dimensions of the input don't matter; each item in the iterator is flattened once if the item is also an iterator).
For example, you can feed the following iterators:
[0, 0, 100, 0] or (0, 0, 100, 0) would be interpreted as a list of (0, 0) and (100, 0).
((sx, sy) for sy in range(0, height, patch_size) for sx in range(0, width, patch_size)) would iterate over the locations of the patches.
[(0, 100), (0, 200)] would be interpreted as a list of (0, 100) and (0, 200).
np.array(((0, 100), (0, 200))) or np.array((0, 100, 0, 200)) would also be accepted, and using a NumPy array object would be faster than using a Python list/tuple.
Loading image using nvJPEG and cuFile (GDS, GPUDirect Storage)
If the cuda argument is specified in the device parameter of the read_region() method, it uses nvJPEG with GPUDirect Storage to load images. Use CuPy instead of NumPy in this case; the image cache (CuImage.cache) would not be used.
Implementation Changes
cuda.
Status
Done
Details
3rd party dependencies
TaskFlow
I tried using concurrentqueue to implement a thread pool, but its performance was worse than TaskFlow's (in my simple test, concurrentqueue took about 3.3 to 3.8, while TaskFlow consistently took about 3.2).
CC: @drbeh @JHancox @Jlefman @rahul-imaging