DRISE GPU optimizations for faster execution time during mask generation
This PR optimizes the mask generation code to reduce DRISE explanation runtime. The optimization applies at the per-instance level of explanations.
The main optimization is to generate the original mask directly on the GPU device, when one is available, instead of generating it on the CPU and then moving it to the GPU.
Attempts were also made to optimize the saliency_fusion step, where the mask affinity records are used to calculate the saliency maps. This step is much trickier: the GPU can easily run out of memory for large images, and the code must avoid moving matrices back and forth between CPU and GPU devices, since those transfers carry a large execution-time cost.
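As a hedged illustration of the device-side mask generation described above (the function name, mask shapes, and masking probability p are assumptions for this sketch, not the repository's actual API):

```python
import torch

def generate_masks(num_masks, height, width, p=0.5, device=None):
    # Hypothetical sketch of the optimization: allocate the random masks
    # directly on the target device, rather than building them on CPU
    # and paying a host-to-device copy afterwards.
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Before: masks = (torch.rand(N, H, W) < p).float() on CPU, then .to(device)
    # After: pass device= at creation so no transfer is needed
    masks = (torch.rand(num_masks, height, width, device=device) < p).float()
    return masks

masks = generate_masks(8, 64, 64)
```

The saving comes purely from skipping the CPU-to-GPU copy of the mask tensors; the masks themselves are unchanged.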
Note that the images in this notebook were resized to be 10x larger with the following code (to make this more comparable to a customer scenario with very large images):
import os
from PIL import Image

# data, labels and ImageColumns come from earlier cells in the notebook
for i, file in enumerate(os.listdir("./data/odFridgeObjects/images")):
    image_path = "./data/odFridgeObjects/images/" + file
    img = Image.open(image_path)
    # scale both dimensions by 10x, preserving the aspect ratio
    basewidth = img.width * 10
    wpercent = basewidth / float(img.size[0])
    hsize = int(float(img.size[1]) * wpercent)
    img = img.resize((basewidth, hsize), Image.Resampling.LANCZOS)
    img.save(image_path)
    data = data.append({ImageColumns.IMAGE.value: image_path,
                        ImageColumns.LABEL.value: labels[i]},  # folder
                       ignore_index=True)
    if i > 3:
        break
Also, the notebook was modified to move the model to GPU and use a GPU device:
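A minimal sketch of that change, assuming a standard PyTorch model (the nn.Linear below is a stand-in for the actual detection model, which is not shown here):

```python
import torch
import torch.nn as nn

# Pick the GPU when one is available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2)   # stand-in for the detection model
model = model.to(device)  # move the model's parameters to the device

# Inputs must live on the same device as the model
x = torch.randn(1, 4, device=device)
out = model(x)
```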
Error Analysis
Current Status: Generating error analysis reports.
Current Status: Finished generating error analysis reports.
Time taken: 0.0 min 0.001796500000011747 sec
In the comparison, the mask generation and prediction step for the two images takes:
image1: 46 seconds
image2: 43 seconds
Without the optimizations it took about 2x longer:
image1: 1 min 21 seconds
image2: 1 min 22 seconds
The saliency_fusion step, which was not optimized, takes about the same time with and without optimizations:
With optimizations:
image1: 1 min 49 seconds
image2: 1 min 47 seconds
Without optimizations (no change is expected, since this step was not GPU-optimized due to OOM risk):
image1: 1 min 57 seconds
image2: 1 min 30 seconds
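For context, one possible way to bound GPU memory in a fusion-style step is to accumulate over the masks in chunks while keeping every tensor on a single device; this is a hypothetical sketch (fuse_saliency, masks, and weights are illustrative names, not the repository's API):

```python
import torch

def fuse_saliency(masks, weights, chunk_size=16):
    # Hypothetical chunked fusion: weight each mask and accumulate a chunk
    # at a time, so peak memory is O(chunk_size * H * W) instead of
    # O(num_masks * H * W). All tensors stay on masks.device, avoiding
    # the costly host<->device transfers noted above.
    saliency = torch.zeros(masks.shape[1:], device=masks.device)
    for start in range(0, masks.shape[0], chunk_size):
        chunk = masks[start:start + chunk_size]
        w = weights[start:start + chunk_size].view(-1, 1, 1)
        saliency += (chunk * w).sum(dim=0)
    return saliency

masks = torch.ones(4, 2, 2)
weights = torch.tensor([1.0, 2.0, 3.0, 4.0])
saliency = fuse_saliency(masks, weights, chunk_size=2)
```

Smaller chunk sizes trade a little extra kernel-launch overhead for a lower peak-memory footprint, which is the relevant constraint for very large images.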
Description
The optimizations were measured on the fridge object detection notebook locally: https://github.com/microsoft/responsible-ai-toolbox/blob/main/notebooks/responsibleaidashboard/responsibleaidashboard-fridge-object-detection-model-debugging.ipynb
Run with optimizations for PR during rai_insights.compute() step for two images:
100%|██████████| 100/100 [00:46<00:00, 2.15it/s]  (image1 mask generation and prediction)
100%|██████████|  99/99  [01:49<00:00, 1.10s/it]  (image1 saliency fusion)
100%|██████████| 100/100 [00:43<00:00, 2.30it/s]  (image2 mask generation and prediction)
100%|██████████|  99/99  [01:47<00:00, 1.09s/it]  (image2 saliency fusion)
Error Analysis
Current Status: Generating error analysis reports.
Current Status: Finished generating error analysis reports.
Time taken: 0.0 min 0.0055526000000440945 sec
As comparison, run without the optimizations:
100%|██████████| 100/100 [01:21<00:00, 1.23it/s]  (image1 mask generation and prediction)
100%|██████████|  99/99  [01:57<00:00, 1.19s/it]  (image1 saliency fusion)
100%|██████████| 100/100 [01:22<00:00, 1.21it/s]  (image2 mask generation and prediction)
100%|██████████|  99/99  [01:30<00:00, 1.09it/s]  (image2 saliency fusion)
Error Analysis
Current Status: Generating error analysis reports.
Current Status: Finished generating error analysis reports.
Time taken: 0.0 min 0.001796500000011747 sec
Checklist