ultralytics / yolov5

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite
https://docs.ultralytics.com
GNU Affero General Public License v3.0

Constant time with AutoShape #8496

Open marcimarc1 opened 2 years ago

marcimarc1 commented 2 years ago

Search before asking

Question

Hi,

I wanted to process 4K Images with yolov5s and did some experiments regarding resolution and size.

I realized I get the best results from my network with inputs of size 640x640, so I split the 4K images into tiles of that size.

For inference, I now did some experiments and I am not even close to 30 FPS when processing with AutoShape.

This is probably due to the preprocessing in AutoShape, yet I cannot find a way around it (neither with multiprocessing/threading nor anything else I can currently think of). I am quite sure the problem is in AutoShape and wanted to ask if you have any idea how to accelerate it.

My current experiments show that if you are passing a torch tensor through the network, inference time seems to be more constant.

```python
import torch
import numpy
import time

model = torch.hub.load('ultralytics/yolov5', "yolov5s", classes=10)
model.eval()

input_image = numpy.random.rand(3840, 2160, 3)


def split_images(img):
    dh, dw, _ = img.shape
    image_splits = []
    height, width = 640, 640
    for i in range(dh // height):
        for j in range(dw // width):
            image_splits.append((j * width, i * height, (j + 1) * width, (i + 1) * height))
    return image_splits


splits = split_images(input_image)
warmup = model([input_image])

t = time.time()
model([input_image])
print(f"Time for completing Single Frame: {time.time()-t}")

cropped = []
for i in splits:
    cropped.append(input_image[i[1]:i[3], i[0]:i[2]])

t = time.time()
model(cropped)
print(f"Time for completing 18 Cropped Frames: {time.time()-t}")

inp = torch.rand(18, 3, 640, 640)
t = time.time()
model(inp)
print(f"Time for completing 18 Torch Frames: {time.time()-t}")
```

Result:

```
...
Adding AutoShape...
Time for completing Single Frame: 1.4259703159332275
Time for completing 18 Cropped Frames: 0.5461986064910889
Time for completing 18 Torch Frames: 0.027060985565185547
```

Additional

No response

github-actions[bot] commented 2 years ago

👋 Hello @marcimarc1, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email support@ultralytics.com.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

(environment badges: notebooks, cloud VMs, Docker image)

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher commented 2 years ago

@marcimarc1 torch inputs to AutoShape are not pre or post processed, i.e. this mode is only suitable for use in detect.py, val.py, train.py etc where the preprocessing and postprocessing (NMS) are handled externally.
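
For reference, a minimal sketch of what "handled externally" means for a torch-tensor input, assuming the script runs from inside a yolov5 clone so that utils.general is importable (tensor shapes and thresholds here are illustrative, not from this thread):

```python
import torch

from utils.general import non_max_suppression  # yolov5 repo utility

model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
model.eval()

im = torch.rand(18, 3, 640, 640)  # already letterboxed, normalized BCHW batch in [0, 1]
with torch.no_grad():
    pred = model(im)  # tensor input -> raw forward pass, no pre- or post-processing
if isinstance(pred, (list, tuple)):  # some versions return (inference_out, ...)
    pred = pred[0]
det = non_max_suppression(pred, conf_thres=0.25, iou_thres=0.45)  # list of (n, 6) tensors
# each row of det[i] is (x1, y1, x2, y2, conf, cls) in the 640x640 input coordinates
```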

marcimarc1 commented 2 years ago

Ok, this is good to know, thank you @glenn-jocher
Is there a way to accelerate batched NMS? I further investigated it, and identified it as the bottleneck. Also making the numpy arrays contiguous is necessary for fast inference in the model itself, but takes another 0.01s :/

I could save another 0.04 s by removing the preprocessing steps that are unnecessary in my case. Current experiment status:

```
Time for completing 18 Autoshape Frames: 0.2034602165222168
preprocessing: 0.03889822959899902
inference: 0.018947839736938477
NMS: 0.10771417617797852
Time for completing 18 Cropped Frames: 0.16855311393737793
```

I would appreciate some suggestions.
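
A small sketch of the contiguity step mentioned above (synthetic uint8 crops standing in for the 18 tiles; the ascontiguousarray copy is where the extra ~0.01 s goes):

```python
import numpy as np
import torch

crops = np.stack([np.random.randint(0, 255, (640, 640, 3), dtype=np.uint8) for _ in range(18)])  # BHWC
x = crops.transpose(0, 3, 1, 2)        # BCHW view of the same buffer, no longer C-contiguous
print(x.flags['C_CONTIGUOUS'])         # False
x = np.ascontiguousarray(x)            # copy into contiguous memory before handing off to torch
batch = torch.from_numpy(x).float() / 255  # ready for the model's forward pass
```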

glenn-jocher commented 2 years ago

@marcimarc1 AutoShape is already optimized for standard use cases, i.e. local/remote COCO images. If you find room for improvement please submit a PR to help all other YOLOv5 users. Standard outputs should look like this when loading directly from 640-size *.jpg images. NMS will be about 1-2ms per image, and scales with instance count.

(screenshot: standard AutoShape profiling output at 640 input size)

marcimarc1 commented 2 years ago

@glenn-jocher My optimization is only useful in this case because the input is so consistent. I therefore got rid of the loop in AutoShape's forward function. Currently my inference looks like this (a slimmed-down version of the forward function):

```python
import time

import numpy as np
import torch

from utils.general import non_max_suppression


def bare_inference(cropped, model):
    start = time.time()
    pt = True
    p = next(model.parameters()) if pt else torch.zeros(1, device=model.device)  # for device, type
    x = np.ascontiguousarray(np.array(cropped).transpose((0, 3, 1, 2)))  # stack and BHWC to BCHW
    x = torch.from_numpy(x).to(p.device).type_as(p) / 255  # uint8 to fp16/32
    print("preprocessing: ", time.time() - start)

    # Inference
    start = time.time()
    y = model(x)  # forward
    print("inference: ", time.time() - start)

    # Post-process
    start = time.time()
    y = non_max_suppression(y, max_det=300)  # NMS
    print("NMS: ", time.time() - start)
    return y
```

I am currently trying to fully understand the NMS code to save some time there, too. If you have any tips, I would appreciate them.

glenn-jocher commented 2 years ago

@marcimarc1 got it. Yes, if you have a narrowly defined use case you should be able to make some optimizations, at the cost of less flexibility.

I'd recommend something like Spyder lineprofiler to find hotspots. The core function inside the YOLOv5 NMS function is just the torch nms function here. This probably uses the most time by itself. https://github.com/ultralytics/yolov5/blob/526e650553819dbff67897b9c752c4072e989823/utils/general.py#L867
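
One way to get that per-line breakdown, a sketch using the line_profiler package on a synthetic prediction tensor (pip install line_profiler; assumes a yolov5 clone for the import, and the (1, 25200, 85) shape of a stock YOLOv5s 640 output):

```python
import torch
from line_profiler import LineProfiler

from utils.general import non_max_suppression

pred = torch.rand(1, 25200, 85)  # fake raw YOLOv5s output: 25200 anchors x (xywh, obj, 80 classes)
lp = LineProfiler()
profiled_nms = lp(non_max_suppression)  # wrap the function to record per-line timings
profiled_nms(pred, conf_thres=0.25, iou_thres=0.45)
lp.print_stats()  # per-line hit counts and times, e.g. for the torchvision.ops.nms call
```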

marcimarc1 commented 2 years ago

For future reference: I tried batched_nms to accelerate NMS over multiple images (hence saving a loop over the proposals). It resulted in an 18% acceleration in inference. I am grateful for any more suggestions. A boiled-down NMS for the YOLO proposals can look like this:

```python
bs = prediction.shape[0]  # batch size
nc = prediction.shape[2] - 5  # number of classes
xc = prediction[..., 4] > conf_thres  # candidates
output = [torch.zeros((0, 6 + nc), device=prediction.device)] * bs
x = prediction[xc]
idx = torch.where(xc == True)[0]
x[:, 5:] *= x[:, 4:5]
box = xywh2xyxy(x[:, :4])
conf, j = x[:, 5:].max(1, keepdim=True)
x = torch.cat((box, conf, j.float(), x[:, 5:]), 1)[conf.view(-1) > conf_thres]
idx = idx[conf.view(-1) > conf_thres]
c = x[:, 5:6] * (0 if agnostic else max_wh)  # classes
boxes, scores = x[:, :4] + c, x[:, 4]  # boxes (offset by class), scores
i = torchvision.ops.batched_nms(boxes, scores, idx, iou_thres)  # NMS
for j in range(bs):
    output[j] = x[i][idx[i] == j]
return output
```
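
For context, torchvision.ops.batched_nms keeps boxes with different idxs values from suppressing one another, which is conceptually the same offset trick the stock YOLOv5 NMS uses for classes (the `c = x[:, 5:6] * max_wh` line above). A toy sketch of that behaviour, with made-up boxes:

```python
import torch
import torchvision

boxes = torch.tensor([[0., 0., 10., 10.],
                      [1., 1., 11., 11.]])  # heavily overlapping pair (IoU ~ 0.68)
scores = torch.tensor([0.9, 0.8])
idxs = torch.tensor([0, 1])  # the two boxes belong to different groups (e.g. different crops)

print(torchvision.ops.nms(boxes, scores, iou_threshold=0.5))                # tensor([0]): second box suppressed
print(torchvision.ops.batched_nms(boxes, scores, idxs, iou_threshold=0.5))  # tensor([0, 1]): both kept
```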

If there are any more ideas regarding this, I would like to contribute to further discussions :).

Any comments on torch multiprocessing? Would it make any sense to use it here? @glenn-jocher

glenn-jocher commented 2 years ago

@marcimarc1 oh interesting. In our tests (val.py on 5000 COCO val images) batched_nms produced worse performance. I'll add a TODO to revisit this.

glenn-jocher commented 2 years ago

@marcimarc1 see previous profiling results with torchvision.ops.nms vs torchvision.ops.batched_nms in https://github.com/ultralytics/yolov5/issues/5261#issuecomment-958302136

(screenshot: profiling results from issue #5261)
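
For anyone wanting to reproduce that comparison locally, a synthetic micro-benchmark sketch (made-up box counts and thresholds, not the COCO val.py setup from the linked issue):

```python
import time

import torch
import torchvision

boxes = torch.rand(5000, 4) * 640
boxes[:, 2:] += boxes[:, :2]  # make valid xyxy boxes
scores = torch.rand(5000)
idxs = torch.randint(0, 80, (5000,))  # class (or image) index per box

t = time.time()
keep_loop = [torchvision.ops.nms(boxes[idxs == c], scores[idxs == c], 0.45) for c in idxs.unique()]
print("looped nms:  ", time.time() - t)  # kept indices are subset-relative; fine for timing only

t = time.time()
keep_batched = torchvision.ops.batched_nms(boxes, scores, idxs, 0.45)
print("batched_nms: ", time.time() - t)
```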