imankgoyal / NonDeepNetworks

Official Code for "Non-deep Networks"
BSD 3-Clause "New" or "Revised" License

How does the speed comparison look at batch size 64 or input resolution up to 800w? #1

Open lucasjinreal opened 3 years ago

lucasjinreal commented 3 years ago

How does the speed comparison look at batch size 64 or input resolution up to 800w?

AlexeyAB commented 3 years ago
lucasjinreal commented 3 years ago

@AlexeyAB thanks for your reply.

AlexeyAB commented 3 years ago

@jinfagang Thanks for good questions!

Currently, deep networks with a higher batch size and resolution have higher FPS than non-deep networks, but a higher batch size doesn't reduce latency. Real-time systems like self-driving cars and robots require exactly low latency, not high FPS.

For both questions, yes: parallelism will quickly hit an upper bound, but the newer the GPU, the more cores it has and the higher that bound shifts, so deep networks will need an ever higher batch size and resolution to outperform non-deep networks. At some point, on future GPUs, resolution 3048x2333 in a non-deep network will be faster than 2048x1980 in a deep one, and you will need a higher and higher batch size for deep networks.
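The latency argument above can be sketched with a toy model: if a GPU has enough spare cores, a layer's width is roughly free, but sequential layers cannot overlap, so latency is dominated by depth. The per-layer time and the depths below are hypothetical illustrative numbers, not measurements.

```python
# Toy model: on a sufficiently parallel device, latency is driven by the
# number of *sequential* layers, not by total compute.
# `layer_time_ms` is an assumed constant per-layer cost (hypothetical).

def latency_ms(num_sequential_layers: int, layer_time_ms: float = 0.5) -> float:
    # Sequential layers cannot run concurrently, so their times add up.
    return num_sequential_layers * layer_time_ms

deep_latency = latency_ms(101)     # e.g. a ResNet-101-like depth (assumed)
nondeep_latency = latency_ms(12)   # e.g. a ParNet-like depth (assumed)
print(deep_latency, nondeep_latency)  # 50.5 vs 6.0 ms
```

The sketch also shows why more cores shift the crossover: a wider non-deep network only wins while the device still has idle cores to absorb the extra width.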

400 MB (non-deep) vs 100 MB (deep) for params isn't a big issue: this 300 MB difference is ~10% of a Jetson Nano's 4 GB and ~1% of an RTX 3090's 24 GB, which doesn't even let us make the batch size 2x larger for deep networks than for non-deep ones. And the higher the batch size, the more GPU memory we need for layer outputs, while the memory needed for params stays the same.
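The memory arithmetic behind this can be made explicit. The param sizes come from the comment above; the per-sample activation size is a hypothetical placeholder, since it depends on the architecture and resolution.

```python
# Rough GPU-memory estimate: parameter memory is paid once, while
# activation (layer-output) memory scales with the batch size.

def total_memory_mb(param_mb: float, activation_mb_per_sample: float,
                    batch_size: int) -> float:
    return param_mb + activation_mb_per_sample * batch_size

# 400 MB (non-deep) vs 100 MB (deep) params; 50 MB/sample is an assumption.
for batch in (1, 32):
    nondeep = total_memory_mb(400, 50, batch)
    deep = total_memory_mb(100, 50, batch)
    print(f"batch={batch}: non-deep {nondeep} MB, deep {deep} MB")
```

At batch 32 the activation memory (1600 MB here) dwarfs the 300 MB param gap, which is the point being made: the param difference never buys deep networks a meaningfully larger batch.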

lucasjinreal commented 3 years ago

@AlexeyAB thank you, wider models might be a very good direction to explore. I still wonder why you said this:

require exactly low Latency, not high FPS

Isn't low latency (not params, not FLOPs, not MACs) the final way to compare model speed? I thought it was the same as FPS. It is good for comparing 2 models on the same device, since some models have fewer FLOPs but are actually slower (maybe not optimized, not suitable for parallelism, or needing more MACs), so final run time is the golden rule for judging a model's speed (on the same device). Aren't low latency and FPS just the same thing?

AlexeyAB commented 3 years ago

@jinfagang I think that Latency_batch1 (and equivalently FPS_batch1) is the most important metric for comparing the speed of neural networks, because Latency = 1000 ms / FPS holds only for batch=1.

For example, YOLOv5 reports a latency of 20 ms for batch=32, while the actual latency is 20 ms * 32 samples = 640 ms, plus ~1000 ms to collect 32 frames from a video camera in a real project. They process a batch of 32 in 640 ms, divide that time by 32, and report 20 ms, which is not the true latency.
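The gap between the reported and the true latency can be written out directly with the numbers from this example (the ~1000 ms frame-collection time is the rough figure quoted above, e.g. 32 frames at ~30 FPS):

```python
# Batched "latency" vs the latency a frame actually experiences,
# using the numbers from the YOLOv5 example above.

batch_size = 32
batch_time_ms = 640.0    # wall-clock time to process one batch of 32

# What gets reported: batch time divided by batch size.
reported_latency_ms = batch_time_ms / batch_size

# What a real-time system sees: wait to collect 32 frames, then run the batch.
frame_collection_ms = 1000.0  # ~1 s to gather 32 frames from a camera (approx.)
true_latency_ms = frame_collection_ms + batch_time_ms

print(reported_latency_ms, true_latency_ms)  # 20.0 vs 1640.0 ms
```

So the per-sample number is a throughput figure in disguise; the end-to-end delay is nearly two orders of magnitude larger.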

For comparing models on current devices there are several metrics that matter, but a model should be tested on at least 3 devices: mobile GPU/NPU, embedded GPU, and high-end GPU.

For the end user, only Accuracy/FPS or Accuracy/Latency_batch1 matters: if a model has 2x fewer FLOPs but is 2x slower, the user will not use it.
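The "measure wall-clock time, not FLOPs" rule from this thread can be sketched as a minimal benchmarking helper. The two workloads below are hypothetical stand-ins for models, chosen only to show the measurement pattern (warm-up runs, then the median of repeated timings); on a real GPU you would also need to synchronize the device before reading the clock.

```python
# Minimal wall-clock latency measurement: warm up, then take the median
# of several timed runs to reduce noise. CPU-only sketch; on GPU you must
# synchronize (e.g. torch.cuda.synchronize()) around the timed region.
import time

def measure_latency_ms(fn, warmup: int = 3, runs: int = 10) -> float:
    for _ in range(warmup):          # warm caches / lazy initialization
        fn()
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1000.0)
    times.sort()
    return times[len(times) // 2]    # median is robust to outliers

# Stand-in "models" (hypothetical): fewer FLOPs does not guarantee less time.
model_a = lambda: sum(i * i for i in range(200_000))
model_b = lambda: sum(range(100_000))
print(measure_latency_ms(model_a), measure_latency_ms(model_b))
```

Comparing the two printed numbers, rather than operation counts, is exactly the "golden rule" discussed above: only measured run time on the target device decides which model is faster.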