paperswithcode / torchbench

Easily benchmark machine learning models in PyTorch
Apache License 2.0

'Speed' measurement possibly misleading for image classification #15

Open mehtadushy opened 4 years ago

mehtadushy commented 4 years ago

If I am correct in assuming that the time measured for evaluating the whole dataset (https://github.com/paperswithcode/torchbench/blob/master/torchbench/image_classification/utils.py#L75) is what is used to compute the speed on the leaderboard, then I would like to point out several issues with this:

1) The disk read time is included in the measurement, so models that are fast enough that the disk cannot keep up would report an unfairly low speed.

2) If consistent disk speeds are not ensured between runs (because some other process happened to be accessing the same disk), this further compounds (1), and the evaluated speed would not be the same between runs.

I believe that the speed measurement should be done on a chunk of preloaded dummy data, with a note on the leaderboard saying that the actual speeds people can get in practice would depend on the rate at which the model can be fed.
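For concreteness, something along these lines is what I have in mind. This is only a minimal sketch, not a patch against torchbench; the model, batch size, input shape and helper name are placeholder assumptions:

```python
import time
import torch

def inference_throughput(model, batch_size=128, num_batches=50,
                         input_size=(3, 224, 224), device="cuda"):
    """Rough throughput estimate on preloaded dummy data (no disk I/O)."""
    model = model.to(device).eval()
    # Pre-generate the batches so data loading cannot become the bottleneck.
    batches = [torch.randn(batch_size, *input_size, device=device)
               for _ in range(num_batches)]

    with torch.no_grad():
        # Warm-up to exclude one-off costs (CUDA context, cuDNN autotuning).
        for _ in range(5):
            model(batches[0])
        torch.cuda.synchronize()

        start = time.time()
        for x in batches:
            model(x)
        torch.cuda.synchronize()  # wait for all kernels before stopping the clock
        elapsed = time.time() - start

    return num_batches * batch_size / elapsed  # images per second
```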

mehtadushy commented 4 years ago

In my opinion this is a serious enough bug (if I am correct in my assumptions) to warrant putting a note on the leaderboard saying that the speed values should not be trusted until this is resolved. I only checked for image classification, but perhaps such an issue exists for other tasks too.

mehtadushy commented 4 years ago

And to add one more point to this: the intention of the 'speed' column is to indicate throughput, but even when disk read time is factored out, throughput has a non-trivial dependence on batch size.
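To illustrate the batch-size dependence, a rough sweep could look like the following. This is only a sketch reusing the hypothetical `inference_throughput` helper from my previous comment, and the batch sizes are arbitrary:

```python
# Sweep batch sizes and report throughput for each, stopping if the GPU runs out of memory.
for batch_size in (1, 8, 32, 64, 128, 256):
    try:
        ips = inference_throughput(model, batch_size=batch_size)
        print(f"batch_size={batch_size:4d}  ->  {ips:8.1f} images/s")
    except RuntimeError as err:  # e.g. CUDA out of memory for large batches
        print(f"batch_size={batch_size:4d}  ->  failed ({err})")
        break
```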

rstojnic commented 4 years ago

Thank you, as always, for the very thorough and thoughtful comments, Dushyant!

To address your individual points:

1) You are absolutely correct that reading the images from disk is included in the measurement. This is indeed problematic for very fast models and introduces an upper limit on the maximum measurable speed. However, looking at the graph at https://sotabench.com/benchmarks/image-classification-on-imagenet, this doesn't appear to affect the models we have benchmarked so far: if we were hitting that "ceiling", many models would form a near-vertical line on the graph, which we don't observe. So although the reported speed is data loading plus inference, we still feel it is useful, at least for relative ranking of methods. I would welcome a PR that solves this problem for future super-fast models, though!

2) Consistency of disk speed between runs. On sotabench.com we run each benchmark on a dedicated, freshly spun-up, isolated machine, so in our experience there isn't much variation in that setting. For example, running a bog-standard ResNeXt-101-32x8d model three times yields speeds of 288.4, 286.0 and 288.2. So although there is some variability, it appears to be only around 1-2%. Given that the range of speeds we see spans roughly 6 to 500, we didn't feel this was significant. Furthermore, the size of the green dot in that graph probably captures a rough estimate of this variability.

3) Batch size. You are absolutely correct: changing the batch size will drastically change the speed. Indeed, some of the "small" models get their improvements by enabling larger batch sizes, and we assume authors have optimized the batch size to make maximal use of GPU memory. Alternatively, one could send a single image at a time, which would make the benchmarks run longer but remove the effect of batching. However, since many people do use these models in batches, it's not clear to me which is better.
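For reference, the run-to-run spread of those three measurements works out as follows; this is just arithmetic on the numbers quoted above, not additional benchmark data:

```python
speeds = [288.4, 286.0, 288.2]
mean = sum(speeds) / len(speeds)             # ~287.5 images/s
spread = (max(speeds) - min(speeds)) / mean  # ~0.8% peak-to-peak over these three runs
print(f"mean={mean:.1f} images/s, relative spread={spread:.1%}")
```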

To add a few more caveats here:

1) We are testing everything on a specific machine, with a specific GPU and GPU memory size. It's possible that on a different GPU type you would get different results.

2) I said we assume the batch size has been optimized by the author, but that is sometimes not the case, and we don't have a way of checking whether they've done this.

3) For repositories that test multiple models it's unclear how the order influences performance, because some of the disk content might get cached differently after being repeatedly accessed. The same might hold for the GPU cache.

4) The same model might be implemented in various frameworks, in various versions of those frameworks, and with various subtle differences. Each of these might make a difference, so one needs to be careful about generalizing the results to "models" as abstract entities and should instead only talk about a model implemented in a specific way, in a specific environment, under a specific use case: in our case, batch-reading ImageNet from disk, on specific hardware with a specific set of dependencies.

mehtadushy commented 4 years ago

Hi Robert,

Thanks for the detailed response.

I can work on a patch that also reports the loading + inference breakdown, and perhaps code to try different batch sizes and report the maximum speed obtained.
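As a rough sketch of what the breakdown could look like inside the evaluation loop (this is not the existing torchbench code, just an illustration of the idea; the function and argument names are placeholders):

```python
import time
import torch

def evaluate_with_breakdown(model, data_loader, device="cuda"):
    """Illustrative only: time data loading and inference separately."""
    model = model.to(device).eval()
    load_time = 0.0
    infer_time = 0.0

    with torch.no_grad():
        t0 = time.time()
        for images, _ in data_loader:
            # Everything up to here (disk read, decode, transform, batching)
            # is counted as loading time.
            t1 = time.time()
            load_time += t1 - t0

            images = images.to(device, non_blocking=True)
            model(images)
            torch.cuda.synchronize()  # make sure inference has actually finished
            t0 = time.time()
            infer_time += t0 - t1

    return load_time, infer_time
```

With prefetching DataLoader workers the "loading" number only captures the time the main process actually waits, but that is arguably the quantity of interest here.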

Also, it may be helpful to list the caveats you mention here on the website as well, along with information about which GPU the inference runs on.

Best,