mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks
https://mlcommons.org/en/groups/inference
Apache License 2.0

ImageNet accuracy 73.87% (should be 76.47%) #201

Closed profvjreddi closed 5 years ago

profvjreddi commented 5 years ago

For our image classification models, we are not hitting the reported ImageNet accuracy of 76.47%. We are at around 73.87%, per our image classification and object detection README.

Possible reasons include pre-processing, or ... something else entirely.

@itayhubara and @guschmue @parvizp

sjain-stanford commented 5 years ago

I ran validation on MLPerf's ResNet-50 frozen model (TensorFlow) from here, and the accuracy seems alright (~76.5%). This is the pre-processing I used.

Val:    [   100/   782] Time   0.084 (  0.113)  Prec@1  73.438 ( 76.795)        Prec@5  93.750 ( 93.007)
Val:    [   200/   782] Time   0.202 (  0.109)  Prec@1  81.250 ( 77.029)        Prec@5  95.312 ( 93.377)
Val:    [   300/   782] Time   0.198 (  0.136)  Prec@1  79.688 ( 76.905)        Prec@5  93.750 ( 93.304)
Val:    [   400/   782] Time   0.181 (  0.152)  Prec@1  71.875 ( 76.870)        Prec@5  92.188 ( 93.173)
Val:    [   500/   782] Time   0.151 (  0.160)  Prec@1  78.125 ( 76.771)        Prec@5  93.750 ( 93.083)
Val:    [   600/   782] Time   0.198 (  0.166)  Prec@1  75.000 ( 76.583)        Prec@5  92.188 ( 93.051)
Val:    [   700/   782] Time   0.339 (  0.168)  Prec@1  82.812 ( 76.516)        Prec@5  92.188 ( 93.039)
Val:    [   781/   782] Time   0.265 (  0.168)  Prec@1  68.750 ( 76.522)        Prec@5  87.500 ( 93.066)
model_dir=/data/resnet50_v1.pb prec@1=76.522 prec@5=93.066
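
For reference, a rough sketch of what that VGG-style eval preprocessing does, assuming TF 1.x: resize the shorter side to 256, central-crop to 224x224, and subtract the per-channel means. The function name and constants here are illustrative, not the exact code from the linked script:

    import tensorflow as tf

    _R_MEAN, _G_MEAN, _B_MEAN = 123.68, 116.78, 103.94  # ImageNet channel means (RGB)

    def vgg_preprocess_for_eval(image, out_h=224, out_w=224, resize_min=256):
        # Scale the shorter side to resize_min, preserving the aspect ratio.
        shape = tf.shape(image)
        h, w = tf.to_float(shape[0]), tf.to_float(shape[1])
        scale = resize_min / tf.minimum(h, w)
        new_size = tf.stack([tf.to_int32(tf.round(h * scale)),
                             tf.to_int32(tf.round(w * scale))])
        image = tf.squeeze(tf.image.resize_bilinear(
            tf.expand_dims(tf.to_float(image), 0), new_size), [0])
        # Central crop to the network input size, then subtract the means.
        image = tf.image.resize_image_with_crop_or_pad(image, out_h, out_w)
        return image - [[[_R_MEAN, _G_MEAN, _B_MEAN]]]
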
psyhtest commented 5 years ago

Wow, thanks for sharing! The only problem is that the resize code itself uses TF. If we want to use it on platforms where TF is not well supported (e.g. Arm-based dev boards), we should convert it to a script using standard Python image processing libraries.

itayhubara commented 5 years ago

According to Sambhav, the model is OK, so you probably need to change the pre-processing, specifically the way you resize the image. If I am not mistaken, all you should need is the following change: img.resize((w,h)) -> img.resize((w,h), PIL.Image.BILINEAR)

PIL's default resizing method is "nearest". This seems to work for me, at least on the default 500 images (72.6 -> 74). Please let me know if this works for you as well.

Best, Itay
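
For concreteness, a minimal sketch of the change Itay describes, assuming Pillow; the input file and target size are illustrative:

    from PIL import Image

    img = Image.open("example.jpg")  # illustrative input image
    w, h = 224, 224

    # PIL's default resample filter is NEAREST, which costs accuracy here:
    nearest = img.resize((w, h))

    # Explicit bilinear interpolation, as suggested above:
    bilinear = img.resize((w, h), Image.BILINEAR)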


profvjreddi commented 5 years ago

@guschmue

guschmue commented 5 years ago

It's the PIL.Image.BILINEAR issue; just fixing it.

itayhubara commented 5 years ago

Great!! I knew we could fix it :)


guschmue commented 5 years ago

74.97 for resnet50, should be 76.4 ... one more fix to find.

itayhubara commented 5 years ago

Sadly, I think this is the best we can do if we want to work with PIL. Apparently, TF is not consistent with PyTorch, PIL, and skimage when it comes to aligning corners (as described, for example, in https://github.com/pytorch/pytorch/issues/10604). Thus the best idea is to rewrite it in TF...

Itay
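
A small sketch of the inconsistency described above, assuming TF 1.x and Pillow; the tiny test image is illustrative:

    import numpy as np
    import tensorflow as tf
    from PIL import Image

    img = np.arange(48, dtype=np.uint8).reshape(4, 4, 3)  # tiny test image

    # PIL bilinear upsample to 8x8
    pil_out = np.asarray(
        Image.fromarray(img).resize((8, 8), Image.BILINEAR)).astype(np.float32)

    # TF 1.x bilinear upsample; align_corners (default False) changes
    # where the sample points land relative to the source grid
    with tf.Session() as sess:
        tf_out = sess.run(tf.image.resize_bilinear(
            img[None].astype(np.float32), (8, 8)))[0]

    # The two resizers generally disagree on the interpolated pixel values
    print(np.abs(pil_out - tf_out).max())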


guschmue commented 5 years ago

https://github.com/mlperf/inference/pull/206. I'll look some more.

ens-lg4 commented 5 years ago

Dear @sjain-stanford, could you please share the code snippet of how you are using vgg_preprocessing? Specifically, the call to the preprocessor (preprocess_image() or preprocess_for_eval()) with its parameters, and any other manipulations before or after that call, if you do them (normalizations, flips, channel swapping, etc.).

I am asking because, even with your confirmation, we are still struggling to reach the 76.47% accuracy, so we would like to reproduce your setup as closely as possible.

Also, where does the magic number 782 come from? Did you choose to test on that many examples? If so, why?

Thank you very much in advance!

sjain-stanford commented 5 years ago

Hey @ens-lg4

> Dear @sjain-stanford, could you please share the code snippet of how you are using vgg_preprocessing? Specifically, the call to the preprocessor (preprocess_image() or preprocess_for_eval()) with its parameters, and any other manipulations before or after that call, if you do them (normalizations, flips, channel swapping, etc.).

> I am asking because, even with your confirmation, we are still struggling to reach the 76.47% accuracy, so we would like to reproduce your setup as closely as possible.

Sure. I use this validation script and this imagenet preprocessing utils script.

Here is the list of modifications I made to get it working with the frozen ResNet-50 from MLPerf.

1. Rename resnet50_v1.pb to resnet50_v1_frozen.pb (this tells the script it's a frozen graph, so it won't try to load a checkpoint).

2. Add the input and output tensor names to the validation script after line 179, and comment out the ValueError:

        input = g.get_tensor_by_name("input_tensor:0")
        output = g.get_tensor_by_name("softmax_tensor:0")

3. Add the preprocessing fn to the utils script after line 576, and comment out the ValueError:

        image = vgg_preprocess_input_fn(image, image_size, image_size, is_training)

4. Run validation (assuming you have the ImageNet tf-records somewhere):

        python utils/validate_imagenet_tf.py --data_dir ../imagenet/tf-records/ --model_dir /data/resnet50_v1_frozen.pb

    Output:

    Val:    [     0/   782] Time   3.588 (  3.588)  Prec@1  81.250 ( 81.250)        Prec@5  95.312 ( 95.312)
    Val:    [   100/   782] Time   0.079 (  0.115)  Prec@1  73.438 ( 76.795)        Prec@5  93.750 ( 93.007)
    Val:    [   200/   782] Time   0.080 (  0.098)  Prec@1  81.250 ( 77.029)        Prec@5  95.312 ( 93.377)
    Val:    [   300/   782] Time   0.079 (  0.092)  Prec@1  79.688 ( 76.905)        Prec@5  93.750 ( 93.304)
    Val:    [   400/   782] Time   0.080 (  0.089)  Prec@1  71.875 ( 76.870)        Prec@5  92.188 ( 93.173)
    Val:    [   500/   782] Time   0.080 (  0.087)  Prec@1  78.125 ( 76.771)        Prec@5  93.750 ( 93.083)
    Val:    [   600/   782] Time   0.080 (  0.086)  Prec@1  75.000 ( 76.583)        Prec@5  92.188 ( 93.051)
    Val:    [   700/   782] Time   0.080 (  0.085)  Prec@1  82.812 ( 76.516)        Prec@5  92.188 ( 93.039)
    Val:    [   781/   782] Time   0.270 (  0.085)  Prec@1  68.750 ( 76.522)        Prec@5  87.500 ( 93.066)
    model_dir=/data/resnet50_v1_frozen.pb prec@1=76.522 prec@5=93.066

> Also, where does the magic number 782 come from? Did you choose to test on that many examples? If so, why?

I used a batch size of 64, so there are ~782 batches to go over the 50k validation images (ceil(50000 / 64) = 782).

ens-lg4 commented 5 years ago

@sjain-stanford , thank you very much for such a detailed answer!

psyhtest commented 5 years ago

Thanks @sjain-stanford! With that preprocessing, @ens-lg4 has now measured:

Accuracy top 1: 0.76522 (38261 of 50000)
Accuracy top 5: 0.93066 (46533 of 50000)

or exactly the same as you!

This is great news. However, the next big question is: which preprocessing should we use for MobileNet?

An even more fundamental question: if we can't use the same preprocessing even for the two image classification models in the Closed Division, how are we going to cope with Open Division submissions? If anybody can bring their own model and preprocessing, they may claim superior accuracy simply by overfitting the preprocessing, e.g. by resizing ImageNet images to 267x272 pixels and then cropping to 232x232.

ajarthurs commented 5 years ago

Thought I'd chime in and post my findings regarding the discrepancy in MLPerf's reported accuracy. My experiments are available in my MLPerf/inference fork.

In my first experiment, I replace MLPerf's entire PIL-based preprocessor with a TF-based preprocessor, which is mostly based on an official ResNet-ImageNet preprocessor. The result is 76.52% accuracy (see related commit).

TestScenario.SingleStream qps=5317.70, mean=0.0092, time=9.40, acc=76.52, queries=50000, tiles=50.0:0.0089,80.0:0.0099,90.0:0.0103,95.0:0.0106,99.0:0.0112,99.9:0.0130

In my second experiment, in light of the difference in corner alignment (or lack thereof) between PIL and TF pointed out by @itayhubara, I substitute in TF's image resizer (tf.compat.v1.image.resize; see related commit). The result is 75.28% accuracy, roughly 1% less than expected. I noticed that TF's resize output is floating-point (float32) and that PIL typecasts NumPy arrays to np.uint8. When PIL (Image.fromarray()) does this typecast, it truncates the floating-point channel values. I suspect that truncation is creating the 1% discrepancy mentioned above.

TestScenario.SingleStream qps=5251.05, mean=0.0093, time=9.52, acc=75.28, queries=50000, tiles=50.0:0.0091,80.0:0.0101,90.0:0.0105,95.0:0.0108,99.0:0.0113,99.9:0.0127
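
A minimal sketch of the float-to-uint8 truncation described above, using a plain NumPy cast as a stand-in for the PIL typecast; the values are illustrative:

    import numpy as np

    resized = np.array([127.9, 128.4], dtype=np.float32)  # float output of a resize

    truncated = resized.astype(np.uint8)          # plain uint8 cast truncates: [127 128]
    rounded = np.round(resized).astype(np.uint8)  # rounding first gives: [128 128]

    print(truncated, rounded)
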
guschmue commented 5 years ago

Cool, good to confirm. Now how do we wind ourselves out of this? I could replace PIL with TensorFlow, take a look at what OpenCV does, or take another look at PIL if there is no way of getting the same results as TensorFlow.

ajarthurs commented 5 years ago

@guschmue I wouldn't replace PIL with TF. As @psyhtest mentioned, some platforms don't support TF. Another reason is that TF's ResNet-ImageNet preprocessor resizes images in a non-standard fashion. PIL's resize method produces the same output as most other image processing libraries, making it easier to substitute preprocessors and keep results consistent.

The MobileNet/ResNet checkpoints were trained with TF's preprocessor, so using it would yield consistently better accuracy. One option would be to fine-tune the model with the PIL-based preprocessor and document its accuracy in addition to the 74.97% seen with TF's checkpoint. (BTW, I get 74.99% running ef82b4be38f2ce94f2432c2c4093ba567ebb34ec.)

I've not tried OpenCV, but my understanding is that it will give the same results as PIL (https://stackoverflow.com/a/54497253/9730945).

guschmue commented 5 years ago

Agree with all you said. Another reason why I don't like the TF route is that eventually we want to have the pre-processing be part of the timed path so it reflects a real user workload. While this was out of scope in v0.5, it might bubble up again in the next version. That is also the reason why pre-processing is built into the app instead of using a simple script. I'll spend a little time on this today.

guschmue commented 5 years ago

This https://github.com/mlperf/inference/pull/243 gets us pretty close:

    rt               resnet50  mobilenet
    tf-gpu           76.12     71.228
    tf-cpu           76.12     71.228
    onnxruntime-cpu  76.12     71.228

rwightman commented 5 years ago

FYI, I came across this issue while searching for something inference-related. I have spent time looking at this concern in the past, so I have a few cents to offer.

PIL, TF, and OpenCV2 preprocessing will all produce different results with the same set of model weights, even with all other parameters of the preprocessing held equal; the JPEG decoding and interpolation implementation specifics have an impact.

The impact varies from model to model; some are much more sensitive. A training or even validation scheme that took multiple preprocessing pipelines into consideration would likely end up with a model more robust to all of them.

I have some measurements recorded for reference. The models are all ResNet-50 and weight-compatible: you can load them with the same PyTorch model def, and they should also be compatible with the NVIDIA ResNet50-v1.5. The overall trend across all of them is that the best validation score is usually associated with the interpolation method and the specific library used in training.

My ResNet-50 model, trained with a PyTorch PIL pipeline with a 50/50 blend of bilinear and bicubic interpolation:

PIL bicubic - 78.47
PIL bilinear - 78.36
CV2 bicubic - 77.51
CV2 bilinear - 78.02
Tensorflow bicubic - 77.52
Tensorflow bilinear - 77.95

PyTorch torchvision ResNet-50, trained with a PIL pipeline with bilinear interpolation:

PIL bicubic - 75.858
PIL bilinear - 76.13
CV2 bicubic - 74.096
CV2 bilinear - 75.308
TF bicubic - 74.046
TF bilinear - 75.332

MXNet Gluon ResNet-50 V1B, trained in MXNet by Amazon, ported to PyTorch by me. MXNet Gluon models look like they're trained with bilinear interpolation using OpenCV2 underneath:

PIL bicubic - 77.578
PIL bilinear - 77.394
CV2 bicubic - 77.412
CV2 bilinear - 77.676
Tensorflow bicubic - 77.412
Tensorflow bilinear - 77.646

I don't have numbers written down for a TF-trained model running with PIL validation, but I did run some tests when porting EfficientNet models and recall there being 0.4-0.5ish differences in top-1 for the b0 that I was looking at initially.

So, they're all different, but OpenCV and TF do seem to be pretty close most of the time. I don't have enough data to say it's a rule, but it seems that PIL is the odd duck out in this trio. PIL is also the slowest, with OpenCV being much faster at runtime.

guschmue commented 5 years ago

> trained with PyTorch PIL pipeline with a 50/50 blend of bilinear and bicubic

... very clever, love it.

For the accuracy - it's pretty close now; let me check with some folks on whether we need to get closer.

christ1ne commented 5 years ago

We agreed to switch to OpenCV.
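
For illustration, a minimal sketch of an OpenCV-based eval preprocessing in the spirit of that decision, assuming opencv-python is installed; the function name, interpolation flag, and crop sizes are illustrative, not the final MLPerf code:

    import cv2
    import numpy as np

    def preprocess(path, out_size=224, resize_min=256):
        img = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)  # OpenCV loads BGR
        # Scale the shorter side to resize_min with bilinear interpolation.
        h, w = img.shape[:2]
        scale = resize_min / min(h, w)
        img = cv2.resize(img, (int(round(w * scale)), int(round(h * scale))),
                         interpolation=cv2.INTER_LINEAR)
        # Central crop to out_size x out_size.
        h, w = img.shape[:2]
        top, left = (h - out_size) // 2, (w - out_size) // 2
        return img[top:top + out_size, left:left + out_size].astype(np.float32)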