jolibrain / deepdetect

Deep Learning API and Server in C++14 with support for Caffe, PyTorch, TensorRT, Dlib, NCNN, Tensorflow, XGBoost and TSNE
https://www.deepdetect.com/

Image Classification. Difference in statistics on test set. #535

Open EBazarov opened 5 years ago

EBazarov commented 5 years ago

Your question / the problem you're facing:

While training an image classifier on my train set, DeepDetect reports statistics on the test set that are completely different from what I see after re-predicting the same test set through API calls. For example, DeepDetect may report accp: 96.0 on the test set, but when I re-predict it using API calls I get accp: 77.0. The difference is huge and I don't know where it can come from.

So in short: when I train using train.lmdb and test.lmdb I get one set of statistics, but when I predict image by image it turns out the model does not show the same statistics on the same dataset.

Maybe @YaYaB will have something to add.

Error message (if any) / steps to reproduce the problem:

Training:

{
    "async": true,
    "data": [
        "/train.lmdb",
        "/test.lmdb"
    ],
    "parameters": {
        "input": {
            "connector": "image",
            "db": true,
            "height": 224,
            "shuffle": true,
            "test_split": -1.1,
            "width": 224
        },
        "mllib": {
            "class_weights": [
                1.0,
                1.0
            ],
            "gpu": true,
            "gpuid": 0,
            "net": {
                "batch_size": 16,
                "test_batch_size": 16
            },
            "resume": false,
            "solver": {
                "base_lr": 0.001,
                "gamma": 0.1,
                "iter_size": 1,
                "iterations": 800000,
                "lr_policy": "step",
                "momentum": 0.9,
                "snapshot": 40000,
                "solver_type": "ADAM",
                "stepsize": 80000,
                "test_initialization": true,
                "test_interval": 10000,
                "weight_decay": 1e-05
            }
        },
        "output": {
            "best": 2,
            "measure": [
                "accp",
                "mcll",
                "f1",
                "mcc"
            ]
        }
    }
}

Predicting:

{
        "data": [
            "/path/to/img1.png",
            "/path/to/img2.png",
            "/path/to/img3.png",
            ...
        ],
        "parameters": {
            "input": {
            },
            "mllib": {
                "gpu": True,
                "gpuid": 0
            },
            "output": {
                "best": 2
            }
        }
    }
YaYaB commented 5 years ago

Hi!

This might be due to the mean being subtracted twice: once via the mean.binaryproto, and a second time via the mean_values that can be specified in the deploy.prototxt.

I ran those tests locally and here is what I saw:

YaYaB commented 5 years ago

But I added another test:

Using the same model I don't get the same results; I see a difference in accuracy of up to 2%.

YaYaB commented 5 years ago

I did a small experiment based on the classification example provided here: https://www.deepdetect.com/tutorials/imagenet-classifier/.

I predicted the ambulance.jpg directly and I obtained the following:

curl -X POST "http://localhost:8081/predict" -d '{            
       "service":"imageserv",
       "parameters":{
         "input":{
           "width":224,
           "height":224
         },
         "output":{
           "best":3
         }
       },  
       "data":["/home/PATH/ambulance.jpg"]
     }'

{"status":{"code":200,"msg":"OK"},"head":{"method":"/predict","service":"imageserv","time":2455.0},"body":{"predictions":[{"classes":[{"prob":0.9932461977005005,"cat":"n02701002 ambulance"},{"prob":0.006546668708324432,"cat":"n03977966 police van, police wagon, paddy wagon, patrol wagon, wagon, black Maria"},{"prob":0.0000904839689610526,"last":true,"cat":"n03769881 minibus"}],"uri":"/home/PATH/ambulance.jpg"}]}}

Then I created an LMDB from ambulance.jpg and ran a prediction on it:

curl -X POST "http://localhost:8081/predict" -d '{         
       "service":"imageserv",
       "parameters":{
         "input":{
           "width":224,
           "height":224
         },
         "output":{
           "best":3
         }
       },
       "data":["/home/PATH/ambulance.lmdb"]
     }'
{"status":{"code":200,"msg":"OK"},"head":{"method":"/predict","service":"imageserv","time":1205.0},"body":{"predictions":[{"classes":[{"prob":0.9896200299263,"cat":"n02701002 ambulance"},{"prob":0.010075603611767292,"cat":"n03977966 police van, police wagon, paddy wagon, patrol wagon, wagon, black Maria"},{"prob":0.00011839399667223915,"last":true,"cat":"n03769881 minibus"}],"uri":"00000000_/home/PATH/ambulance.jpg"}]}}

We see that the predictions are not the same; they do not vary much, but they do vary (0.9896200299263 vs 0.9932461977005005 for the ambulance category).

YaYaB commented 5 years ago

I dug a bit into the code and found something strange. When the LMDB is created, we can see here that an encoding is specified via the function EncodeCVMatToDatum. However, when predicting directly from an image, we see here that no encoding is applied, since the function CVMatToDatum is used instead.

I modified the code in io.cpp to force the use of CVMatToDatum, and the predictions then matched exactly.
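
To see why the encoding matters at all, here is a small, self-contained OpenCV check (illustrative code, not DeepDetect's; ambulance.jpg is the example image from the tutorial above) showing that a JPEG encode/decode round trip changes the pixel values the network receives:

#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

int main() {
  // Decode the image once, as the direct-prediction path does.
  cv::Mat img = cv::imread("ambulance.jpg");

  // Round-trip through the JPEG codec, as the LMDB path effectively does.
  std::vector<uchar> buf;
  cv::imencode(".jpg", img, buf);                  // lossy re-encode
  cv::Mat roundtrip = cv::imdecode(buf, cv::IMREAD_COLOR);

  // Count the bytes that changed; with a lossy codec this is typically
  // non-zero, so the two paths feed slightly different pixels to the net.
  cv::Mat diff;
  cv::absdiff(img, roundtrip, diff);
  std::cout << "changed bytes: " << cv::countNonZero(diff.reshape(1)) << std::endl;
  return 0;
}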

beniz commented 5 years ago

Interesting, thanks. @fantes can you double-check this?

fantes commented 5 years ago

I will

beniz commented 5 years ago

@YaYaB can you try caffe::EncodeCVMatToDatum in place of CVMatToDatum here: https://github.com/jolibrain/deepdetect/blob/master/src/backends/caffe/caffeinputconns.h#L213 ?

Typically, like this:

caffe::EncodeCVMatToDatum(this->_images.at(i),guess_encoding(_uris.at(i)),&datum);

We use EncodeCVMatToDatum when creating the LMDBs to avoid storing potentially large volumes of data in raw format.

Let us know whether the change above works for you. It's not definitive, but we may make the encoding optional on the predict call. We don't want to force encoding, since the image has already been decoded, and on embedded devices re-encoding is a waste of compute.
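
(As an aside, the guess_encoding helper in the snippet above presumably derives the encoding from the URI's file extension; a hypothetical sketch of such a helper, not DeepDetect's actual implementation:)

#include <algorithm>
#include <string>

// Hypothetical helper: map a URI's file extension to the encoding string
// passed to caffe::EncodeCVMatToDatum ("jpg", "png", ...).
std::string guess_encoding(const std::string &uri) {
  std::string::size_type dot = uri.find_last_of('.');
  if (dot == std::string::npos)
    return "jpg";  // assumed default when there is no extension
  std::string ext = uri.substr(dot + 1);
  std::transform(ext.begin(), ext.end(), ext.begin(), ::tolower);
  return ext == "jpeg" ? std::string("jpg") : ext;
}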

Thanks @fantes for the code change.

YaYaB commented 5 years ago

@beniz it seems to work for the example https://www.deepdetect.com/tutorials/imagenet-classifier/. However, it does not work for our case; we get the following error when doing the prediction:

{"status":{"code":500,"msg":"InternalError","dd_code":1007,"dd_msg":"/build/opencv-ys8xiq/opencv-2.4.9.1+dfsg/modules/highgui/src/loadsave.cpp:356: error: (-215) buf.data && buf.isContinuous() in function imdecode_\n"}}

After digging a bit more, I found that this is due to the mean.binaryproto that we use in our case. If you add one to your classifier example, you get the same error.

fantes commented 5 years ago

You are right: just after this line, the mean values are subtracted, but that hand-written access assumes a non-encoded datum. A solution would be to subtract the mean on the cv::Mat and then call EncodeCVMatToDatum.

YaYaB commented 5 years ago

I understand; yes, that might be a good solution.

YaYaB commented 5 years ago

However, you may hit an issue because the matrix contains uchar values. Subtracting the mean from a cv::Mat can do nasty things, such as negative values being saturated to uchar. It might be necessary to convert it to a float matrix first.
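
A minimal illustration of the pitfall in plain OpenCV (not DeepDetect code; the mean values are the usual ImageNet BGR means, used here only as an example):

#include <opencv2/opencv.hpp>
#include <iostream>

int main() {
  // A 1x1 BGR pixel darker than the mean.
  cv::Mat img(1, 1, CV_8UC3, cv::Scalar(10, 10, 10));
  cv::Scalar mean(104, 117, 123);

  // Subtracting directly on uchar saturates: 10 - 104 becomes 0, not -94.
  cv::Mat bad = img - mean;

  // Converting to float first preserves the negative values the net expects.
  cv::Mat good;
  img.convertTo(good, CV_32FC3);
  good -= mean;

  std::cout << "uchar: " << (int)bad.at<cv::Vec3b>(0, 0)[0]   // prints 0
            << "  float: " << good.at<cv::Vec3f>(0, 0)[0]     // prints -94
            << std::endl;
  return 0;
}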

YaYaB commented 5 years ago

I dug a bit into the code and found something strange. When the LMDB is created, we can see here that an encoding is specified via the function EncodeCVMatToDatum. However, when predicting directly from an image, we see here that no encoding is applied, since the function CVMatToDatum is used instead.

I modified the code in io.cpp to force the use of CVMatToDatum, and the predictions then matched exactly.

Sorry, in that test I made another modification that I did not mention: I also changed how the resize is done for the LMDB. As you can see here, the resize uses either CV_INTER_NN or CV_INTER_LINEAR, but when doing a direct prediction (here: https://github.com/jolibrain/deepdetect/blob/master/src/backends/caffe/caffeinputconns.cc#L624), the resize uses CV_INTER_CUBIC.

So I ran the following tests (see the sketch after this list):

- Modify only the resize method ==> predictions differ
- Modify only the LMDB encoding (remove it) ==> predictions differ
- Modify both the resize method and the LMDB encoding ==> predictions are the same
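
The interpolation mismatch alone is easy to reproduce with plain OpenCV (illustrative code, not DeepDetect's; cv::INTER_LINEAR and cv::INTER_CUBIC are the C++ equivalents of CV_INTER_LINEAR and CV_INTER_CUBIC):

#include <opencv2/opencv.hpp>
#include <iostream>

int main() {
  cv::Mat img = cv::imread("ambulance.jpg");

  // LMDB-creation path: linear interpolation.
  cv::Mat linear, cubic;
  cv::resize(img, linear, cv::Size(224, 224), 0, 0, cv::INTER_LINEAR);

  // Direct-prediction path: cubic interpolation.
  cv::resize(img, cubic, cv::Size(224, 224), 0, 0, cv::INTER_CUBIC);

  // The two resized images differ, so the network sees different inputs.
  cv::Mat diff;
  cv::absdiff(linear, cubic, diff);
  std::cout << "differing bytes: " << cv::countNonZero(diff.reshape(1)) << std::endl;
  return 0;
}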
beniz commented 5 years ago

Hi @YaYaB, this is almost certainly because your model is overfitted. The difference in predictions between the encoded and raw versions of images is very likely also a consequence of overfitting. Interestingly, this makes the test with and without LMDB a good proxy for measuring how much an image model overfits. We've talked about it internally, and it actually sounds useful :)

We may provide an encode boolean API parameter that would force encoding on POST /predict, but it would be false by default, since it does not really make sense to decode/encode at predict time versus transmitting the raw pixels straight to GPU / CPU memory.

Now, in practice, you should definitely use data augmentation at training time. It would actually be interesting to see whether, and by how much, the prediction accuracy gap shrinks with data augmentation applied.

YaYaB commented 5 years ago

@beniz I understand your point; however, this does not resolve the issue. When you build a machine learning model, you apply the same preprocessing (learned on the training data) to both the train and the test set. Here, this is not the case: the way the images are resized and encoded is not the same.

Yes, that might indicate the model overfitted, yet I am not sure about this when you see papers about fooling neural networks by changing a patch or a few pixels in an image (paper_1, paper_2).

We may provide an encode boolean API parameter that would force encoding on POST /predict, but it would be false by default, since it does not really make sense to decode/encode at predict time versus transmitting the raw pixels straight to GPU / CPU memory.

Yes, it does not make much sense, but it resolves part of the issue. As for the resize: why is the interpolation method different between LMDB creation and prediction?

beniz commented 5 years ago

As explained, we may provide an encode boolean API parameter to allow re-encoding, as with LMDB, over POST /predict.

beniz commented 5 years ago

Yes, that might indicate the model overfitted, yet I am not sure about this when you see papers about fooling neural networks by changing a patch or a few pixels in an image (paper_1, paper_2).

That work searches the whole pixel state-space for the pixel(s) that maximize changes in the model output; it is not about linear interpolations of pixels around existing patches, as introduced by compression methods. Experimenting with data augmentation should give you the empirical validation you may be looking for.

YaYaB commented 5 years ago

See the difference in the ambulance experiment above. It's not much (0.9896200299263 vs 0.9932461977005005), but we are talking about a 1,000-class model; with a binary model the gap may be larger. Moreover, I did not say that this always happens, only that I saw a difference of a few percentage points (between 0 and 5%) in accuracy. So on some images the model changes its prediction between the LMDB and the raw image. This is just about consistency, nothing more.

beniz commented 5 years ago

We see that the predictions are not the same; they do not vary much, but they do vary (0.9896200299263 vs 0.9932461977005005 for the ambulance category).

But the class is correct, because confidence is not accuracy. For confidence fluctuation, you may want to look at https://arxiv.org/abs/1706.04599 and our implementation https://github.com/jolibrain/libtscaling.

Calibration is what you need to robustify any analysis built on top of model predictions. You might still see variations due to encoding, cropping, or any geometric transform of the input images, but confidence after calibration translates into a readable probability of error, which I believe is what you are looking for.
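
For reference, temperature scaling (the method from the paper above) simply divides the logits by a scalar T fitted on a validation set before the softmax; a minimal sketch, independent of libtscaling's actual API:

#include <algorithm>
#include <cmath>
#include <vector>

// Minimal sketch of temperature scaling: soften the logits by a scalar T
// (fitted on held-out data) before the softmax, so reported confidences
// better match empirical error rates. T = 1 recovers the plain softmax.
std::vector<double> calibrated_softmax(const std::vector<double> &logits,
                                       double T) {
  std::vector<double> probs(logits.size());
  double max_logit = *std::max_element(logits.begin(), logits.end());
  double sum = 0.0;
  for (std::size_t i = 0; i < logits.size(); ++i) {
    probs[i] = std::exp((logits[i] - max_logit) / T);  // numerically stable
    sum += probs[i];
  }
  for (double &p : probs)
    p /= sum;
  return probs;
}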

In summary, and outside of any API changes we may be able to provide to fill the gap:

YaYaB commented 5 years ago

Thanks for the interesting reading, I'll have a look. Still, it does not directly address the issue, which is: predicting an image inside an LMDB and outside of it does not give the same result.

As a user, I am not supposed to know that the transformations applied when building the LMDB differ from those applied at prediction time. Short of an API change (for the differing encode and resize), it would be nice to at least document somewhere that predictions over an LMDB differ from predictions on raw images.

YaYaB commented 5 years ago

Here is the fix I made so that, at prediction time, the data is encoded and then decoded. Moreover, the resize method is set to CV_INTER_LINEAR to be strictly identical to what is done during LMDB creation: https://github.com/YaYaB/deepdetect/commit/b85e7a9442616c6e8b42aed87c30ae57da179a18
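
In spirit, the patch makes the direct-prediction path mirror the LMDB path; a rough sketch of the idea (the function name is hypothetical, see the commit above for the actual change):

#include <opencv2/opencv.hpp>
#include <vector>

// Hypothetical sketch: preprocess a raw image the way the LMDB path does,
// i.e. resize with linear interpolation, then round-trip through the codec.
cv::Mat prepare_like_lmdb(const cv::Mat &img, int width, int height) {
  cv::Mat resized;
  cv::resize(img, resized, cv::Size(width, height), 0, 0,
             cv::INTER_LINEAR);                // same interpolation as LMDB
  std::vector<uchar> buf;
  cv::imencode(".jpg", resized, buf);          // encode, as LMDB creation does
  return cv::imdecode(buf, cv::IMREAD_COLOR);  // decode, as LMDB reading does
}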

YaYaB commented 4 years ago

Hey guys, any news on this? At least on providing a way to encode and decode at prediction time (as is done for LMDB)?