jolibrain / deepdetect

Deep Learning API and Server in C++14 with support for Caffe, PyTorch, TensorRT, Dlib, NCNN, Tensorflow, XGBoost and TSNE
https://www.deepdetect.com/

Yahoo open_nsfw model #325

Open · cchadowitz opened this issue 7 years ago

cchadowitz commented 7 years ago

Creating an issue to track the investigation into why the confidences differ between classifying images with the Yahoo open NSFW model via the provided python script (which uses pycaffe) and using the same model within DeepDetect.

I'm using this 224x224 black image (black-224x224) to test.

I've compiled the caffe branch from beniz/caffe at the latest commit (4ed34f50674e040715c6fdd2d3a296dfc2e23793), which is the same as in use for DeepDetect itself at the latest commit (7abe055ba8ad5ab4ea8259e90b0a70b53abcecf7). Both are only using CPU.

My modified python script (based on the original from the Yahoo open_nsfw repo) removes the center-crop preprocessing step (as DeepDetect does not do any center-crop), and adds arguments to easily extract the data from specified layers to disk to allow direct comparison with the output from DeepDetect. Example:

python extract_layer_nsfw.py /path/to/black-224x224.jpg --extract-layer data --layer-file ./data_layer.txt --model_def /path/to/open_nsfw/nsfw_model/deploy.prototxt --pretrained_model /path/to/open_nsfw/nsfw_model/resnet_50_1by2_nsfw.caffemodel
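The core of the extraction logic is roughly the following (a simplified sketch, not the actual extract_layer_nsfw.py; the function name and preprocessing details here are only illustrative):

# Sketch: load the model with pycaffe, preprocess without the center crop,
# run a forward pass, and dump the requested blob to a text file for diffing
# against DeepDetect.
import numpy as np
import caffe

def extract_layer(image_path, layer, out_file, model_def, pretrained_model):
    net = caffe.Net(model_def, pretrained_model, caffe.TEST)

    # Preprocessing roughly mirroring the original open_nsfw script, minus the
    # center crop: resize to 224x224, HWC->CHW, RGB->BGR, scale to [0, 255],
    # subtract the per-channel mean.
    img = caffe.io.load_image(image_path)                 # float RGB in [0, 1]
    img = caffe.io.resize_image(img, (224, 224))
    transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
    transformer.set_transpose('data', (2, 0, 1))
    transformer.set_channel_swap('data', (2, 1, 0))
    transformer.set_raw_scale('data', 255)
    transformer.set_mean('data', np.array([104.0, 117.0, 123.0]))

    net.blobs['data'].data[...] = transformer.preprocess('data', img)
    net.forward()

    # One value per line, to make diffing against the DeepDetect dump easy.
    np.savetxt(out_file, net.blobs[layer].data.ravel(), fmt='%.4f')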

To compare with DeepDetect, I essentially load the NSFW model into DeepDetect (let me know if you want a copy of my deploy.prototxt used with DeepDetect):

curl -X PUT "http://localhost:8080/services/nsfw0" -d '{
  "mllib":"caffe",
  "description":"nsfw",
  "type":"unsupervised",
  "parameters":{
    "input":{
      "connector":"image",
      "width":224,
      "height":224,
      "mean":[104, 117, 123]
    },
    "mllib":{
      "nclasses":2
    }
  },
  "model":{
    "repository":"/opt/models/nsfw/"
  }
}'

Then to get the data values from DeepDetect, I use:

curl -X POST "http://localhost:${SVCPORT}/predict" -d '{
  "service":"nsfw0",
  "parameters":{
    "mllib":{
      "extract_layer": "data"
    }
  },
  "data":["/path/to/black-224x224.jpg"]
 }' | tr , '\n' > dd-layer-data.txt

I use tr to convert the commas to newlines to make it easier to diff against the output from the python script.

I can also attach the output I have for both the python and DD versions if need be.
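To diff the dumps numerically rather than eyeballing them, something along these lines works (a hypothetical helper, assuming each file has been reduced to one numeric value per line, with the JSON keys and punctuation stripped from the DeepDetect output):

import numpy as np

# Load both dumps: one value per line.
dd = np.loadtxt('dd-layer-data.txt')
py = np.loadtxt('data_layer.txt')

diff = np.abs(dd - py)
print('max abs diff: %g, mean abs diff: %g' % (diff.max(), diff.mean()))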

cchadowitz commented 7 years ago

For the data layer, the values matched exactly.

For the next layer (conv_1), the values vastly diverge. I've uploaded the two outputs to gists, but they are fairly large.

Gists: Python script conv_1 · DeepDetect conv_1

cchadowitz commented 7 years ago

Comparing the python script on CPU and GPU at the conv_1 layer:

$ diff -u layer-conv_1-gpu.txt layer-conv_1-cpu.txt 
--- layer-conv_1-gpu.txt    2017-06-12 13:14:09.576400426 -0400
+++ layer-conv_1-cpu.txt    2017-06-12 11:30:17.987031126 -0400
@@ -363774,7 +363774,7 @@
 0.0716
 0.0716
 0.6634
-1.1228
+1.1227
 0.4352
 0.3858
 0.3858
cchadowitz commented 7 years ago

And comparing DD CPU and DD GPU there is no numerical difference whatsoever.

beniz commented 7 years ago

(let me know if you want a copy of my deploy.prototxt used with DeepDetect)

Yes, this would be useful because DD uses MemoryDataLayer and I don't see it listed in the net's layers, so I'm not sure it's possible to actually get the output of data from the DD API.

cchadowitz commented 7 years ago

I'm not sure what you mean by not seeing it listed in the net's layers - do you mean the one from the official repo? It uses a standard data input layer, so I converted it to use a MemoryDataLayer for DD:

layer {
  name: "data"
  type: "MemoryData"
  top: "data"
  top: "label"
  memory_data_param {
    batch_size: 10
    channels: 3
    height: 224
    width: 224
  }
}

See the full modified deploy.prototxt here.

cchadowitz commented 7 years ago

Just an update after more lengthy investigation.

There are a couple of ways to do a forward pass in Caffe (whether using pycaffe or the C++ API).

Some layers (e.g. BatchNormalization, Scale, and ReLU) are in-place operators. For this model, the first few layers are: data -> conv_1 -> bn_1 -> scale_1 -> relu_1

If we let the forward pass execute beyond conv_1, then the values extracted from conv_1 are actually the result of the in-place operators as well. E.g. after a full forward pass, the values extracted from conv_1 are really the output of conv_1 + bn_1 + scale_1 + relu_1, since the latter three operate in place on the same blob. If we only do a forward pass up to (and including) conv_1, then the values returned are strictly the output of conv_1.

This difference is the reason for the drastic disparity in values between the python script and DD. In DD, when asked to extract layers for a <given layer>, the forward pass is executed using ForwardFromTo(<first layer>, <given layer>) and no further. The python script is using Forward_all() and then returning the values from <given layer>, which for conv_1 was actually after all the in-place operators were executed as well.

Modifying the python script to also use the equivalent ForwardFromTo(<start>, <end>) method while extracting layer values resulted in the same values between the python script and DD for all the initial layers, all the eltwise layers in the ResNet portion, and all the final layers (to within ~10^-5).
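For illustration, the pycaffe side of this looks roughly as follows (a minimal sketch; the zeroed dummy input stands in for an image preprocessed as in the script above):

import numpy as np
import caffe

net = caffe.Net('deploy.prototxt', 'resnet_50_1by2_nsfw.caffemodel', caffe.TEST)

# Dummy preprocessed input; in practice, the image prepared as in the script above.
net.blobs['data'].data[...] = np.zeros(net.blobs['data'].data.shape, dtype=np.float32)

# Full forward pass: by the time we read conv_1, its blob has been overwritten
# in place by bn_1 / scale_1 / relu_1.
net.forward()
after_inplace = net.blobs['conv_1'].data.copy()

# Stop the forward pass at conv_1 (pycaffe's equivalent of
# ForwardFromTo(<first layer>, conv_1)): the blob now holds only the raw
# convolution output, which is what DeepDetect's extract_layer returns.
net.forward(end='conv_1')
conv_only = net.blobs['conv_1'].data.copy()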

However, when passing a color image in the same way, the values returned from the initial data layer differ by small amounts. These differences likely propagate through the rest of the network forward pass, though I'm not yet sure if this is sufficient and conclusive evidence as to why the final confidences can differ so much.

Edit: I've updated my modified python script here so that it no longer does a full forward pass before extracting the layer values and instead only does a forward pass up to and including the specified layer.

cchadowitz commented 7 years ago

Another update:

It seems that with solid color images the two approaches behave identically. When photograph-like color images are used, slight differences in pixel values introduced while loading the image from disk and preprocessing it propagate through the network and cause the confidences to diverge, which suggests that this model is highly sensitive to color, center-crop, etc., and is likely overfit.

I compared three different methods of loading and preprocessing the input images in python: PIL, skimage, and OpenCV (cv2).

The skimage and opencv2 methods seem to result in the most similar outputs, and are closest to the outputs produced by DeepDetect. The PIL method seems to be the outlier.
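For reference, the comparison was along these lines (a rough sketch; the image path is a placeholder, and the resize filters and dtype handling are exactly where the small differences creep in):

import numpy as np
import cv2
from PIL import Image
import skimage.io
import skimage.transform

path = 'photo.jpg'   # placeholder path

# OpenCV: uint8 BGR -> convert to RGB for an apples-to-apples comparison.
im_cv = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
im_cv = cv2.resize(im_cv, (224, 224)).astype(np.float32)

# skimage: float RGB in [0, 1] -> scale back to [0, 255].
im_sk = skimage.transform.resize(skimage.io.imread(path), (224, 224)) * 255.0
im_sk = im_sk.astype(np.float32)

# PIL: uint8 RGB, resized with PIL's own filter.
im_pil = np.asarray(Image.open(path).convert('RGB').resize((224, 224)),
                    dtype=np.float32)

for name, delta in [('cv2 vs skimage', im_cv - im_sk),
                    ('cv2 vs PIL', im_cv - im_pil),
                    ('skimage vs PIL', im_sk - im_pil)]:
    print(name, 'max abs diff:', np.abs(delta).max())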

rperdon commented 6 years ago

I am looking at a similar discrepancy with loading models generated from DIGITS. AlexNet, VGG16 and GoogLeNet models are loaded using the same modification to deploy.prototxt, but when DeepDetect is run with those models, the output differs significantly from when the model is tested on the same image within DIGITS. The discrepancy is bad enough that DD will actually "flip" a binary classifier response.

Is a fix or model parameter option being developed that could align the original python script model behavior with the DD model flow?

beniz commented 6 years ago

I can't comment on DIGITS models specifically, beyond a warning that afaik it uses the NVIDIA fork of Caffe.

Most discrepancies may come from wrongly scaled or preprocessed inputs (bgr vs rgb or PIL as above).

rperdon commented 6 years ago

"Most discrepancies may come from wrongly scaled or preprocessed inputs (bgr vs rgb or PIL as above). "

Would there be a way to incorporate some flags to allow some input specific changes?

beniz commented 6 years ago

Could you please check #430 and see whether it fixes the issue, or at least some of the discrepancies with color images? My apologies, I should have checked for this potential bug earlier on; fortunately someone here pointed it out this morning.

cchadowitz commented 6 years ago

Just to keep a log here: #430 did not help the discrepancies in confidences between the yahoo open_nsfw python script and DeepDetect. I tried the 6 different orderings of the RGB mean values in DeepDetect, and none of the returned values lined up with (or really came close to) the output from the python script. In fact, it just seemed to show that subtracting mean values for this particular model causes it to return high confidences for the 'safe for work' class regardless of input. It seems that this model is either (a) sensitive to preprocessing/image loading methods, or (b) overfit, or both. I'll add comments here if/when I have more to add.
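For the record, the sweep over mean orderings can be automated roughly like this (a hypothetical sketch reusing the service-creation and predict calls from earlier in this thread; the service name, port, image path, and the 'prob' output blob are assumptions):

import itertools
import requests

DD = 'http://localhost:8080'
IMAGE = '/path/to/test-image.jpg'   # placeholder

for mean in itertools.permutations([104, 117, 123]):
    # Recreate the service with this mean ordering (mirrors the earlier curl calls).
    requests.delete(DD + '/services/nsfw0')
    requests.put(DD + '/services/nsfw0', json={
        'mllib': 'caffe', 'description': 'nsfw', 'type': 'unsupervised',
        'parameters': {
            'input': {'connector': 'image', 'width': 224, 'height': 224,
                      'mean': list(mean)},
            'mllib': {'nclasses': 2}},
        'model': {'repository': '/opt/models/nsfw/'}})
    # Extract the final softmax blob (assumed to be named 'prob' in this model).
    r = requests.post(DD + '/predict', json={
        'service': 'nsfw0',
        'parameters': {'mllib': {'extract_layer': 'prob'}},
        'data': [IMAGE]})
    print(mean, r.json())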

cchadowitz commented 6 years ago

Another update:

It appears after some discussion that b170a4a7445c856a3a4fe78f8cedee57446c5a5a has (mostly) resolved these discrepancies. Exact confidence values still vary slightly, but they're only off by ~0.03-0.04, which is pretty close.

The model still seems sensitive to preprocessing (which is why I added a crop param in #433), and still seems overfit, but at the very least it seems to now be performing within reasonable expectations with DeepDetect compared to the original implementation.

beniz commented 6 years ago

The crop probably did it, as the commit was reverted (the mean was already correctly subtracted, as you found out).

cchadowitz commented 6 years ago

Right - the mean subtraction (once implemented correctly) resolved this to within the 0.03-0.04 range, and the crop seems to bring outlying examples closer on top of that. I'd consider this (tentatively) resolved pending other evidence to the contrary :)