microsoft / ELL

Embedded Learning Library
https://microsoft.github.io/ELL

Darknet- and converted ELL-model give different inference results #138

Open sdudeck opened 6 years ago

sdudeck commented 6 years ago

Hello, I am trying to convert a small Darknet-based CNN (originating from https://github.com/ashitani/darknet_mnist, working on the MNIST dataset) to ELL. I trained the Darknet model and then followed the tutorial on this page for converting Darknet models to ELL. After training and before converting, I removed the cost layer and the dropout layer from the original Darknet model, as they are used for training only, as far as I have understood. (I did this because at first the Darknet cost layer gave me a warning message during conversion - "sse not known" or something like that - and the dropout layer also did not seem to be converted into the ELL model.)

After figuring out that I need to feed the MNIST images not in the color channel range [0..1] (as in the Darknet framework) but in [0..255] (since the ELL model automatically includes a scaling layer), I ran the model on the same MNIST images in both the Darknet and ELL frameworks. I checked that both models get the same array/vector of values (2352 floats, i.e. 28x28x3) in the same order (apart from the scaling mentioned above).
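A minimal sketch of the range adjustment described above (using NumPy and a hypothetical random array in place of a real MNIST image):

```python
import numpy as np

# Darknet consumes pixel values in [0, 1]; the imported ELL model includes a
# scaling layer and therefore expects raw values in [0, 255].
# Hypothetical stand-in for a 28x28x3 MNIST image (2352 floats, as in the issue).
darknet_input = np.random.rand(28, 28, 3).astype(np.float32)   # range [0, 1]
ell_input = darknet_input * 255.0                              # range [0, 255]

# Same values, same order; only the channel range differs.
flat = ell_input.ravel()
print(flat.size)  # 2352
```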

The problem is that I get very different prediction results from the two models. E.g. on one image the Darknet model gives 93% for the most probable class (which is the right one), whereas the converted ELL model gives only 17% for that class. It is still the most probable, but I would expect prediction results much closer to each other, as the model structure and weights should be (nearly) the same.

Result of the Darknet model:

    data/mnist/images/v_01862_c4.png: Predicted in 0.054000 seconds.
    c4: 0.933212
    c9: 0.046099
    c8: 0.008588
    c5: 0.003347
    c7: 0.002880

Result of the model converted to ELL:

    D:\Crest\Libs\darknet_mnist\data\mnist\images\v_01862_c4.png
    (17%) 4  (16%) 1  (12%) 9  (11%) 5  (11%) 3
    Mean prediction time: 7ms/frame

Right now I have no idea what could cause this huge difference or where to look further.

I have attached the original darknet-cfg file (mnist_lenet.cfg) as well as the one used for converting and doing the inference in darknet (mnist_lenet.nodropout_nocost.cfg) and the converted ELL-file (I have removed the weights from that file, otherwise it would have 40 MB).

mnist_lenet.cfg.txt mnist_lenet.nodropout_nocost.cfg.txt mnist_lenet_woweights.ell.txt

Thank you very much, Sven

byronChanguion commented 6 years ago

I noticed that in your ELL file, the first FullyConnectedLayer is correctly followed by a Bias, but is missing the ReLUActivationLayer. I imported both your configs, and the resulting ELL files all have an activation layer between the last two FullyConnectedLayers, i.e. the end of the network should look like:

    {
      "_type": "FullyConnectedLayer<float>",
      "_version": "0",
      "inputPaddingScheme": 0,
      "inputPaddingSize": 0,
      "outputShape": [1, 1, 1024],
      "outputPaddingScheme": 0,
      "outputPaddingSize": 0,
      "weights_rows": 1024,
      "weights_columns": 3136,
      "weights_values": [#deleted#]
    }, 
    {
      "_type": "BiasLayer<float>",
      "_version": "0",
      "inputPaddingScheme": 0,
      "inputPaddingSize": 0,
      "outputShape": [1, 1, 1024],
      "outputPaddingScheme": 0,
      "outputPaddingSize": 0,
      "bias": [#deleted#]
    }, 
    {
      "_type": "ActivationLayer<float,ReLUActivation>",
      "_version": "0",
      "inputPaddingScheme": 0,
      "inputPaddingSize": 0,
      "outputShape": [1, 1, 1024],
      "outputPaddingScheme": 0,
      "outputPaddingSize": 0
    }, 
    {
      "_type": "FullyConnectedLayer<float>",
      "_version": "0",
      "inputPaddingScheme": 0,
      "inputPaddingSize": 0,
      "outputShape": [1, 1, 10],
      "outputPaddingScheme": 0,
      "outputPaddingSize": 0,
      "weights_rows": 10,
      "weights_columns": 1024,
      "weights_values": [#deleted#]
    }, 
    {
      "_type": "BiasLayer<float>",
      "_version": "0",
      "inputPaddingScheme": 0,
      "inputPaddingSize": 0,
      "outputShape": [1, 1, 10],
      "outputPaddingScheme": 0,
      "outputPaddingSize": 0,
      "bias": [#deleted#]
    }, 
    {
      "_type": "SoftmaxLayer<float>",
      "_version": "0",
      "inputPaddingScheme": 0,
      "inputPaddingSize": 0,
      "outputShape": [1, 1, 10],
      "outputPaddingScheme": 0,
      "outputPaddingSize": 0
    }],
    "output": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  }

Can you confirm whether importing using the bits in master now produces a model which includes the correct activation layer?

jesuspicazo commented 6 years ago

Same thing happens to me. After importing my Darknet-trained network to ELL, it gives very bad results compared to the tests I performed using just Darknet. What could be the reason?

byronChanguion commented 6 years ago

Can you share your Darknet config and weights files so we can try to import and reproduce the problem?

sdudeck commented 6 years ago

Hello, find attached the two files with the mnist-darknet model.

mnist_lenet.cfg.txt mnist_lenet.weights.txt

I tried two things yesterday:

  1. I just inserted the missing activation layer in the ELL file as in the snippet above and recompiled it. This newly compiled model gave slightly different results, but the quality didn't really improve.
  2. I updated my local files from the GitHub repository and recompiled the ELL stuff. Now the darknet-to-ELL import does not work anymore (ImportError: cannot import name 'MapCompilerOptions'; tried it on two different Darknet models). Full message:

    (py36-ell-env) D:\Crest\DarknetModels>python ./../libs/ell/tools/importers/darknet/darknet_import.py mnist_lenet.cfg mnist_lenet.weights
    Traceback (most recent call last):
      File "./../libs/ell/tools/importers/darknet/darknet_import.py", line 22, in <module>
        import darknet_to_ell
      File "D:\Crest\libs\ell\tools\importers\darknet\darknet_to_ell.py", line 22, in <module>
        import ell
      File "D:\Crest\libs\ell\build\interfaces\python\package\ell\__init__.py", line 22, in <module>
        from . import model
      File "D:\Crest\libs\ell\build\interfaces\python\package\ell\model\__init__.py", line 9, in <module>
        from ..ell_py import \
    ImportError: cannot import name 'MapCompilerOptions'

Thanks for helping, Sven

jesuspicazo commented 6 years ago

I have trained a CNN using Darknet to distinguish between 3 classes of robots. I need ELL to deploy this network on a Raspberry Pi which is currently on board another robot. The thing is that when I test the network as it comes out of Darknet, it reaches around 90-95% accuracy. I import the network as indicated in the tutorial and everything seems to be fine, but when I try it, the percentages I obtain are almost always the same and are wrong, and they are not at all similar to the results obtained when testing with Darknet. I'm attaching the cfg and weights files as requested.

robotsGardenCressDoubleFC.cfg.zip

robotsGardenCressDoubleFC.weights.zip

Thank you so much for this amazing tool and your dedication.

byronChanguion commented 6 years ago

Thanks for the model .cfg and .weights files! I was able to reproduce the problem and found what was causing the errors:

  1. ReLU activation was being skipped by the importer in [connected] layers.
  2. The weights of the [connected] layer need to be transposed by the importer.

After fixing those, I get the same results as Darknet. I'll run a few more tests and push a fix within the next couple of days. As a temporary workaround, try replacing the ELL/tools/importers/Darknet/darknet_to_ell.py file with darknet_to_ell.zip
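For anyone curious why un-transposed [connected] weights produce plausible-looking but wrong probabilities rather than a crash: a fully connected layer's flat weight buffer can be reshaped either way without a dimension error, but the resulting products differ. A minimal NumPy sketch (shapes borrowed from the issue's last FC layer, 1024 inputs to 10 outputs; the random data is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
flat = rng.standard_normal(10 * 1024).astype(np.float32)  # raw weight buffer
x = rng.standard_normal(1024).astype(np.float32)          # layer input

# Darknet stores [connected] weights as (outputs, inputs):
correct = flat.reshape(10, 1024) @ x
# An importer that forgets to transpose effectively computes this instead:
wrong = flat.reshape(1024, 10).T @ x

# Both are valid (10,)-vectors, so nothing crashes -- the predictions are
# just silently wrong, matching the symptom reported in this thread.
print(correct.shape == wrong.shape, np.allclose(correct, wrong))
```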
jesuspicazo commented 6 years ago

Thank you very much for your response. I've tried replacing darknet_to_ell.py, but it isn't working. In fact, in this case the predictions are always the same, no matter how different the test images are. I'll be waiting for your fix. Thank you again.

sdudeck commented 6 years ago

Thanks a lot.

With the workaround py-file I get the ReLU layer inserted in the ELL file, and the inference output values are different but still do not match the Darknet values. So I will wait as well for the complete fix.

P.S.: I got rid of the 'MapCompilerOptions'-error mentioned above by pulling a clean version of the current ELL-repository and compiling it again.

jesuspicazo commented 6 years ago

Hi, I'm relaunching this issue because I'm still not able to import a Darknet-trained CNN properly using the import tool from ELL. To make this problem easily reproducible, I have trained a very simple CNN: the one from the Darknet tutorial for training a classifier on the CIFAR-10 dataset, https://pjreddie.com/darknet/train-cifar/ After training the network, I import the model exactly the way it is explained in the ELL C++ tutorial, but when I try to recognize the images of the CIFAR-10 test set I obtain the following: -nan -nan -nan

I have also observed that this happens when I set the activation of the convolutional layers to 'leaky'. When I turn them into 'relu' it doesn't give '-nan', but the results are very bad and are always the same, even with very different test images. Here I attach the .cfg and .weights files so the problem can be tested: cifar_small.cfg.zip cifar_small.weights.zip
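For reference, Darknet's 'leaky' activation is a leaky ReLU (negative inputs are scaled by 0.1 rather than zeroed). A minimal NumPy sketch of the two activations being discussed, showing how they diverge on negative inputs:

```python
import numpy as np

def relu(x):
    """Plain ReLU: negative values become zero."""
    return np.maximum(0.0, x)

def leaky(x, slope=0.1):
    """Darknet-style leaky ReLU: negative values are scaled by `slope`."""
    return np.where(x > 0, x, slope * x)

x = np.array([-2.0, -0.5, 0.0, 1.5], dtype=np.float32)
print(relu(x))   # [ 0.    0.    0.    1.5 ]
print(leaky(x))  # [-0.2  -0.05  0.    1.5 ]
```

If an importer drops or mistranslates the leaky activation, negative activations propagate very differently from Darknet's, which is consistent with the degraded (or NaN) results reported above.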

For more details: I've tested this network using Darknet and it works as expected, so I don't know whether I'm making some kind of mistake when using the darknet_import.py tool, because I don't know what else it could be. I've been dealing with this issue for about 2 months now and any help would be highly appreciated.

Thanks a lot in advance.

Cheers.

lovettchris commented 6 years ago

Thanks for the bug report, I have filed this internally to make sure we take a look and fix it.