mlpack / models

models built with mlpack
https://models.mlpack.org/docs
BSD 3-Clause "New" or "Revised" License

Addition of MobileNet_V1 #72

Closed Aakash-kaushik closed 3 years ago

Aakash-kaushik commented 3 years ago

Hey, so I just pushed some code and the model is almost ready, but I can't verify the output because I don't know how to initialize all the weights in TensorFlow to a specific number. Do you guys know how to do that, @zoq @kartikdutt18?

kartikdutt18 commented 3 years ago

Take a look at https://www.tensorflow.org/api_docs/python/tf/keras/initializers/Constant

Aakash-kaushik commented 3 years ago

Hey, so I kind of hacked this: since it is a Keras model, the following works. Thanks for helping out.

import os
# Silence TensorFlow's C++ logging; this must be set before TensorFlow is imported.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import tensorflow as tf
import numpy as np

model = tf.keras.applications.mobilenet.MobileNet(
    input_shape=None, alpha=1.0, depth_multiplier=1, dropout=0.001,
    include_top=True, weights='imagenet', input_tensor=None, pooling=None,
    classes=1000, classifier_activation='softmax')

# Overwrite every weight array of every layer (kernels, biases,
# batch-norm parameters) with ones so the output is reproducible.
for layer in model.layers:
    weights = layer.get_weights()
    if weights:
        layer.set_weights([np.ones_like(w) for w in weights])

Aakash-kaushik commented 3 years ago

Would you guys be able to have a call about the verification and pretrained weights for this?

kartikdutt18 commented 3 years ago

~Can we schedule it for tomorrow?~ Conversation moved to IRC.

Aakash-kaushik commented 3 years ago

Hey, I didn't see this earlier. If you want, we can have it tomorrow.

Aakash-kaushik commented 3 years ago

So I tried converting and changing the model from this repo, but the code there targets TF 1.5 or so, i.e. something from before the TF 2.0 changes, so the usual PB files or the file format Keras saves won't work (the MobileNet v1 implementation in TF is a Keras one and uses Keras layers rather than the TF ones).

Second, I looked for something that could convert ONNX models to PyTorch and found onnx2pytorch (https://github.com/ToriML/onnx2pytorch). To try it, I saved a MobileNet v1 and passed it to tensorflow-onnx (https://github.com/onnx/tensorflow-onnx), which converts TensorFlow models to ONNX, and then used onnx2pytorch to translate the result from ONNX to PyTorch. That gave me dimension errors: the MobileNet I saved from TF takes [224, 224, 3] inputs because of the channels-last layout, while PyTorch expects [batch, channel, height, width]. The converter doesn't account for this, so the dimensions don't match.
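The layout mismatch described above comes down to axis order. Here is a minimal sketch, using a hypothetical helper `permute_shape` that is not part of any of the converters mentioned, of how a channels-last shape maps onto the channels-first order PyTorch expects:

```python
def permute_shape(shape, order):
    """Return `shape` with its axes rearranged according to `order`."""
    if sorted(order) != list(range(len(shape))):
        raise ValueError("order must be a permutation of the axis indices")
    return tuple(shape[axis] for axis in order)


# Hypothetical example: one 224x224 RGB image in TensorFlow's NHWC layout.
nhwc_shape = (1, 224, 224, 3)

# NHWC -> NCHW: keep batch first, move channels ahead of height and width.
NHWC_TO_NCHW = (0, 3, 1, 2)
nchw_shape = permute_shape(nhwc_shape, NHWC_TO_NCHW)

print(nchw_shape)  # (1, 3, 224, 224)
```

The same axis order is what one would hand to an actual transpose of the data (for instance `numpy.transpose` or `torch.Tensor.permute`); the sketch only tracks shapes.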

ONNX also provides some weights/models, but they only have MobileNetV2 and not V1 (https://github.com/onnx/models/tree/master/vision/classification/mobilenet).

So I wanted to discuss whether, at this point, it is worth spending more time trying to find weights for this, or whether I should spend my remaining time getting the vtable PR up to speed, which will help us implement onnx-to-mlpack and vice-versa conversions.

It's not that I am trying to drop this because it's hard; it's just not the best use of resources. We do have the architecture, the only problem is the weights. Even with the current architecture there are a huge number of variants available, since you can change the image dimensions, alpha, and depth multiplier, and the whole architecture scales proportionally with them.

cc: @kartikdutt18, @zoq
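To make the number of variants concrete, here is a small sketch. It assumes the width multipliers (0.25, 0.5, 0.75, 1.0) and input sizes (128, 160, 192, 224) commonly published for MobileNet v1, and a first-convolution width of 32 filters at alpha = 1.0; the comment above doesn't list the configurations, so treat these values as assumptions.

```python
from itertools import product

ALPHAS = (0.25, 0.5, 0.75, 1.0)       # assumed width multipliers
INPUT_SIZES = (128, 160, 192, 224)    # assumed square input resolutions
BASE_FIRST_CONV_FILTERS = 32          # first conv width at alpha = 1.0


def scaled_filters(base_filters, alpha):
    """Scale a layer's filter count by the width multiplier."""
    return int(base_filters * alpha)


variants = [
    {"alpha": alpha, "input_size": size,
     "first_conv_filters": scaled_filters(BASE_FIRST_CONV_FILTERS, alpha)}
    for alpha, size in product(ALPHAS, INPUT_SIZES)
]

print(len(variants))  # 16
for variant in variants[:4]:
    print(variant)
```

Every combination needs its own set of pretrained weights, which is why finding a source that ships all of them matters more than converting a single model.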

zoq commented 3 years ago

I'm not sure I see the reason to convert from ONNX to TF and from TF to PyTorch; maybe I'm missing something? We are only interested in the PyTorch model, right?

That said, I also looked for other projects and found https://github.com/ruotianluo/pytorch-mobilenet-from-tf, which provides models exported to PyTorch for v1 and v2 and is based on the TF implementation.

Aakash-kaushik commented 3 years ago

Hi @zoq, I checked this out, but can you help me load one of the weights? I am having a hard time figuring that out. That repo already has converted weights here, but I can't load them when I create the model from the mobilenet.py file, which has the model definition.

My simple code to load the weights:

import torch
from mobilenet import MobileNet

net = MobileNet() 
net.load_state_dict(torch.load("./weights/mobilenet_v1_1.0_224.pth"))
print(net)

Btw, if we can get this to work, we will have all the models :1st_place_medal:
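When `load_state_dict` fails, one generic first check is to compare the parameter names in the checkpoint with the names the model expects. The comment above doesn't show the actual error, so this is only a sketch with made-up key names; with PyTorch the two key lists would come from `model.state_dict().keys()` and from the keys of the loaded checkpoint.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Report parameter names that are present on only one side."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    return {
        "missing_in_checkpoint": sorted(model_keys - checkpoint_keys),
        "unexpected_in_checkpoint": sorted(checkpoint_keys - model_keys),
    }


# Hypothetical key names, for illustration only.
model_keys = ["features.0.weight", "features.1.weight", "features.1.bias"]
checkpoint_keys = ["features.0.weight", "features.1.weight",
                   "features.1.num_batches_tracked"]

report = diff_state_dict_keys(model_keys, checkpoint_keys)
print(report["missing_in_checkpoint"])     # ['features.1.bias']
print(report["unexpected_in_checkpoint"])  # ['features.1.num_batches_tracked']
```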

zoq commented 3 years ago

You are right, that one was built against a legacy PyTorch version. I just found https://github.com/ZFTurbo/MobileNet-v1-Pytorch and gave it a quick test:

import torch
from mobilenet_v1 import MobileNet_v1
model = MobileNet_v1(1000, alpha=0.25, input_size=128, include_top=False)

model.load_state_dict(torch.load("mobilenet_v1_size_128_alpha_0.25_no_top.pth"))
print(model)

It works, and it looks like the implementation matches the official implementation as well.

Aakash-kaushik commented 3 years ago

Thanks @zoq, this should mostly work, though I am facing a problem: the per-layer parameters/weights of my model match the PyTorch implementation exactly, but the total number of parameters is different. I have also accounted for layers having or not having a bias.

Here is the output of our model. The first number is the total number of params in our model, including bias. I summed up the bias from the layers, which is 0, but the number of params is still bigger than PyTorch's.

4242920
Conv bias: 32
Conv weights: 864
Batch Norm weights: 64
Sepa Conv bias: 32
Sepa Conv weights: 288
Batch Norm weights: 64
Conv bias: 64
Conv weights: 2048
Batch Norm weights: 128
Sepa Conv bias: 64
Sepa Conv weights: 576
Batch Norm weights: 128
Conv bias: 128
Conv weights: 8192
Batch Norm weights: 256
Sepa Conv bias: 128
Sepa Conv weights: 1152
Batch Norm weights: 256
Conv bias: 128
Conv weights: 16384
Batch Norm weights: 256
Sepa Conv bias: 128
Sepa Conv weights: 1152
Batch Norm weights: 256
Conv bias: 256
Conv weights: 32768
Batch Norm weights: 512
Sepa Conv bias: 256
Sepa Conv weights: 2304
Batch Norm weights: 512
Conv bias: 256
Conv weights: 65536
Batch Norm weights: 512
Sepa Conv bias: 256
Sepa Conv weights: 2304
Batch Norm weights: 512
Conv bias: 512
Conv weights: 131072
Batch Norm weights: 1024
Sepa Conv bias: 512
Sepa Conv weights: 4608
Batch Norm weights: 1024
Conv bias: 512
Conv weights: 262144
Batch Norm weights: 1024
Sepa Conv bias: 512
Sepa Conv weights: 4608
Batch Norm weights: 1024
Conv bias: 512
Conv weights: 262144
Batch Norm weights: 1024
Sepa Conv bias: 512
Sepa Conv weights: 4608
Batch Norm weights: 1024
Conv bias: 512
Conv weights: 262144
Batch Norm weights: 1024
Sepa Conv bias: 512
Sepa Conv weights: 4608
Batch Norm weights: 1024
Conv bias: 512
Conv weights: 262144
Batch Norm weights: 1024
Sepa Conv bias: 512
Sepa Conv weights: 4608
Batch Norm weights: 1024
Conv bias: 512
Conv weights: 262144
Batch Norm weights: 1024
Sepa Conv bias: 512
Sepa Conv weights: 4608
Batch Norm weights: 1024
Conv bias: 1024
Conv weights: 524288
Batch Norm weights: 2048
Sepa Conv bias: 1024
Sepa Conv weights: 9216
Batch Norm weights: 2048
Conv bias: 1024
Conv weights: 1048576
Batch Norm weights: 2048
Conv bias: 1000
Conv weights: 1024000

And below is the output from the PyTorch model:

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
     ConstantPad2d-1          [-1, 3, 245, 245]               0
            Conv2d-2         [-1, 32, 122, 122]             864
       BatchNorm2d-3         [-1, 32, 122, 122]              64
             ReLU6-4         [-1, 32, 122, 122]               0
            Conv2d-5         [-1, 32, 122, 122]             288
       BatchNorm2d-6         [-1, 32, 122, 122]              64
             ReLU6-7         [-1, 32, 122, 122]               0
            Conv2d-8         [-1, 64, 122, 122]           2,048
       BatchNorm2d-9         [-1, 64, 122, 122]             128
            ReLU6-10         [-1, 64, 122, 122]               0
    ConstantPad2d-11         [-1, 64, 123, 123]               0
           Conv2d-12           [-1, 64, 61, 61]             576
      BatchNorm2d-13           [-1, 64, 61, 61]             128
            ReLU6-14           [-1, 64, 61, 61]               0
           Conv2d-15          [-1, 128, 61, 61]           8,192
      BatchNorm2d-16          [-1, 128, 61, 61]             256
            ReLU6-17          [-1, 128, 61, 61]               0
           Conv2d-18          [-1, 128, 61, 61]           1,152
      BatchNorm2d-19          [-1, 128, 61, 61]             256
            ReLU6-20          [-1, 128, 61, 61]               0
           Conv2d-21          [-1, 128, 61, 61]          16,384
      BatchNorm2d-22          [-1, 128, 61, 61]             256
            ReLU6-23          [-1, 128, 61, 61]               0
    ConstantPad2d-24          [-1, 128, 62, 62]               0
           Conv2d-25          [-1, 128, 30, 30]           1,152
      BatchNorm2d-26          [-1, 128, 30, 30]             256
            ReLU6-27          [-1, 128, 30, 30]               0
           Conv2d-28          [-1, 256, 30, 30]          32,768
      BatchNorm2d-29          [-1, 256, 30, 30]             512
            ReLU6-30          [-1, 256, 30, 30]               0
           Conv2d-31          [-1, 256, 30, 30]           2,304
      BatchNorm2d-32          [-1, 256, 30, 30]             512
            ReLU6-33          [-1, 256, 30, 30]               0
           Conv2d-34          [-1, 256, 30, 30]          65,536
      BatchNorm2d-35          [-1, 256, 30, 30]             512
            ReLU6-36          [-1, 256, 30, 30]               0
    ConstantPad2d-37          [-1, 256, 31, 31]               0
           Conv2d-38          [-1, 256, 15, 15]           2,304
      BatchNorm2d-39          [-1, 256, 15, 15]             512
            ReLU6-40          [-1, 256, 15, 15]               0
           Conv2d-41          [-1, 512, 15, 15]         131,072
      BatchNorm2d-42          [-1, 512, 15, 15]           1,024
            ReLU6-43          [-1, 512, 15, 15]               0
           Conv2d-44          [-1, 512, 15, 15]           4,608
      BatchNorm2d-45          [-1, 512, 15, 15]           1,024
            ReLU6-46          [-1, 512, 15, 15]               0
           Conv2d-47          [-1, 512, 15, 15]         262,144
      BatchNorm2d-48          [-1, 512, 15, 15]           1,024
            ReLU6-49          [-1, 512, 15, 15]               0
           Conv2d-50          [-1, 512, 15, 15]           4,608
      BatchNorm2d-51          [-1, 512, 15, 15]           1,024
            ReLU6-52          [-1, 512, 15, 15]               0
           Conv2d-53          [-1, 512, 15, 15]         262,144
      BatchNorm2d-54          [-1, 512, 15, 15]           1,024
            ReLU6-55          [-1, 512, 15, 15]               0
           Conv2d-56          [-1, 512, 15, 15]           4,608
      BatchNorm2d-57          [-1, 512, 15, 15]           1,024
            ReLU6-58          [-1, 512, 15, 15]               0
           Conv2d-59          [-1, 512, 15, 15]         262,144
      BatchNorm2d-60          [-1, 512, 15, 15]           1,024
            ReLU6-61          [-1, 512, 15, 15]               0
           Conv2d-62          [-1, 512, 15, 15]           4,608
      BatchNorm2d-63          [-1, 512, 15, 15]           1,024
            ReLU6-64          [-1, 512, 15, 15]               0
           Conv2d-65          [-1, 512, 15, 15]         262,144
      BatchNorm2d-66          [-1, 512, 15, 15]           1,024
            ReLU6-67          [-1, 512, 15, 15]               0
           Conv2d-68          [-1, 512, 15, 15]           4,608
      BatchNorm2d-69          [-1, 512, 15, 15]           1,024
            ReLU6-70          [-1, 512, 15, 15]               0
           Conv2d-71          [-1, 512, 15, 15]         262,144
      BatchNorm2d-72          [-1, 512, 15, 15]           1,024
            ReLU6-73          [-1, 512, 15, 15]               0
    ConstantPad2d-74          [-1, 512, 16, 16]               0
           Conv2d-75            [-1, 512, 7, 7]           4,608
      BatchNorm2d-76            [-1, 512, 7, 7]           1,024
            ReLU6-77            [-1, 512, 7, 7]               0
           Conv2d-78           [-1, 1024, 7, 7]         524,288
      BatchNorm2d-79           [-1, 1024, 7, 7]           2,048
            ReLU6-80           [-1, 1024, 7, 7]               0
           Conv2d-81           [-1, 1024, 7, 7]           9,216
      BatchNorm2d-82           [-1, 1024, 7, 7]           2,048
            ReLU6-83           [-1, 1024, 7, 7]               0
           Conv2d-84           [-1, 1024, 7, 7]       1,048,576
      BatchNorm2d-85           [-1, 1024, 7, 7]           2,048
            ReLU6-86           [-1, 1024, 7, 7]               0
        AvgPool2d-87           [-1, 1024, 1, 1]               0
           Conv2d-88           [-1, 1000, 1, 1]       1,025,000
================================================================
Total params: 4,231,976
Trainable params: 4,231,976
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.68
Forward/backward pass size (MB): 149.97
Params size (MB): 16.14
Estimated Total Size (MB): 166.79
----------------------------------------------------------------

cc: @kartikdutt18

Aakash-kaushik commented 3 years ago

I believe I figured this out: I wasn't accounting for the separable convolutions. Their being a single layer in PyTorch led me to believe that it was the same for us.
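As a sanity check on the numbers posted above (plain arithmetic on the two listings, not a statement about how either library counts parameters internally): the gap between the two totals equals the sum of the `Conv bias` and `Sepa Conv bias` entries in the mlpack listing, leaving out the final 1000-class layer, whose bias the PyTorch summary also counts (1,024,000 + 1,000 = 1,025,000).

```python
# Bias counts copied from the mlpack listing above, in order of appearance.
# The final 1000-way classifier bias is left out on purpose.
conv_bias = [32, 64, 128, 128, 256, 256,
             512, 512, 512, 512, 512, 512, 1024, 1024]
sepa_conv_bias = [32, 64, 128, 128, 256, 256,
                  512, 512, 512, 512, 512, 512, 1024]

mlpack_total = 4242920    # first line of the mlpack output
pytorch_total = 4231976   # "Total params" in the PyTorch summary

gap = mlpack_total - pytorch_total
extra_bias = sum(conv_bias) + sum(sepa_conv_bias)

print(gap)         # 10944
print(extra_bias)  # 10944
```

So the extra parameters are consistent with bias terms that exist on the mlpack side of those layers but not in the PyTorch model.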

kartikdutt18 commented 3 years ago

Hey, did you figure out the issue with the parameters count mismatch?

Aakash-kaushik commented 3 years ago

Yup, I did. I was just not accounting for the separable conv layers' bias. Btw, would it be possible for you to have a meeting today?

Aakash-kaushik commented 3 years ago

The reason I want to do this is that at this point all the weights match layer by layer, but the output still differs. I have checked all the parameters and layers in the model and they seem to be the same too. So if you guys (@zoq, @kartikdutt18) are free today, I can take you through it. This is pretty much the last step: everything is in place and the weights load correctly, so I just want to figure out why the output doesn't match.

kartikdutt18 commented 3 years ago

Sure, can you share the layer-wise output of PyTorch and Convolution first, though? It should help in debugging faster.

Aakash-kaushik commented 3 years ago

You basically want the output of every layer as the network proceeds, for both the PyTorch and the mlpack model, right?

kartikdutt18 commented 3 years ago

Yes.

Aakash-kaushik commented 3 years ago

Give me some time, I need to write that.

Aakash-kaushik commented 3 years ago

Hey @kartikdutt18, here are the outputs.

PyTorch:

conv2d:  tensor(-45860.0547, grad_fn=<SumBackward0>)
batchnorm:  tensor(550614.2500, grad_fn=<SumBackward0>)
conv2d:  tensor(373389.6875, grad_fn=<SumBackward0>)
batchnorm:  tensor(561251.7500, grad_fn=<SumBackward0>)
conv2d:  tensor(224199.2969, grad_fn=<SumBackward0>)
batchnorm:  tensor(1017579.6250, grad_fn=<SumBackward0>)
conv2d:  tensor(-415788.6875, grad_fn=<SumBackward0>)
batchnorm:  tensor(292581.5625, grad_fn=<SumBackward0>)
conv2d:  tensor(-2640.6738, grad_fn=<SumBackward0>)
batchnorm:  tensor(497625.2188, grad_fn=<SumBackward0>)
conv2d:  tensor(138362.6562, grad_fn=<SumBackward0>)
batchnorm:  tensor(423388.5625, grad_fn=<SumBackward0>)
conv2d:  tensor(227647.0625, grad_fn=<SumBackward0>)
batchnorm:  tensor(254633.2188, grad_fn=<SumBackward0>)
conv2d:  tensor(-124248.6406, grad_fn=<SumBackward0>)
batchnorm:  tensor(139964.7969, grad_fn=<SumBackward0>)
conv2d:  tensor(-43954.3633, grad_fn=<SumBackward0>)
batchnorm:  tensor(250448.5625, grad_fn=<SumBackward0>)
conv2d:  tensor(-96189.6328, grad_fn=<SumBackward0>)
batchnorm:  tensor(90505.0078, grad_fn=<SumBackward0>)
conv2d:  tensor(13347.8633, grad_fn=<SumBackward0>)
batchnorm:  tensor(-4256.4902, grad_fn=<SumBackward0>)
conv2d:  tensor(-81547.1953, grad_fn=<SumBackward0>)
batchnorm:  tensor(51608.6328, grad_fn=<SumBackward0>)
conv2d:  tensor(-42817.5469, grad_fn=<SumBackward0>)
batchnorm:  tensor(65563.0391, grad_fn=<SumBackward0>)
conv2d:  tensor(-1870.4598, grad_fn=<SumBackward0>)
batchnorm:  tensor(-101.1597, grad_fn=<SumBackward0>)
conv2d:  tensor(-2912.0908, grad_fn=<SumBackward0>)
batchnorm:  tensor(57431.2852, grad_fn=<SumBackward0>)
conv2d:  tensor(33256.4023, grad_fn=<SumBackward0>)
batchnorm:  tensor(32491.9473, grad_fn=<SumBackward0>)
conv2d:  tensor(-12349.6445, grad_fn=<SumBackward0>)
batchnorm:  tensor(39660.7344, grad_fn=<SumBackward0>)
conv2d:  tensor(63287.2969, grad_fn=<SumBackward0>)
batchnorm:  tensor(54419.5859, grad_fn=<SumBackward0>)
conv2d:  tensor(14946.5059, grad_fn=<SumBackward0>)
batchnorm:  tensor(21691.9277, grad_fn=<SumBackward0>)
conv2d:  tensor(5632.3882, grad_fn=<SumBackward0>)
batchnorm:  tensor(57606.0273, grad_fn=<SumBackward0>)
conv2d:  tensor(53479.3750, grad_fn=<SumBackward0>)
batchnorm:  tensor(33929.6406, grad_fn=<SumBackward0>)
conv2d:  tensor(2767.0325, grad_fn=<SumBackward0>)
batchnorm:  tensor(81392.2188, grad_fn=<SumBackward0>)
conv2d:  tensor(-8610.2500, grad_fn=<SumBackward0>)
batchnorm:  tensor(-16468.8066, grad_fn=<SumBackward0>)
conv2d:  tensor(-11710.6553, grad_fn=<SumBackward0>)
batchnorm:  tensor(33556.3164, grad_fn=<SumBackward0>)
conv2d:  tensor(-29153.9375, grad_fn=<SumBackward0>)
batchnorm:  tensor(-90239.8125, grad_fn=<SumBackward0>)
conv2d:  tensor(-15378.9824, grad_fn=<SumBackward0>)
batchnorm:  tensor(22602.9609, grad_fn=<SumBackward0>)
conv2d:  tensor(-34095.0234, grad_fn=<SumBackward0>)
batchnorm:  tensor(-400205.2500, grad_fn=<SumBackward0>)
conv2d:  tensor(-18.0947, grad_fn=<SumBackward0>)

mlpack:

conv output: -46167.9
batchnorm output: 550852
Sepa conv output: 379809
batchnorm output: 562477
conv output: 225748
batchnorm output: 1.02108e+06
Sepa conv output: -421790
batchnorm output: 291589
conv output: -2500.69
batchnorm output: 497002
Sepa conv output: 139429
batchnorm output: 423419
conv output: 226668
batchnorm output: 253268
Sepa conv output: -125061
batchnorm output: 139577
conv output: -43436.5
batchnorm output: 250904
Sepa conv output: -100578
batchnorm output: 88872.3
conv output: 10999.1
batchnorm output: -8174.28
Sepa conv output: -85056.2
batchnorm output: 50544.5
conv output: -42077.7
batchnorm output: 66331
Sepa conv output: -681.53
batchnorm output: 213.26
conv output: -6286.62
batchnorm output: 52974.5
Sepa conv output: 26483.3
batchnorm output: 28276.2
conv output: -17524.1
batchnorm output: 32793.3
Sepa conv output: 43289.6
batchnorm output: 43546.6
conv output: 13107.5
batchnorm output: 19819.6
Sepa conv output: -8705.09
batchnorm output: 49713.2
conv output: 42102.4
batchnorm output: 17254.5
Sepa conv output: -18566
batchnorm output: 67083
conv output: -13169.4
batchnorm output: -21310.4
Sepa conv output: -15330.3
batchnorm output: 31669.6
conv output: -20238.7
batchnorm output: -71215.8
Sepa conv output: -17063.4
batchnorm output: 20837
conv output: -29880.1
batchnorm output: -359688
mean pooling output: 85.2322
conv output: -23.8471

Aakash-kaushik commented 3 years ago

I do see that all the outputs are off by a margin, but I can't figure out a specific reason.
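One way to put a number on "off by a margin" is the relative difference per layer. The sketch below uses only the first four sums from each listing above; the helper name `relative_difference` is made up for illustration.

```python
# First four layer-wise sums copied from the two listings above.
pytorch_sums = [-45860.0547, 550614.2500, 373389.6875, 561251.7500]
mlpack_sums = [-46167.9, 550852.0, 379809.0, 562477.0]
layer_names = ["conv", "batchnorm", "sepa conv", "batchnorm"]


def relative_difference(reference, value):
    """Absolute difference scaled by the magnitude of the reference."""
    return abs(value - reference) / abs(reference)


rel_diffs = [relative_difference(p, m)
             for p, m in zip(pytorch_sums, mlpack_sums)]

for name, diff in zip(layer_names, rel_diffs):
    print(f"{name}: {diff:.4%}")
```

The very first convolution already differs by more than a tenth of a percent, which is likely too much to blame on floating-point rounding alone and hints at a difference in how that layer is set up rather than at accumulated error.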

kartikdutt18 commented 3 years ago

Can you add a test for separable convolution where num_groups = in_size? I think in MobileNet the number of groups is equal to the number of input channels.

Aakash-kaushik commented 3 years ago

Yup, that is true, but all the groups are already equal to in_size, as you can see in the code. Btw, the first layer in both models is a simple conv layer, but its output differs as well.

Aakash-kaushik commented 3 years ago

Hey, wait, so I tried what you suggested and the outputs do seem to match a bit. Let me take a closer look and update you.

Aakash-kaushik commented 3 years ago

Nice, they do match up now !! :partying_face: 16 pretrained models, coming right up :100:

kartikdutt18 commented 3 years ago

Was padding the issue? Because 225 in PyTorch changes to 112, and in ours, I guess, 224 or 226 changes to 112.

Aakash-kaushik commented 3 years ago

Yes, that was exactly the issue.
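The usual output-size formula shows why the shapes alone could not reveal this. A rough sketch, assuming a 3x3 kernel with stride 2 for the first convolution; which side each library pads is an assumption here, since the comments only give the sizes:

```python
def conv_output_size(input_size, kernel_size, stride, pad_before=0, pad_after=0):
    """Spatial output size of a convolution along one dimension."""
    padded = input_size + pad_before + pad_after
    return (padded - kernel_size) // stride + 1


KERNEL, STRIDE = 3, 2  # assumed first-layer kernel size and stride

# Pad one pixel on one side only: 224 -> 225 -> 112.
asymmetric = conv_output_size(224, KERNEL, STRIDE, pad_before=0, pad_after=1)
# Pad one pixel on both sides: 224 -> 226 -> 112.
symmetric = conv_output_size(224, KERNEL, STRIDE, pad_before=1, pad_after=1)
# No padding at all: 224 -> 111.
unpadded = conv_output_size(224, KERNEL, STRIDE)

print(asymmetric, symmetric, unpadded)  # 112 112 111
```

Padding to 225 and padding to 226 both produce 112 outputs, but the sliding windows start one pixel apart, so every output value can differ even though the shapes agree.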

Aakash-kaushik commented 3 years ago

For now there is a segfault in the tests, and Ryan also has to upload the models. After these two things it should be good to go. Until then, maybe you can review the code for any changes that you would like.

Aakash-kaushik commented 3 years ago

Ryan uploaded the models and I fixed the segfault in the tests, so this is ready for review.

Aakash-kaushik commented 3 years ago

Some tests for ResNet have actually failed, I think. I will take a look tomorrow.

Aakash-kaushik commented 3 years ago

All the pretrained ResNet test cases fail. This also shows up in the CI, but for some reason it is not reported.

Downloading resnet18.bin to /home/aakash/.cache/mlpack/models/weights/resnet/resnet18.bin
######################################################################### 100.0%

error: arma::memory::acquire(): out of memory
-------------------------------------------------------------------------------
PreTrainedResNetModelTest
-------------------------------------------------------------------------------
/home/aakash/models/tests/ffn_model_tests.cpp:154
...............................................................................

/home/aakash/models/tests/ffn_model_tests.cpp:154: FAILED:
due to unexpected exception with message:
  std::bad_alloc

Downloading resnet101.bin to /home/aakash/.cache/mlpack/models/weights/resnet/resnet101.bin
######################################################################### 100.0%

error: arma::memory::acquire(): out of memory
-------------------------------------------------------------------------------
PreTrainedResNet101ModelTest
-------------------------------------------------------------------------------
/home/aakash/models/tests/ffn_model_tests.cpp:176
...............................................................................

/home/aakash/models/tests/ffn_model_tests.cpp:176: FAILED:
due to unexpected exception with message:
  std::bad_alloc

Do you guys have any idea why this is?

Aakash-kaushik commented 3 years ago

Hey @zoq, were you able to find anything on the ResNet error?

zoq commented 3 years ago

On my local system I get:

/home/marcus/code/models-aakash/tests/ffn_model_tests.cpp:176: FAILED:
due to unexpected exception with message:
  Invalid 'which' selector when deserializing boost::variant

when I try to run the test. Do you know if you saved the model after https://github.com/mlpack/mlpack/commit/96703ce69d67093220d78ba0756b71fca99b9fc8 was merged or before?

zoq commented 3 years ago

We have the same issue if we check the latest build on the master branch:

https://dev.azure.com/mlpack/mlpack/_build/results?buildId=6873&view=logs&j=24d3abe3-ef0b-5deb-3aab-64d839de2c3c&t=b3ac0536-a54b-5ea6-5d9c-1dbe7a03dc98

zoq commented 3 years ago

Wondering if /home/aakash/.cache/mlpack/models/weights/resnet/resnet18.bin is different from the version we download from mlpack.org.

Aakash-kaushik commented 3 years ago

I can delete the files from there and let it download them again; let's see what I get then.
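Besides re-downloading, a checksum comparison would settle whether the cached file and a fresh download are byte-for-byte identical. A minimal sketch using only the standard library; the helper names are made up, and the second path in the usage note is hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_contents(path_a, path_b):
    """True when both files have identical SHA-256 digests."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```

For example, `same_file_contents("/home/aakash/.cache/mlpack/models/weights/resnet/resnet18.bin", "fresh/resnet18.bin")` would compare the cached copy against a newly downloaded one saved under a hypothetical `fresh/` directory.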

Aakash-kaushik commented 3 years ago

So, on my local machine I still get the same error when I download the models:

aakash (mobilenet) ~/models/build $ ./bin/models_test -s PreTrainedResNetModelTest
Filters: PreTrainedResNetModelTest
Downloading resnet18.bin to /home/aakash/.cache/mlpack/models/weights/resnet/resnet18.bin
############################################################################################################################################################################################################ 100.0%

error: arma::memory::acquire(): out of memory

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
models_test is a Catch v2.13.3 host application.
Run with -? for options

-------------------------------------------------------------------------------
PreTrainedResNetModelTest
-------------------------------------------------------------------------------
/home/aakash/models/tests/ffn_model_tests.cpp:154
...............................................................................

/home/aakash/models/tests/ffn_model_tests.cpp:154: FAILED:
due to unexpected exception with message:
  std::bad_alloc

===============================================================================
test cases: 1 | 1 failed
assertions: 1 | 1 failed

and not the one thrown by the CI or the one on your system.

zoq commented 3 years ago

You built against git clone --depth 1 -b depthwise https://github.com/Aakash-kaushik/mlpack.git, right? No other changes? I tested different commits and I get the same error I referenced above. But since the MobileNet model runs fine, let's merge this one and solve the issue in another PR.

Aakash-kaushik commented 3 years ago

Yup, that's the branch. Sure, I am going to debug this and open a new PR for it.

Aakash-kaushik commented 3 years ago

Hey @zoq, we changed the ResNet architecture after we created the models, but do you think that would affect this? These bin files are the complete architecture/weights in themselves and don't rely on the architecture, right?

zoq commented 3 years ago

Yes, but there is something we are missing, so maybe it's a good idea to export the model again just to make sure that's not the problem.

Aakash-kaushik commented 3 years ago

I have some things in mind; let's discuss them in the meeting that we are having right now.

Aakash-kaushik commented 3 years ago

@zoq, @kartikdutt18 the ResNet issue seems to be because we changed the architecture afterwards, maybe. Btw, does anyone remember what we used to enable the log input? I deleted that file by mistake and don't remember it anymore.

zoq commented 3 years ago

Do you mean:

mlpack::Log::Info.ignoreInput = false;
mlpack::Log::Warn.ignoreInput = false;

Aakash-kaushik commented 3 years ago

No no, the ResNet architecture itself. I believe it's just an extra identity layer, but it still worked after that change before. I am not really sure; I am going to post the new models today.

Aakash-kaushik commented 3 years ago

Oh, totally sorry, yes, I meant this. I believe I wasn't fully awake when I replied.

zoq commented 3 years ago

https://dev.azure.com/mlpack/mlpack/_build/results?buildId=6993&view=results

Aakash-kaushik commented 3 years ago

Thanks for running it. Btw, they are queued for now.

zoq commented 3 years ago

Right, we have to wait a little bit.

Aakash-kaushik commented 3 years ago

All the tests passed except this one on Windows.

-------------------------------------------------------------------------------
PathExistsTest
-------------------------------------------------------------------------------
D:\a\1\s\tests\utils_tests.cpp(52)
...............................................................................

D:\a\1\s\tests\utils_tests.cpp(55): FAILED:
  REQUIRE( Utils::PathExists("./../../tests/CMakeLists.txt") == true )
with expansion:
  false == true

I think this is just because of the relative path. Do we want to make any changes specific to Windows? Since we no longer run the tests from bin or tests on Windows, this test fails.
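To illustrate why the working directory matters for this check, here is a small sketch that resolves the same relative path from two different directories. The `/repo/...` layout is hypothetical, and `posixpath` is used only to keep the result independent of the platform running the sketch.

```python
import posixpath

RELATIVE_TARGET = "./../../tests/CMakeLists.txt"


def resolve_from(working_dir, relative_path):
    """Resolve a relative path against a given working directory."""
    return posixpath.normpath(posixpath.join(working_dir, relative_path))


# Hypothetical checkout layout, for illustration only.
from_bin = resolve_from("/repo/build/bin", RELATIVE_TARGET)
from_build = resolve_from("/repo/build", RELATIVE_TARGET)

print(from_bin)    # /repo/tests/CMakeLists.txt
print(from_build)  # /tests/CMakeLists.txt
```

The path only lands on the repository's `tests` directory when the test binary is started two levels below the repository root, so a run from any other directory fails the check.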