Error: Operands could not be broadcast together with shapes 1,150,150,32 and 0.

msektrier commented 5 years ago

To get help from the community, we encourage using Stack Overflow and the tensorflow.js tag.

TensorFlow.js version

tjfs version : 1.0.1 but happens also with 0.8.5

Browser version

Chrome Version 74.0.3729.131 (Offizieller Build) (64-Bit)

Describe the problem or feature request

We simplified the standard ssdlite_mobilenet_v2 model from tensorflow zoo repository. Tfjs converter transformation is successful using Tfjs converter 1.0.1 and 0.8.5. Moreover, in Python we can do inferences with this model.

ssdlite_v2_1class (tfjs converter 085 not working).zip

custom_ssdlite_mobilenetv2_1class.zip

When we try to apply it in JavaScript we an error as stated below.

Code to reproduce the bug / link to feature request

Loading it in Javascript with for the model from converter version 1.0.1:

<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@1.1.2/dist/tf.min.js"></script>

    const modelUrl = 'tf/model.json';
    const INPUT_TENSOR = 'image_tensor';
    const OUTPUT_TENSOR = 'detection_boxes'; 
    const model = await tf.loadGraphModel(modelUrl);

    const dummy = tf.zeros([1, 224, 224, 3], 'int32');
    //let preparedImage = preprocess(img);
  let test = await model.executeAsync( {[INPUT_TENSOR]: dummy}, OUTPUT_TENSOR);

will result in chrome and firefox in

broadcast_util.ts:81 Uncaught (in promise) Error: Operands could not be broadcast together with shapes 1,150,150,32 and 0. at Ur (broadcast_util.ts:81) at new ii (batchnorm_packed_gpu.ts:32) at e.batchNormalization (backend_webgl.ts:926) at Ho.It.runKernel.$x (batchnorm.ts:346) at engine.ts:441 at t.scopedRun (engine.ts:383) at t.runKernel (engine.ts:438) at Ho (batchnorm.ts:345) at batchNorm (operation.ts:45) at Rv (normalization_executor.ts:31)

msektrier commented 5 years ago

Do you need further input from my side or can I support?

pyu10055 commented 5 years ago

@dsmilkov do you mind taking a look at this error?

dsmilkov commented 5 years ago

Thanks! Will take a look today

dsmilkov commented 5 years ago

@msektrier Did you edit the model.json in any way? The problem stems from the fact that the mean parameter in ops like FeatureExtractor/MobilenetV2/Conv/BatchNorm/FusedBatchNorm has an invalid shape [0]. The tensor with the invalid shape is FeatureExtractor/MobilenetV2/expanded_conv_15/expand/BatchNorm/Const.

In python, if I try to provide a mean and variance with shape [0] in tf.nn.batch_normalization:

tf.nn.batch_normalization(
    tf.zeros([1, 150, 150, 32], 'float32'), # x
    tf.constant([], 'float32', [0]), # mean
    tf.constant([], 'float32', [0]), # variance
    offset=None,
    scale=None,
    variance_epsilon=1e-7    
)

I get an error Incompatible shapes: [1,150,150,32] vs. [0] [Op:Mul] name: batchnorm/mul/.

Can you share the saved model (.pb file) that you used to convert the model?

dsmilkov commented 5 years ago

So I found another signal that points to the cause. Looking at the FusedBatchNorm ops in model.json

{
  "input": [
    "FeatureExtractor/MobilenetV2/Conv/Conv2D",
    "FeatureExtractor/MobilenetV2/Conv/BatchNorm/gamma",
    "FeatureExtractor/MobilenetV2/Conv/BatchNorm/beta",
    "FeatureExtractor/MobilenetV2/expanded_conv_15/expand/BatchNorm/Const",
    "FeatureExtractor/MobilenetV2/expanded_conv_15/expand/BatchNorm/Const"
  ],
  "attr": {
    "epsilon": {
      "f": 0.0010000000474974513
    },
    "T": {
      "type": "DT_FLOAT"
    },
    "is_training": {
      "b": true
    },
    "data_format": {
      "s": "TkhXQw=="
    }
  },
  "name": "FeatureExtractor/MobilenetV2/Conv/BatchNorm/FusedBatchNorm",
  "op": "FusedBatchNorm"
},

is_training is set to True, but it should be False since you are trying to do inference. Make sure you provide the correct serving tags when exporting the model using our converter tool (--signature_name=serving_default --saved_model_tags=serve).

To see which signatures and tags exist , you can use the Saved Model CLI

Techniques like batchnorm and dropout require different graphs for inference vs training. See https://github.com/tensorflow/hub/issues/147#issuecomment-419929574 for more info.

msektrier commented 5 years ago

Dear dsmilkov,

I just executed the whole transformation again. I uploaded the saved model, frozen model, and the converted json model to https://share.infinkon.de/index.php/s/WYnXjG85cj4AZnp (44MBytes). The error remains.

According to our ML engineer she used the Tensorflow object detection API and she isn't aware about changes that lead to such problems.

Honestly, I cannot judge since I just want to do the inference :). I have just tried for testing purposes to execute the frozen and saved model with Tensorflow 2.0.0-alpha with python and failed. Unfortunately, the documentation of the API is not telling me how I can load the graph, transfer it to a model and do the inference.

I do really appreciate your help and your efforts a lot and it helps me to lower my disappointments about bad, missing or not up-to-date documentations.

Regarding your hint with the Saved Model CLI. I executed "show" and "scan". signatures are not showing and the tags are default I would say with "serve".

Many thanks for your support.

msektrier commented 5 years ago

Hello dsmilkov,

today, my engineer gave me a new converted model and this one runs....with further problems.

ssdlite_mobilenetV2_1class.zip - model.json only.

For the inference we have assured that the input tensor loaded in HTML is exactly the same like in our python environment. We load it simply by

let tensor = tf.browser.fromPixels(img);
expanded = tensor.expandDims(0);
return expanded

(in TFJS)

tf_image_string = tf.io.read_file(PATH_TO_IMAGE)
tf_image = tf.image.decode_jpeg(tf_image_string, channels=3, dct_method='INTEGER_ACCURATE')
tf_image_expanded1 = tf.expand_dims(tf_image, axis=0)

(in python)

...they are equal in size, shape and values.

Then we output the "detection_boxes" tensor and it gives us a different result in python and TFJS. Even more strange the first values are identical and then it goes crazy.

The results from python

[[[0.89474267 0.16380781 0.94830686 0.20310044] [0. 0.7445791 0.17719875 1. ] [0.64993775 0. 0.84936297 0.3427423 ] [0.40547025 0.22818083 0.895257 0.46889073] [0.6517975 0.5531899 0.8523899 1. ] [0.6154277 0.8415282 1. 1. ] [0.46409145 0. 1. 0.15208271] [0.83050364 0.6987988 1. 1. ] [0.14641336 0.02349633 0.3389939 0.6332496 ] [0.5142635 0.728901 0.9893209 0.9674072 ] [0.54869866 0.14797944 0.7521107 0.75144666] [0.4098649 0.13039494 0.8977258 0.37028956] [0. 0.357253 0.14110368 0.9779241 ] [0.65561855 0.648258 1. 0.84357405] [0.840054 0. 1. 0.31862792] [0.68168664 0.7677454 1. 1. ] [0.44844124 0.15256423 0.65252775 0.7499266 ] [0.62348706 0.50179255 0.875133 0.99561524] [0.74397695 0.6874038 0.94830465 1. ] [0.68101573 0.24813068 1. 0.4446543 ] [0.21217245 0.72950757 0.69929665 0.97...

the results in TJFS (done with arraySync())

: (4) [0.8947426080703735, 0.16380780935287476, 0.9483069181442261, 0.20310050249099731] 1: (4) [0.5229350328445435, 0.5015679001808167, 1, 1] 2: (4) [0.27248457074165344, 0.10805338621139526, 0.7271230220794678, 0.9670282006263733] 3: (4) [0.6002633571624756, 0.38063299655914307, 1, 1] 4: (4) [0, 0.31815987825393677, 1, 0.691149890422821] 5: (4) [0.3790368139743805, 0.6115804314613342, 1, 1] 6: (4) [0.12581798434257507, 0.1382976472377777, 0.8695082664489746, 0.8439497947692871] 7: (4) [0.3165704607963562, 0, 0.6834971308708191, 1] 8: (4) [0, 0.5953051447868347, 0.6202481389045715, 1] 9: (4) [0, 0.3122354745864868, 0.3565589487552643, 1] 10: (4) [0, 0.7331592440605164, 0.2514439523220062, 1] 11: (4) [0.18738055229187012, 0.5012880563735962, 0.8259190320968628, 1] 12: (4) [0, 0.845550537109375, 0.29175451397895813, 1] 13: (4) [0, 0.5361356735229492, 0.49668145179748535, 1] 14: (4) [0.05381292104721069, 0.6056756973266602, 1, 1] 15: (4) [0, 0.7074476480484009, 0.36408132314682007, 1]

As you can see the first 4 values are identical "0.89474267 0.16380781 0.94830686 0.20310044" but then they differ completely.

Do you have any indication for us with your experience?

many thanks

sandhyacs commented 5 years ago

@msektrier , I am also facing the same issue. Have you resolved this problem? i want to know why this is happening?

msektrier commented 5 years ago

@sandhyacs We have no idea and we do not investigate further. We hope some problems are fixed with the next release of TFJS. We assume these are TF/TFJS internals.

msektrier commented 5 years ago

With the yesterday's (17/06/2019) version of TFJS we receive a new error in the browser for the posted network.

However, our other ssdlite mobilenet V2 network is finally running. It is the first time we could successfully transfer this network from python to javascript. Many thanks for that!

dsmilkov commented 5 years ago

Thanks for filing a separate issue regarding the numerical precision which we will track at https://github.com/tensorflow/tfjs/issues/1652 (see a workaround in that issue).

In this current issue, I want to help figure out the runtime error you got in the latest version. Can you share with me the latest model.json, the weights, and how you call model.executeAsync()? Thanks!

msektrier commented 5 years ago

Dear dsmilkov,

I just tried to reproduce it with the given model and I cannot reproduce the error "cannot read property 'op' of undefined".

dsmilkov commented 5 years ago

Thanks, will close the issue.

mattstruble commented 5 years ago

@msektrier how did you get past your original problem of the js model raising the error Uncaught (in promise) Error: Operands could not be broadcast together with shapes 1,150,150,32 and 0.?

dsmilkov commented 5 years ago

@mattstruble Can you send us the model/steps so we can reproduce?

mattstruble commented 5 years ago

@dsmilkov thanks for the quick reply and offering to look into this!

Here is a link to the saved_model, frozen_inference_graph, and web_model.

My model was generated using the tensorflow/models/research/object_detection/model_main.py script in TF 1.13.1 using the following command:

python model_main.py --model_dir=output --pipeline_config_path=training\ssd_mobilenet_v2_coco.config -num_train_steps=200000

I then exported the inference graph.

python export_inference_graph.py --input_type=image_tensor --output_directory=output_inf --pipeline_config_path=training\ssd_mobilenet_v2_coco.config --trained_checkpoint_prefix=neg_32\model.ckpt-XXXX

Then converted it to tensorflowjs with tfjs version 1.2.2.1 (I still get the same error using 0.8.6).

tensorflowjs_converter --input_format=tf_saved_model --output_format=tfjs_graph_model --saved_model_tags=serve --signature_name=serving_default output_inf\saved_model output_inf\web_model

Then when I try running it in the web I get the broadcast shapes error.

import * as tf from '@tensorflow/tfjs';

class Detector {
    async init() {      
        try {
            this.model = await tf.loadGraphModel('/web_model/model.json');
        } catch (err) {
            console.log(err);
        }
    }

    async detect(frame) {
        const { model } = this;

        const INPUT_TENSOR='image_tensor';
        const OUTPUT_TENSOR='num_detections'
        const zeros = tf.zeros([1, 300, 300, 3]);

        console.log("executing model");
        output = await model.executeAsync({[INPUT_TENSOR]: zeros}, OUTPUT_TENSOR);
        console.log(output);
    }
}

export default Detector;

It seems to be the same issue @msektrier was having with the BatchNorm layers is_training=True. Which is why I originally asked him how he got around it, since I can't seem to find documentation on how to save the model with is_training=False.

Thanks for your support!

mattstruble commented 5 years ago

@dsmilkov not to be a bother, but I was wondering if you had any status on this issue?

Many thanks.

bibasco commented 5 years ago

@msektrier @dsmilkov, What was the solution to this problem or is there any known workaround? I'm having the same problem.

rthadur commented 5 years ago

@bibasco please open a new issue with all the details in template here . Thank you

tensorflow / tfjs