patlevin / tfjs-to-tf

A TensorFlow.js Graph Model Converter

Cannot convert model that uses float64 #19

Closed vladmandic closed 3 years ago

vladmandic commented 4 years ago

As the subject line says, tfjs_graph_converter fails when converting a model that uses numpy.float64.

TensorFlow.js Graph Model Converter

Graph model:    openimages-ssd-mobilenet-v2
Output:         openimages-ssd-mobilenet-v2.pb
Target format:  tf_frozen_model

Converting.... Traceback (most recent call last):
  File "/home/vlado/.local/bin/tfjs_graph_converter", line 8, in <module>
    sys.exit(pip_main())
  File "/home/vlado/.local/lib/python3.8/site-packages/tfjs_graph_converter/converter.py", line 202, in pip_main
    main([' '.join(sys.argv[1:])])
  File "/home/vlado/.local/lib/python3.8/site-packages/tfjs_graph_converter/converter.py", line 213, in main
    convert(argv[0].split(' '))
  File "/home/vlado/.local/lib/python3.8/site-packages/tfjs_graph_converter/converter.py", line 181, in convert
    api.graph_model_to_frozen_graph(args.input_path, args.output_path)
  File "/home/vlado/.local/lib/python3.8/site-packages/tfjs_graph_converter/api.py", line 349, in graph_model_to_frozen_graph
    graph = load_graph_model(model_dir)
  File "/home/vlado/.local/lib/python3.8/site-packages/tfjs_graph_converter/api.py", line 285, in load_graph_model
    graph, _ = load_graph_model_and_signature(model_dir)
  File "/home/vlado/.local/lib/python3.8/site-packages/tfjs_graph_converter/api.py", line 270, in load_graph_model_and_signature
    return _convert_graph_model_to_graph(model_json, model_path)
  File "/home/vlado/.local/lib/python3.8/site-packages/tfjs_graph_converter/api.py", line 240, in _convert_graph_model_to_graph
    weight_list = read_weights(weights_manifest, base_path, flatten=True)
  File "/home/vlado/.local/lib/python3.8/site-packages/tensorflowjs/read_weights.py", line 75, in read_weights
    return decode_weights(weights_manifest, data_buffers, flatten=flatten)
  File "/home/vlado/.local/lib/python3.8/site-packages/tensorflowjs/read_weights.py", line 188, in decode_weights
    value, offset = _deserialize_string_array(data_buffer, offset, shape)
  File "/home/vlado/.local/lib/python3.8/site-packages/tensorflowjs/read_weights.py", line 107, in _deserialize_string_array
    for _ in range(size):
TypeError: 'numpy.float64' object cannot be interpreted as an integer

The model in question is https://tfhub.dev/google/openimages_v4/ssd/mobilenet_v2/1

patlevin commented 4 years ago

This issue turned out to be quite the rabbit hole. First, the error has nothing to do with float64 values being used anywhere in the model - that's a red herring. The problem occurs because the TensorflowJS converter cannot handle rank-0 (string) tensors (a.k.a. scalars) at the moment and thus produces an invalid weight manifest for models that contain them.

In short, this is an error in the TensorflowJS converter that I cannot fix from this converter's side (I could submit a PR to the TFJS team, but I don't know how they handle rank-0 tensors internally; rank-0 tensors are handled just fine outside the converter).

To be perfectly honest, though, I don't really see a reason to fix this at the moment, since the model in question is a TF model to begin with and isn't even supported by TFJS, which brings me to the second issue.

The only way to even generate the (broken) TFJS graph model is to use the --skip_op_check flag, which turns off model validation and allows converting models that contain unsupported features (here: HashTableV2 and LookupTableFindV2).
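For reference, the conversion that produces such a (broken) graph model would look roughly like this - the TF-Hub URL is the model above, and the output directory name is arbitrary:

tensorflowjs_converter \
    --input_format=tf_hub \
    --skip_op_check \
    'https://tfhub.dev/google/openimages_v4/ssd/mobilenet_v2/1' \
    ./openimages-ssd-mobilenet-v2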

What's happening here, anyway?

Looking at the error location in the tensorflowjs package reveals that this problem occurs when tensorflowjs is trying to determine the size of a tensor given its shape [1] using numpy.prod().

The documentation reveals that

The product of an empty array is the neutral element 1:

>>> numpy.prod([])
1.0
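Feeding that float result into range() reproduces the TypeError from the traceback above:

>>> import numpy
>>> for _ in range(numpy.prod([])):
...     pass
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'numpy.float64' object cannot be interpreted as an integer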

This solves the mystery of our TypeError and the mention of numpy.float64 - the shape attribute of the tensor is empty, resulting in a float scalar instead of an integer size. Examining the model.json reveals that there are indeed quite a few tensors with empty shapes:

> grep -o '"shape": \[\],' model.json | wc -l
4358

How does this happen?

The TensorflowJS converter loads the model graph, runs some optimisations on it and finally removes all tensor data (e.g. constants and model weights) before saving it as JSON.

The removed tensor data is stored separately and in chunks. During this process, the converter turns TF tensors into numpy arrays and something interesting happens.

When writing the weights manifest [2], the converter writes the shape of the tensor by using ndarray's shape-attribute [3]. So far so good, but why is the shape of the ndarray empty? Let's see what numpy's documentation [4] has to say about it:

In general, numerical data arranged in an array-like structure in Python can be converted to arrays through the use of the array() function. [...] A simple way to find out if the object can be converted to a numpy array using array() is simply to try it interactively and see if it works! (The Python Way).

So let's do exactly that:

>>> import numpy as np
>>> a = np.array([])
>>> a.shape
(0,)

Hm, close, but no cigar. Passing an empty array doesn't work, since it generates a 1-d array of size zero, which is different from an empty shape. What about numerical data that is not arranged in an array-like structure in Python - like, say, a scalar value?

>>> a = np.array(42)
>>> a.shape
()
>>> a.size
1
>>> a.item(-1)
42

Bingo! That's it - if we pass a scalar (i.e. a rank-0 tensor) into numpy, the resulting object has an empty shape attached to it while still holding the given value. This means that models which contain rank-0 tensors won't convert properly to TFJS and instead produce broken output that cannot be read by the TFJS Python library (though it is handled correctly by TFJS itself and could be used in the browser or with nodejs).
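The same can be verified with a string scalar, which is presumably what this model stores thousands of (for its lookup tables):

>>> s = np.array(b'some label')
>>> s.shape
()
>>> s.item()
b'some label'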

How could this be fixed?

Given a model that's actually supported by TFJS 😉, the converter would work correctly if we changed the offending line [5] as follows:

  for _ in range(int(size)):

Why not avoid writing empty shape attributes in the weight manifest in the first place? Well, technically any change there would be wrong: a shape of (1,) would denote a rank-1 tensor, not a scalar, and (0,) wouldn't help either, since the tensor isn't empty. So an empty shape is the correct way to express this; it just needs to be handled correctly by the TFJS converter. Interestingly enough, numeric scalars are decoded correctly - the problem only occurs with strings.
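An equivalent fix would be to harden the size calculation itself so that it always yields an integer. A minimal sketch, assuming the size comes straight from numpy.prod() on the manifest's shape entry (tensor_size is a hypothetical helper, not the actual tensorflowjs function):

import numpy as np

def tensor_size(shape):
    # np.prod([]) returns the float 1.0 (a numpy.float64); forcing an
    # integer dtype makes the result safe to pass to range() for
    # scalars ([]) and higher-rank shapes alike.
    return int(np.prod(shape, dtype=np.int64))

assert tensor_size([]) == 1      # rank-0 tensor (scalar)
assert tensor_size([3, 2]) == 6  # rank-2 tensor

Either way, the fix is a one-liner.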

This is where our little journey ends, since applying this little fix only opens another trapdoor:

 ...
 File "███████████████/tensorflowjs/read_weights.py", line 186, in decode_weights
         raise NotImplementedError('Unsupported data type: %s' % dtype)
NotImplementedError: Unsupported data type: bool

Turns out, while TFJS supports bool, their Python converter doesn't. Now while that would be a worthwhile fix, the model in question uses a uint64 output, which is not supported by TFJS...

vladmandic commented 4 years ago

Great analysis!

Btw, originally I did have issues with the openimages-ssd-mobilenet-v2 model in TFJS (after converting it to a tfjs_graph_model). Those issues are solved in https://github.com/tensorflow/tfjs/issues/3823 (committed to master, not yet in a release). And the model works perfectly (and accurately) in TFJS - I cannot understand how, if uint64 is not supported in TFJS.

patlevin commented 4 years ago

And the model works perfectly (and accurately) in TFJS - I cannot understand how if uint64 is not supported in TFJS.

Well, you've said it yourself:

Those issues are solved in tensorflow/tfjs#3823 (committed to master, not yet in a release)

I only test things from the point of view of a user who installed this utility via pip. Anything that hasn't been released just yet, I cannot test or consider. This means that even if I fix things on my end, anyone who does a pip install -U tfjs_graph_converter won't benefit from those fixes as long as the TFJS team hasn't released their changes.

To confirm this, I compared the current git version to the latest release (v2.3.0 as of the time of writing) and found that the converter has indeed been updated. So until the tensorflowjs team drops another release, there's little I can do.

vladmandic commented 4 years ago

The previous issue with the model was due to the model pushing a weight tensor into a TensorArray which then got disposed; the fix was to keep track of it instead of disposing it.

Regarding uint64, my guess is that it's just a choice and the model's values don't really need it - it gets clipped and still works just fine.

But totally understand this is outside of the scope of what can be done here.

elivasquezhdz commented 4 years ago

I have a somewhat "similar" problem trying to convert the toxicity model: https://tfhub.dev/google/openimages_v4/ssd/mobilenet_v2/1 I wonder if this is related:

tfjs_graph_converter tftox model.pb

TensorFlow.js Graph Model Converter

Graph model:    tftox
Output:         model.pb
Target format:  tf_frozen_model

Converting.... Error: Input 1 of node module_apply_default/Encoder_en/KonaTransformer/ClipToMaxLength/Less was passed int32 from Const_56:0 incompatible with expected int64.

patlevin commented 4 years ago

@elivasquezhdz I opened a new issue for this problem. I assume you meant this model: toxicity.

The problem here is that the TFJS model does indeed hold all its intXX tensors converted to int32, while the actual processing nodes in the graph still expect int64 input. Strictly speaking, this is an error in TFJS, as it simply adjusts the data type during import.

TF, however, doesn't do any such magic tricks behind the scenes and processes graph nodes and their type annotations as-is. I'll consider this to be just another quirk and will add a work-around.
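For the curious, such a work-around could rewrite the offending Const nodes back to int64 in the graph definition. A rough sketch of the idea, not the actual code in this converter (widen_const_to_int64 is a hypothetical helper):

import numpy as np
import tensorflow as tf
from tensorflow.core.framework import graph_pb2, types_pb2

def widen_const_to_int64(graph_def: graph_pb2.GraphDef, node_name: str) -> None:
    # Hypothetical: rewrite a single int32 Const node to int64 so that
    # consumers expecting DT_INT64 inputs type-check again.
    for node in graph_def.node:
        if node.name == node_name and node.op == 'Const':
            values = tf.make_ndarray(node.attr['value'].tensor).astype(np.int64)
            node.attr['dtype'].type = types_pb2.DT_INT64
            node.attr['value'].tensor.CopyFrom(
                tf.make_tensor_proto(values, dtype=tf.int64))

In practice the affected nodes would first have to be found by scanning the graph for such dtype mismatches.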

elivasquezhdz commented 4 years ago

Yes! that's the one I meant, thanks for opening the issue and letting me know!

patlevin commented 4 years ago

@elivasquezhdz The toxicity-model issue is fixed now. I also ported the TFJS inference code to python for testing the results. If you're interested in that let me know and I'll add a repo for it.

vladmandic commented 4 years ago

@elivasquezhdz The toxicity-model issue is fixed now. I also ported the TFJS inference code to python for testing the results. If you're interested in that let me know and I'll add a repo for it.

Can you post a link to your patch? I'm having a similar issue where a model expects uint8 and TFJS passes int32 and i'd like to verify before opening another issue.

patlevin commented 4 years ago

@vladmandic Maybe I should clarify this a little. I only ported the code for using the toxicity-model from TypeScript to Python in order to verify that the converted model still works as intended.

I didn't touch any of the TFJS innards, just ported the model specific code to test it out.

vladmandic commented 4 years ago

@vladmandic Maybe I should clarify this a little. I only ported the code for using the toxicity-model from TypeScript to Python in order to verify that the converted model still works as intended.

I didn't touch any of the TFJS innards, just ported the model specific code to test it out.

Thanks for the clarification.

I've created an issue on TFJS to track the uint8 stuff - basically, TFJS maps any int value to int32 during load, and if a model is strict about types, it will later fail on execution.

patlevin commented 3 years ago

I fixed the Python converter in TFJS and the changes got merged into the master branch. The next TFJS release will contain the fix.