microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Reshape `requested_shape` forced to have leading dimension 1 when it should be -1 #6424

Open nietras opened 3 years ago

nietras commented 3 years ago

Describe the bug Trying to run models with a dynamic leading dimension (batch size) fails with ONNX Runtime as soon as the batch size != 1, e.g. when set to 2. I have tried this with many models, and isolated it to also occur on the attached simple MNIST model (see below).

mnist-8-dynamic-leading-dimension.zip

The problem is that as soon as one calls Run on this model with a leading dimension of 2, it fails as shown below:

 [E:onnxruntime:, sequential_executor.cc:334 onnxruntime::SequentialExecutor::Execute] 
  Non-zero status code returned while running Reshape node. Name:'Times212_reshape0' 
  Status Message: D:\oss\onnxruntime\onnxruntime\core/providers/cpu/tensor/reshape_helper.h:43 
  onnxruntime::ReshapeHelper::ReshapeHelper gsl::narrow_cast<int64_t>(input_shape.Size()) == size was false. 
    The input tensor cannot be reshaped to the requested shape. 
    Input shape:{2,16,4,4}, requested shape:{1,256}

From the code for this (see https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/core/providers/cpu/tensor/reshape_helper.h) it appears that requested_shape is forced to {1, 256} when it should be {-1, 256}, and I don't know why.
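For context, the check that fires here is short. Below is a rough Python paraphrase of what a reshape helper like reshape_helper.h does with a -1 entry (a sketch of the logic, not the actual C++ implementation):

```python
def resolve_reshape(input_shape, requested_shape):
    """Resolve a single -1 ("inferred") dimension in requested_shape so the
    total element count matches input_shape; raise if the counts disagree."""
    known_size = 1
    unknown_idx = -1  # index of the -1 entry, if any
    for i, d in enumerate(requested_shape):
        if d == -1:
            if unknown_idx != -1:
                raise ValueError("at most one dimension may be -1")
            unknown_idx = i
        else:
            known_size *= d

    total = 1
    for d in input_shape:
        total *= d

    resolved = list(requested_shape)
    if unknown_idx != -1:
        resolved[unknown_idx] = total // known_size
        known_size *= resolved[unknown_idx]

    if known_size != total:
        raise ValueError(
            f"The input tensor cannot be reshaped to the requested shape. "
            f"Input shape:{input_shape}, requested shape:{requested_shape}")
    return resolved
```

With requested_shape = [-1, 256] and input {2,16,4,4} this resolves to [2, 256]; with [1, 256] it raises the mismatch error, which matches the message in the log above.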

Urgency I cannot run with batch size other than 1 due to this.

Expected behavior Running with any batch size works when model has dynamic batch size.

snnn commented 3 years ago

Hi @nietras, "-1" means "unknown" or "any". But when you run the model with inputs, the shape of every tensor must be known. In this case it is {2,16,4,4}, which has 512 elements, and that can't be reshaped to {1,256}.
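The element-count mismatch described here can be reproduced with plain NumPy, where -1 plays the same "inferred" role as in the ONNX Reshape op:

```python
import numpy as np

x = np.zeros((2, 16, 4, 4), dtype=np.float32)  # 2*16*4*4 = 512 elements
assert x.size == 512

# A fixed target of (1, 256) only holds 256 elements, so it must fail:
try:
    x.reshape(1, 256)
    raised = False
except ValueError:
    raised = True
assert raised

# With -1 as a wildcard, the leading dimension is inferred as 512/256 = 2:
y = x.reshape(-1, 256)
assert y.shape == (2, 256)
```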

nietras commented 3 years ago

@snnn the whole point of the leading dimension being "any" is that you can run with a batch size of 2. How else are you supposed to run with different batch sizes?

That's also how models are authored for training and then exported.

nietras commented 3 years ago

@snnn requested_shape is the shape of an output tensor whose size is not yet determined, and it has been set to {-1,256}, which the reshape code is clearly supposed to handle. Again, isn't that the whole point?

Saying that the shapes of all tensors must be known contradicts the code of the reshape method, doesn't it? :)

snnn commented 3 years ago

I believe the model is wrong.

The model has the following input:

name: "Input3"
type {
  tensor_type {
    elem_type: 1
    shape {
      dim {
        dim_value: -1
      }
      dim {
        dim_value: 1
      }
      dim {
        dim_value: 28
      }
      dim {
        dim_value: 28
      }
    }
  }
}

Internally, onnxruntime treats -1 as unknown, i.e. a dynamic size. But ONNX does not: dim_value shouldn't be "-1". Please read https://github.com/onnx/onnx/blob/master/docs/IR.md#static-tensor-shapes

The model doesn't conform to the ONNX standard. It is likely a problem with the CNTK exporter. However, as CNTK is already in maintenance mode, it will be hard for you to get a fix. I suggest considering other trainers instead. For example, the MNIST model exported from TF works pretty well: https://github.com/tensorflow/models/tree/master/official/vision/image_classification#mnist

snnn commented 3 years ago

The MNIST model in the ONNX model zoo was converted from CNTK; it looks the same as yours except that it doesn't support mini-batches. If you need that support, please try contacting the model owner (https://github.com/onnx/models), who may be able to help and knows the model better than me.

I'll close this issue because I think it is a converter issue, not an onnxruntime issue.

nietras commented 3 years ago

@snnn this is not helpful. As I wrote in the issue:

I have tried this with many models,

This is not just this specific model; I chose it because it is small and hence easy to use. You can set the leading dimension to have dim_param = "None", "Whatever", or just not be set at all, as is clearly defined in the ONNX format https://github.com/onnx/onnx/blob/master/docs/IR.md:

Each size in the list MAY be expressed as an integral value or as a "dimension variable," a string denoting that the actual size of the dimension is not statically constrained to a particular number. This is useful for declaring interfaces that care about the number of dimensions, but not the exact size of each dimension. A dimension MAY have neither dim_value nor dim_param set. Such a dimension represents an unknown dimension unrelated to other unknown dimensions.

This model just had -1 since that seemed to best match what ONNX Runtime expected, but you can set it to "None" or leave neither dim_value nor dim_param set, and it still does not work. This is not a converter issue. Now this issue gets closed without any of my questions being answered:

Please give an example of a model file that works for running with different batch sizes?

pranav-prakash commented 3 years ago

@nietras There's an example of an MNIST conv model with variable batch size at orttraining/tools/mnist_model_builder/mnist_conv_builder.ipynb

which I have successfully tested and it works fine. That model formats the unknown dimension as:

name: "T6"
type {
  tensor_type {
    elem_type: 1
    shape {
      dim {
        dim_value: -1
      }
      dim {
        dim_value: 16
      }
      dim {
        dim_value: 4
      }
      dim {
        dim_value: 4
      }
    }
  }
}

I'm not sure why your particular case doesn't work, since the reshape op does handle the case with an unknown dimension. Maybe try setting a breakpoint in there?