onnx / onnx-mlir

Representation and Reference Lowering of ONNX Models in MLIR Compiler Infrastructure
Apache License 2.0

Model zoo tests regression #1470

Open gongsu832 opened 2 years ago

gongsu832 commented 2 years ago

@tungld I was testing my model zoo build and noticed that the latest commit has 24 failed tests, 2 more than what you reported on https://github.com/onnx/onnx-mlir/issues/128#issuecomment-1128755672. The 2 tests are squeezenet1.1-7 and vgg16-bn-7. They both failed with the same error, "Field number 0 is illegal" (only showing squeezenet1.1-7):

[squeezenet1.1-7] Traceback (most recent call last):
  File "RunONNXModel.py", line 404, in <module>
    main()
  File "RunONNXModel.py", line 240, in main
    model = onnx.load(args.model_path)
  File "/usr/local/lib/python3.8/dist-packages/onnx-1.11.0-py3.8-linux-s390x.egg/onnx/__init__.py", line 121, in load_model
    model = load_model_from_string(s, format=format)
  File "/usr/local/lib/python3.8/dist-packages/onnx-1.11.0-py3.8-linux-s390x.egg/onnx/__init__.py", line 158, in load_model_from_string
    return _deserialize(s, ModelProto())
  File "/usr/local/lib/python3.8/dist-packages/onnx-1.11.0-py3.8-linux-s390x.egg/onnx/__init__.py", line 99, in _deserialize
    decoded = cast(Optional[int], proto.ParseFromString(s))
  File "/usr/local/lib/python3.8/dist-packages/protobuf-3.20.1-py3.8.egg/google/protobuf/message.py", line 202, in ParseFromString
    return self.MergeFromString(serialized)
  File "/usr/local/lib/python3.8/dist-packages/protobuf-3.20.1-py3.8.egg/google/protobuf/internal/python_message.py", line 1128, in MergeFromString
    if self._InternalParse(serialized, 0, length) != length:
  File "/usr/local/lib/python3.8/dist-packages/protobuf-3.20.1-py3.8.egg/google/protobuf/internal/python_message.py", line 1178, in InternalParse
    raise message_mod.DecodeError('Field number 0 is illegal.')
google.protobuf.message.DecodeError: Field number 0 is illegal.

I tried both protobuf 3.14.0 and 3.20.1 and the results are the same.
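
For context, the protobuf wire format starts each field with a varint tag equal to (field_number << 3) | wire_type, and field number 0 is reserved. This error therefore usually means the bytes being parsed do not begin with a valid tag. The sketch below is a standalone illustration of that rule, not the protobuf library's own parser, and the helper name `first_field_number` is hypothetical. It decodes the first tag of a byte string so a suspect file can be inspected without onnx installed.

```python
def first_field_number(data: bytes) -> int:
    """Decode the first varint tag in `data` and return its field number.

    Hypothetical helper for illustration; it follows the wire-format rule
    tag = (field_number << 3) | wire_type, not protobuf's own code.
    """
    if not data:
        raise ValueError("empty input")
    tag = 0
    shift = 0
    for byte in data[:10]:  # a varint is at most 10 bytes long
        tag |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear marks the last varint byte
            return tag >> 3  # drop the 3-bit wire type
        shift += 7
    raise ValueError("unterminated varint")
```

Passing the first few bytes of a file read in binary mode to this helper shows which field the parser would see first. A file whose first byte is 0x00 yields field number 0, which the parser would reject with the error above.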

tungld commented 2 years ago

I quickly tried squeezenet1.1-7 on a Z machine, and the model passed:

$ VERBOSE=1 ONNX_MLIR_HOME=/home/tungld/dl/onnx-mlir/build/Debug python CheckONNXModelZoo.py -m squeezenet1.1-7 -compile_args="-O3 --mcpu=z14"

There are 155 models in the ONNX model zoo where 31 models are not checked because of old opsets or quantization.
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Downloading https://github.com/onnx/models/raw/main/vision/classification/squeezenet/model/squeezenet1.1-7.tar.gz
Extracting the .tag.gz to /tmp/tmpj53qm9hx
Checking the model squeezenet1.1-7 ...
[squeezenet1.1-7] Temporary directory has been created at /tmp/tmp73w9upo6
Reading inputs from /tmp/tmpj53qm9hx/squeezenet1.1/test_data_set_2 ...
  - 1st input: [1x3x224x224xfloat32]
  done.

Compiling the model ...
/home/tungld/dl/onnx-mlir/build/Debug/bin/onnx-mlir -O3 --mcpu=z14 /tmp/tmp73w9upo6/model.onnx
  took  5.614736581221223  seconds.

Loading the compiled model ...
  took  0.00034265127032995224  seconds.

Running inference ...
  took  0.514042291790247  seconds.

Reading reference outputs from /tmp/tmpj53qm9hx/squeezenet1.1/test_data_set_2 ...
  - 1st output: [1x1000xfloat32]
  done.

Verifying value of squeezenet0_flatten0_reshape0:[1, 1000] using atol=0.01, rtol=0.05 ...
  correct.

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    8.8s finished
1 models tested: squeezenet1.1-7

1 models passed: squeezenet1.1-7

And this is my protobuf version:

protoc --version
libprotoc 3.17.1

It seems that the error is related to the protobuf version.

gongsu832 commented 2 years ago

Not quite for me. I actually have a build system with

# protoc --version
libprotoc 3.12.4

and I'm on commit 8e54f577d05ebd271f4f8b23ac9c3adfe682480e. This system gets only 21 failures. BTW, are you on z?

tungld commented 2 years ago

BTW, are you on z?

Yes, I am. I am using the last commit.

Looking at this in your log:

[squeezenet1.1-7] Traceback (most recent call last):
  File "RunONNXModel.py", line 404, in <module>
    main()
  File "RunONNXModel.py", line 240, in main
    model = onnx.load(args.model_path)

It failed at a very early stage, when loading the .onnx file using onnx (not onnx-mlir). It is probably something related to the onnx or protobuf package.

gongsu832 commented 2 years ago

Can you try this in our docker dev image? It gets 24 failures.

docker pull onnxmlirczar/onnx-mlir-dev
docker run --rm -ti onnxmlirczar/onnx-mlir-dev

Inside the container,

apt-get update && apt-get install wget
pip3 install joblib
git clone https://github.com/onnx/models
cd models
ln -sf ../onnx-mlir/utils/RunONNXModel.py
ln -sf ../onnx-mlir/test/onnx-model-zoo/CheckONNXModelZoo.py
VERBOSE=2 ONNX_MLIR_HOME=/workdir/onnx-mlir/build/Debug python3 CheckONNXModelZoo.py -m squeezenet1.1-7

I just tried protobuf 3.12.4, 3.14.0, 3.17.1, and 3.20.1; all of them fail.

tungld commented 2 years ago

@gongsu832 I did what you suggested using our docker dev image, and I got this:

root@95a4aa53b769:/workdir/models# VERBOSE=2 ONNX_MLIR_HOME=/workdir/onnx-mlir/build/Debug python3 CheckONNXModelZoo.py -m squeezenet1.1-7
find . -mindepth 2 -type f -name *.tar.gz

There are 155 models in the ONNX model zoo where 31 models are not checked because of old opsets or quantization.
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Downloading https://github.com/onnx/models/raw/main/vision/classification/squeezenet/model/squeezenet1.1-7.tar.gz
wget --no-check-certificate --timestamping https://github.com/onnx/models/raw/main/vision/classification/squeezenet/model/squeezenet1.1-7.tar.gz
Extracting the .tag.gz to /tmp/tmp5sjkx8tt
tar -xzvf ./squeezenet1.1-7.tar.gz -C /tmp/tmp5sjkx8tt
find /tmp/tmp5sjkx8tt -type f -name *.onnx
find /tmp/tmp5sjkx8tt -type d -name test_data_set*
Checking the model squeezenet1.1-7 ...
python RunONNXModel.py /tmp/tmp5sjkx8tt/squeezenet1.1/squeezenet1.1.onnx --compile_args=-O3 --verify=ref --data_folder=/tmp/tmp5sjkx8tt/squeezenet1.1/test_data_set_2
[squeezenet1.1-7] Temporary directory has been created at /tmp/tmpc0de48vj
Reading inputs from /tmp/tmp5sjkx8tt/squeezenet1.1/test_data_set_2 ...
  - 1st input: [1x3x224x224xfloat32]
  done.

Compiling the model ...
/workdir/onnx-mlir/build/Debug/bin/onnx-mlir -O3 /tmp/tmpc0de48vj/model.onnx
  took  3.346841878257692  seconds.

Loading the compiled model ...
  took  0.00033509451895952225  seconds.

Running inference ...
  took  0.7248765854164958  seconds.

Reading reference outputs from /tmp/tmp5sjkx8tt/squeezenet1.1/test_data_set_2 ...
  - 1st output: [1x1000xfloat32]
  done.

Verifying value of squeezenet0_flatten0_reshape0:[1, 1000] using atol=0.01, rtol=0.05 ...
  correct.

rm ./squeezenet1.1-7.tar.gz
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    7.1s finished
1 models tested: squeezenet1.1-7

1 models passed: squeezenet1.1-7

root@95a4aa53b769:/workdir/models# protoc --version
libprotoc 3.14.0

root@95a4aa53b769:/workdir/models# pip3 show protobuf
Name: protobuf
Version: 3.14.0
Summary: Protocol Buffers
Home-page: https://developers.google.com/protocol-buffers/
Author: None
Author-email: None
License: 3-Clause BSD License
Location: /usr/local/lib/python3.8/dist-packages/protobuf-3.14.0-py3.8.egg
Requires: six
Required-by: onnx

I didn't see any error.

gongsu832 commented 2 years ago

Now this is very weird.

root@bda611eb4b38:/workdir/models# VERBOSE=2 ONNX_MLIR_HOME=/workdir/onnx-mlir/build/Debug python3 CheckONNXModelZoo.py -m squeezenet1.1-7
find . -mindepth 2 -type f -name *.tar.gz

There are 155 models in the ONNX model zoo where 31 models are not checked because of old opsets or quantization.
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Downloading https://github.com/onnx/models/raw/main/vision/classification/squeezenet/model/squeezenet1.1-7.tar.gz
wget --no-check-certificate --timestamping https://github.com/onnx/models/raw/main/vision/classification/squeezenet/model/squeezenet1.1-7.tar.gz
Extracting the .tag.gz to /tmp/tmpld_si8ea
tar -xzvf ./squeezenet1.1-7.tar.gz -C /tmp/tmpld_si8ea
find /tmp/tmpld_si8ea -type f -name *.onnx
find /tmp/tmpld_si8ea -type d -name test_data_set*
Checking the model squeezenet1.1-7 ...
python RunONNXModel.py /tmp/tmpld_si8ea/squeezenet1.1/._squeezenet1.1.onnx --compile_args=-O3 --verify=ref --data_folder=/tmp/tmpld_si8ea/squeezenet1.1/test_data_set_0
[squeezenet1.1-7] Traceback (most recent call last):
  File "RunONNXModel.py", line 404, in <module>
    main()
  File "RunONNXModel.py", line 240, in main
    model = onnx.load(args.model_path)
  File "/usr/local/lib/python3.8/dist-packages/onnx-1.11.0-py3.8-linux-s390x.egg/onnx/__init__.py", line 121, in load_model
    model = load_model_from_string(s, format=format)
  File "/usr/local/lib/python3.8/dist-packages/onnx-1.11.0-py3.8-linux-s390x.egg/onnx/__init__.py", line 158, in load_model_from_string
    return _deserialize(s, ModelProto())
  File "/usr/local/lib/python3.8/dist-packages/onnx-1.11.0-py3.8-linux-s390x.egg/onnx/__init__.py", line 99, in _deserialize
    decoded = cast(Optional[int], proto.ParseFromString(s))
  File "/usr/local/lib/python3.8/dist-packages/protobuf-3.14.0-py3.8.egg/google/protobuf/message.py", line 199, in ParseFromString
    return self.MergeFromString(serialized)
  File "/usr/local/lib/python3.8/dist-packages/protobuf-3.14.0-py3.8.egg/google/protobuf/internal/python_message.py", line 1145, in MergeFromString
    if self._InternalParse(serialized, 0, length) != length:
  File "/usr/local/lib/python3.8/dist-packages/protobuf-3.14.0-py3.8.egg/google/protobuf/internal/python_message.py", line 1195, in InternalParse
    raise message_mod.DecodeError('Field number 0 is illegal.')
google.protobuf.message.DecodeError: Field number 0 is illegal.

rm ./squeezenet1.1-7.tar.gz
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.9s finished
1 models tested: squeezenet1.1-7

0 models passed:

1 models failed: squeezenet1.1-7

root@bda611eb4b38:/workdir/models# protoc --version
libprotoc 3.14.0
root@bda611eb4b38:/workdir/models# pip3 show protobuf
Name: protobuf
Version: 3.14.0
Summary: Protocol Buffers
Home-page: https://developers.google.com/protocol-buffers/
Author: None
Author-email: None
License: 3-Clause BSD License
Location: /usr/local/lib/python3.8/dist-packages/protobuf-3.14.0-py3.8.egg
Requires: six
Required-by: onnx
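
One visible difference between the two logs: the failing run passes `._squeezenet1.1.onnx` to RunONNXModel.py, while the passing run passes `squeezenet1.1.onnx`. Files prefixed with `._` are commonly metadata sidecars left behind when an archive is created on macOS, not models. That might explain the decode error, though nothing in the thread so far confirms it. The sketch below is only an assumption about how a model search could skip such files; `find_onnx_models` is a hypothetical helper, not the code in CheckONNXModelZoo.py.

```python
import os


def find_onnx_models(root: str) -> list:
    """Return .onnx files under `root`, skipping dot-prefixed entries.

    Hypothetical helper: names starting with "." (including "._" sidecar
    files) are treated as metadata rather than models.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".onnx") and not name.startswith("."):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Sorting the result keeps the choice of model deterministic across machines, since directory listing order can differ from one filesystem to another.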