openvinotoolkit / training_extensions

Train, Evaluate, Optimize, Deploy Computer Vision Models via OpenVINO™
https://openvinotoolkit.github.io/training_extensions/
Apache License 2.0
1.14k stars, 442 forks

Export error after training #3770

Open ip2016 opened 2 months ago

ip2016 commented 2 months ago

I'm trying to train a yolox_tiny model on my image dataset with a single additional category. Training and testing complete successfully, but exporting fails with the error "Argument 1 and 2 element types must match." I'm using the otx[xpu] extension and an ARC 750 GPU for training.

Steps to Reproduce

  1. Training: otx train --config recipe/detection/yolox_tiny.yaml --data_root Datasets/my-dataset --work_dir yolox-model

Epoch 15/199 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8/8 0:00:03 • 0:00:00 2.48it/s v_num: 0 train/loss_cls: 0.452 train/loss_bbox: 1.580 train/loss_obj: 0.928 train/loss: 2.960 train/data_time: 0.022 train/iter_time: 0.423 val/map: 0.692 val/map_50: 1.000 val/map_75: 1.000 val/map_small: -1.000 val/map_medium: -1.000 val/map_large: 0.692 val/mar_1: 0.720 val/mar_10: 0.720 val/mar_100: 0.720 val/mar_small: -1.000 val/mar_medium: -1.000 val/mar_large: 0.720 val/map_per_class: -1.000 val/mar_100_per_class: -1.000 val/classes: 0.000 val/f1-score: 1.000 Elapsed time: 0:01:37.700299

  2. Testing: otx test --config yolox-model/20240726_144135/configs.yaml --data_root Datasets/my-dataset --checkpoint yolox-model/20240726_144135/last.ckpt

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric               ┃ DataLoader 0              ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ test/classes              │ 0.0                       │
│ test/f1-score             │ 0.8888888955116272        │
│ test/map                  │ 0.49603959918022156       │
│ test/map_50               │ 0.7920792102813721        │
│ test/map_75               │ 0.7920792102813721        │
│ test/map_large            │ 0.49603959918022156       │
│ test/map_medium           │ -1.0                      │
│ test/map_per_class        │ -1.0                      │
│ test/map_small            │ -1.0                      │
│ test/mar_1                │ 0.5                       │
│ test/mar_10               │ 0.5                       │
│ test/mar_100              │ 0.5                       │
│ test/mar_100_per_class    │ -1.0                      │
│ test/mar_large            │ 0.5                       │
│ test/mar_medium           │ -1.0                      │
│ test/mar_small            │ -1.0                      │
└───────────────────────────┴───────────────────────────┘
Testing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 0:00:07 • 0:00:00 0.00it/s
Elapsed time: 0:00:30.886884

  3. Exporting: otx export --config yolox-model/20240726_144135/configs.yaml --data_root Datasets/my-dataset --checkpoint yolox-model/20240726_144135/last.ckpt

/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/core/model/detection.py:268: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  shape = (int(inputs.shape[2]), int(inputs.shape[3]))
/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/core/model/detection.py:275: TracerWarning: Using len to get tensor shape might cause the trace to be incorrect. Recommended usage would be tensor.shape[0]. Passing a tensor of different shape might lead to errors or silently give incorrect results.
  meta_info_list = [meta_info] * len(inputs)
/mnt/d/Projects/venv/lib/python3.10/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /build/pytorch/aten/src/ATen/native/TensorShape.cpp:3526.)
  return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/algo/common/utils/nms.py:248: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  iou_threshold = torch.tensor([iou_threshold], dtype=torch.float32)
/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/algo/common/utils/nms.py:249: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  score_threshold = torch.tensor([score_threshold], dtype=torch.float32)
/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/algo/common/utils/utils.py:142: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  k = torch.tensor(k, device=input.device, dtype=torch.long)
/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/algo/common/utils/nms.py:387: TracerWarning: Converting a tensor to a Python float might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  score_threshold = float(score_threshold)
/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/algo/common/utils/nms.py:388: TracerWarning: Converting a tensor to a Python float might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  iou_threshold = float(iou_threshold)
/mnt/d/Projects/venv/lib/python3.10/site-packages/torch/onnx/symbolic_opset9.py:5856: UserWarning: Exporting aten::index operator of advanced indexing in opset 11 is achieved by combination of multiple ONNX operators, including Reshape, Transpose, Concat, and Gather. If indices include negative values, the exported graph will produce incorrect results.
  warnings.warn(
/mnt/d/Projects/venv/lib/python3.10/site-packages/torch/onnx/utils.py:1686: UserWarning: The exported ONNX model failed ONNX shape inference. The model will not be executable by the ONNX Runtime. If this is unintended and you believe there is a bug, please report an issue at https://github.com/pytorch/pytorch/issues.
Error reported by strict ONNX shape inference: [ShapeInferenceError] (op_type:Where, node name: /Where_2): Y has inconsistent type tensor(float) (Triggered internally at /build/pytorch/torch/csrc/jit/serialization/export.cpp:1415.)
  _C._check_onnx_proto(proto)
2024-07-26 08:27:07,083 - root - INFO - Converting to ONNX is done.

GeneralFailure: Check 'error_message.empty()' failed at src/frontends/onnx/frontend/src/frontend.cpp:122: FrontEnd API failed with GeneralFailure: Errors during ONNX translation: While validating ONNX node '<Node(Where): /Where_2>': Check 'element::Type::merge(result_et, get_input_element_type(1), get_input_element_type(2))' failed at src/core/src/op/select.cpp:68: While validating node 'opset1::Select Select_2595 (opset1::Equal /Equal[0]:boolean[1,..200], opset1::Tile /Tile_1[0]:i64[1,..200], opset1::Constant /Constant_112[0]:f32[1]) -> (dynamic[...])' with friendly_name 'Select_2595': Argument 1 and 2 element types must match.
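The message above lists the three inputs of the failing Select node together with their element types, and the complaint concerns the second and third inputs (arguments 1 and 2). As a rough aid for reading it, the sketch below pulls those types out of the message with a regular expression. The pattern is an assumption inferred from this one message, not a documented OpenVINO format, and `select_inputs` is a hypothetical helper name.

```python
import re

# Excerpt of the OpenVINO error message quoted in this issue.
ERROR = (
    "While validating node 'opset1::Select Select_2595 "
    "(opset1::Equal /Equal[0]:boolean[1,..200], "
    "opset1::Tile /Tile_1[0]:i64[1,..200], "
    "opset1::Constant /Constant_112[0]:f32[1]) -> (dynamic[...])' "
    "with friendly_name 'Select_2595': "
    "Argument 1 and 2 element types must match."
)

# Assumed layout of each input: "<opset>::<op> <producer>[<port>]:<type>[<shape>]".
INPUT_RE = re.compile(r"opset\d+::(\w+) (\S+)\[\d+\]:(\w+)\[")


def select_inputs(message):
    """Return (op_type, producer_name, element_type) for each listed input."""
    return INPUT_RE.findall(message)


inputs = select_inputs(ERROR)
for index, (op, name, etype) in enumerate(inputs):
    print(f"argument {index}: {name} ({op}) has element type {etype}")
```

Read this way, the condition is boolean, the "then" branch (/Tile_1) is i64 and the "else" branch (/Constant_112) is f32, which is the mismatch the converter rejects.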

Traceback (most recent call last):
  File "/mnt/d/Projects/venv/bin/otx", line 8, in <module>
    sys.exit(main())
  File "/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/cli/__init__.py", line 17, in main
    OTXCLI()
  File "/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/cli/cli.py", line 60, in __init__
    self.run()
  File "/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/cli/cli.py", line 531, in run
    fn(**fn_kwargs)
  File "/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/engine/engine.py", line 585, in export
    exported_model_path = self.model.export(
  File "/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/algo/detection/yolox.py", line 99, in export
    return super().export(output_dir, base_name, export_format, precision, to_exportable_code)
  File "/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/core/model/base.py", line 647, in export
    return self._exporter.export(
  File "/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/core/exporter/base.py", line 108, in export
    return self.to_openvino(model, output_dir, base_model_name, precision)
  File "/mnt/d/Projects/venv/lib/python3.10/site-packages/otx/core/exporter/native.py", line 80, in to_openvino
    exported_model = openvino.convert_model(
  File "/mnt/d/Projects/venv/lib/python3.10/site-packages/openvino/tools/ovc/convert.py", line 100, in convert_model
    ov_model, _ = _convert(cli_parser, params, True)
  File "/mnt/d/Projects/venv/lib/python3.10/site-packages/openvino/tools/ovc/convert_impl.py", line 535, in _convert
    raise e
  File "/mnt/d/Projects/venv/lib/python3.10/site-packages/openvino/tools/ovc/convert_impl.py", line 477, in _convert
    ov_model = driver(argv, {"conversion_parameters": non_default_params})
  File "/mnt/d/Projects/venv/lib/python3.10/site-packages/openvino/tools/ovc/convert_impl.py", line 228, in driver
    ov_model = moc_emit_ir(prepare_ir(argv), argv)
  File "/mnt/d/Projects/venv/lib/python3.10/site-packages/openvino/tools/ovc/convert_impl.py", line 177, in prepare_ir
    ov_model = moc_pipeline(argv, moc_front_end)
  File "/mnt/d/Projects/venv/lib/python3.10/site-packages/openvino/tools/ovc/moc_frontend/pipeline.py", line 244, in moc_pipeline
    ov_model = moc_front_end.convert(input_model)
  File "/mnt/d/Projects/venv/lib/python3.10/site-packages/openvino/frontend/frontend.py", line 18, in convert
    converted_model = super().convert(model)
openvino._pyopenvino.GeneralFailure: Check 'error_message.empty()' failed at src/frontends/onnx/frontend/src/frontend.cpp:122: FrontEnd API failed with GeneralFailure: Errors during ONNX translation: While validating ONNX node '<Node(Where): /Where_2>': Check 'element::Type::merge(result_et, get_input_element_type(1), get_input_element_type(2))' failed at src/core/src/op/select.cpp:68: While validating node 'opset1::Select Select_2595 (opset1::Equal /Equal[0]:boolean[1,..200], opset1::Tile /Tile_1[0]:i64[1,..200], opset1::Constant /Constant_112[0]:f32[1]) -> (dynamic[...])' with friendly_name 'Select_2595': Argument 1 and 2 element types must match.

Environment:

python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__); print(ipex.__version__); [print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())];"

2.1.0.post2+cxx11.abi
2.1.30+xpu
[0]: _DeviceProperties(name='Intel(R) Graphics [0x56a1]', platform_name='Intel(R) Level-Zero', dev_type='gpu', driver_version='1.3.27642', has_fp64=0, total_memory=7934MB, max_compute_units=448, gpu_eu_count=448)

hwinfo --display

07: PCI 4bfb0000.0: 0302 3D controller
  [Created at pci.386]
  Unique ID: +JEX.TMx8hlOLi40
  SysFS ID: /devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0004:00/VMBUS:00/035448d6-4bfb-4b22-bbe5-9a3bb13c8f15/pci4bfb:00/4bfb:00:00.0
  SysFS BusID: 4bfb:00:00.0
  Hardware Class: graphics card
  Model: "Microsoft 3D controller"
  Vendor: pci 0x1414 "Microsoft Corporation"
  Device: pci 0x008e
  Driver: "dxgkrnl"
  Driver Modules: "dxgkrnl", "dxgkrnl"
  Module Alias: "pci:v00001414d0000008Esv00000000sd00000000bc03sc02i00"
  Config Status: cfg=new, avail=yes, need=no, active=unknown

clinfo -l

Platform #0: Intel(R) FPGA Emulation Platform for OpenCL(TM)
  -- Device #0: Intel(R) FPGA Emulation Device
Platform #1: Intel(R) OpenCL
  -- Device #0: Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz
Platform #2: Intel(R) OpenCL Graphics
  -- Device #0: Intel(R) Graphics [0x56a1]

harimkang commented 2 months ago

@sovrasov Who is it appropriate to assign this to? (ARC GPU issue)

ip2016 commented 2 months ago

> @sovrasov Who is it appropriate to assign this to? (ARC GPU issue)

I'm not sure that this is an ARC GPU-specific issue. I'm observing the same error with CPU training/validation/export.

sovrasov commented 2 months ago

> I'm not sure that this is an ARC GPU-specific issue. I'm observing the same error with CPU training/validation/export.

You're right, it isn't specific to the ARC GPU itself. otx[xpu] installs a patched torch + IPEX, which sometimes messes up output types. For now, the workaround is to run the export in a CPU or CUDA environment (i.e. use upstream torch).
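Since the workaround depends on which torch build is active, it may help to check the environment before running the export. The sketch below is only a heuristic: it assumes that the presence of the `intel_extension_for_pytorch` package (the module imported in the environment report above) indicates the otx[xpu] setup. The helper names are hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


def describe_environment():
    """Guess which kind of environment is active (heuristic, not authoritative)."""
    if is_installed("intel_extension_for_pytorch"):
        return "IPEX found: this looks like the otx[xpu] environment"
    return "IPEX not found: this looks like an upstream torch environment"


print(describe_environment())
```

If the first message is printed, the export would presumably hit the same type mismatch, so it should be run from a separate environment with upstream torch.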

ip2016 commented 2 months ago

> You're right, it isn't specific to the ARC GPU itself. otx[xpu] installs a patched torch + IPEX, which sometimes messes up output types. For now, the workaround is to run the export in a CPU or CUDA environment (i.e. use upstream torch).

Thanks. I'll try it out.

ip2016 commented 2 months ago

Update: I get a different error when trying to train on CPU with the otx[base] package:

RuntimeError: "nms_kernel" not implemented for 'BFloat16'

sovrasov commented 2 months ago

> Update: I get a different error when trying to train on CPU with the otx[base] package:
>
> RuntimeError: "nms_kernel" not implemented for 'BFloat16'

Training with upstream torch is not required: the checkpoint trained on ARC with IPEX should work in upstream torch as well, so only the export step needs to run there.
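In practice this means re-running only the export command from the reproduction steps inside the upstream torch environment, pointing it at the existing run directory. A minimal sketch that rebuilds that command follows. It uses only the flags and file names already shown in this issue; `build_export_command` is a hypothetical helper, and the paths are the reporter's.

```python
import shlex
from pathlib import PurePosixPath


def build_export_command(run_dir, data_root, checkpoint_name="last.ckpt"):
    """Assemble the `otx export` argument list for an existing training run."""
    run = PurePosixPath(run_dir)
    return [
        "otx", "export",
        "--config", str(run / "configs.yaml"),
        "--data_root", data_root,
        "--checkpoint", str(run / checkpoint_name),
    ]


cmd = build_export_command("yolox-model/20240726_144135", "Datasets/my-dataset")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the environment that has upstream torch installed, or the printed line can be pasted into a shell there.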