microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Build] build error with --enable_training #19063

Closed: DefTruth closed this issue 9 months ago

DefTruth commented 9 months ago

Describe the issue

/workspace/dev/openlibs/onnxruntime/build/Linux/Release/tensorboard/compat/proto/config.pb.cc:702:6: error: ‘::descriptor_table_tensorboard_2fcompat_2fproto_2frewriter_5fconfig_2eproto’ has not been declared
  702 |   &::descriptor_table_tensorboard_2fcompat_2fproto_2frewriter_5fconfig_2eproto,
      |     ^~~~~~~~~~~~~~~~~
/workspace/dev/openlibs/onnxruntime/build/Linux/Release/tensorboard/compat/proto/config.pb.cc:703:6: error: ‘::descriptor_table_tensorboard_2fcompat_2fproto_2fstep_5fstats_2eproto’ has not been declared
  703 |   &::descriptor_table_tensorboard_2fcompat_2fproto_2fstep_5fstats_2eproto,
      |     ^~~~~~~~~~~~~~~~
gmake[2]: *** [tensorboard/compat/proto/CMakeFiles/tensorboard.dir/build.make:526: tensorboard/compat/proto/CMakeFiles/tensorboard.dir/rewriter_config.pb.cc.o] Error 1
gmake[2]: *** [tensorboard/compat/proto/CMakeFiles/tensorboard.dir/build.make:540: tensorboard/compat/proto/CMakeFiles/tensorboard.dir/saved_object_graph.pb.cc.o] Error 1
/workspace/dev/openlibs/onnxruntime/build/Linux/Release/tensorboard/compat/

Urgency

No response

Target platform

linux

Build script

./build.sh --cudnn_home /usr/lib/x86_64-linux-gnu --cuda_home /usr/local/cuda --use_tensorrt --use_tensorrt_oss_parser --tensorrt_home /workspace/dev/openlibs/TensorRT --enable_training --build_wheel --enable_training_apis --allow_running_as_root --skip_tests --config Release --parallel --build_shared_lib

Error / output

/workspace/dev/openlibs/onnxruntime/build/Linux/Release/tensorboard/compat/proto/config.pb.cc:702:6: error: ‘::descriptor_table_tensorboard_2fcompat_2fproto_2frewriter_5fconfig_2eproto’ has not been declared
  702 |   &::descriptor_table_tensorboard_2fcompat_2fproto_2frewriter_5fconfig_2eproto,
      |     ^~~~~~~~~~~~~~~~~
/workspace/dev/openlibs/onnxruntime/build/Linux/Release/tensorboard/compat/proto/config.pb.cc:703:6: error: ‘::descriptor_table_tensorboard_2fcompat_2fproto_2fstep_5fstats_2eproto’ has not been declared
  703 |   &::descriptor_table_tensorboard_2fcompat_2fproto_2fstep_5fstats_2eproto,
      |     ^~~~~~~~~~~~~~~~
gmake[2]: *** [tensorboard/compat/proto/CMakeFiles/tensorboard.dir/build.make:526: tensorboard/compat/proto/CMakeFiles/tensorboard.dir/rewriter_config.pb.cc.o] Error 1
gmake[2]: *** [tensorboard/compat/proto/CMakeFiles/tensorboard.dir/build.make:540: tensorboard/compat/proto/CMakeFiles/tensorboard.dir/saved_object_graph.pb.cc.o] Error 1
/workspace/dev/openlibs/onnxruntime/build/Linux/Release/tensorboard/compat/

Visual Studio Version

No response

GCC / Compiler Version

11.4

jywu-msft commented 9 months ago

Hi, which commit are you building from? There is a known issue where --enable_training together with --use_tensorrt_oss_parser now requires full protobuf: https://github.com/microsoft/onnxruntime/issues/18040#issuecomment-1859059916

DefTruth commented 9 months ago

> Hi, which commit are you building from? There is a known issue where --enable_training together with --use_tensorrt_oss_parser now requires full protobuf: #18040 (comment)

I am using the latest main branch. So does that mean I should add the --use_full_protobuf flag? The error seems to be caused by the tensorboard protos; how can I fix it?

@jywu-msft

DefTruth commented 9 months ago

I want a Python wheel that supports the TensorRT EP and OrtValue.from_dlpack/to_dlpack, so that an OrtValue can be converted directly to/from a torch Tensor on the CUDA device while running the model on the TensorRT EP, but these functions are currently only available with the --enable_training flag. By the way, I have tried running the session with run_with_iobinding, but the performance was very unstable, ranging from 415 ms to 500 ms in my case (I get a stable average of 415 ms when I run the session with the run_with_ort_values API). So I want to use the run_with_ort_values API, and I need to pass CUDA torch Tensors to OrtValue as inputs directly, without a D2H+H2D copy.

https://github.com/microsoft/onnxruntime/blob/b25980c011f658709404f1b850302f94fc9541f4/orttraining/orttraining/python/training/ortmodule/_utils.py#L50

Or, how can I fix the instability when using run_with_iobinding?

@jywu-msft
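
For reference, the DLPack round trip described above looks roughly like this. It is a minimal sketch, assuming a training-enabled wheel in which OrtValue.from_dlpack/to_dlpack are compiled in (the same pattern as the _utils.py line linked above); the model path and the "input"/"output" tensor names are placeholders:

import torch
from torch.utils.dlpack import from_dlpack, to_dlpack

import onnxruntime as ort

# Placeholder model; "input"/"output" must match the graph's tensor names.
sess = ort.InferenceSession(
    "model.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)

x = torch.randn(1, 3, 224, 224, device="cuda")

# Wrap the CUDA tensor's memory as an OrtValue via DLPack (no D2H/H2D copy).
# The second argument tells ORT whether the tensor holds booleans.
x_ort = ort.OrtValue.from_dlpack(to_dlpack(x), x.dtype == torch.bool)

# run_with_ort_values consumes and returns OrtValues, so the data can stay
# on the device for the whole call.
(y_ort,) = sess.run_with_ort_values(["output"], {"input": x_ort})

# Back to a torch CUDA tensor, again via DLPack.
y = from_dlpack(y_ort.to_dlpack())

If run_with_iobinding is used instead, pre-binding inputs and outputs to fixed device buffers via bind_ortvalue_input/bind_ortvalue_output avoids per-call output allocation, which is one common source of the run-to-run variance described above.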

DefTruth commented 9 months ago

> Hi, which commit are you building from? There is a known issue where --enable_training together with --use_tensorrt_oss_parser now requires full protobuf: #18040 (comment)

I have added --use_full_protobuf, but I still get the same error. How can I exclude tensorboard?

./build.sh  --cudnn_home /usr/lib/x86_64-linux-gnu --cuda_home /usr/local/cuda --use_tensorrt --use_tensorrt_oss_parser --tensorrt_home /workspace/dev/openlibs/TensorRT --enable_training --build_wheel --allow_running_as_root --skip_tests --config Release --parallel --use_full_protobuf

My system protoc version is:

$ /usr/bin/protoc --version
libprotoc 3.12.4

@jywu-msft

DefTruth commented 9 months ago

Or, is there a way to create an OrtValue from a CUDA device buffer using the Python API? It seems I can only use ortvalue_from_numpy.

@jywu-msft
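
For context, in a standard (non-training) wheel, ortvalue_from_numpy and its sibling ortvalue_from_shape_and_type can already place an OrtValue on CUDA, though they copy from host memory or allocate fresh device memory rather than wrapping an existing CUDA buffer. A minimal sketch:

import numpy as np
import onnxruntime as ort

host = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Copies the host array into a newly allocated CUDA buffer (one H2D copy).
x_gpu = ort.OrtValue.ortvalue_from_numpy(host, "cuda", 0)

# Allocates an uninitialized CUDA buffer with the given shape/dtype,
# e.g. for pre-binding an output with IOBinding.
y_gpu = ort.OrtValue.ortvalue_from_shape_and_type([1, 1000], np.float32, "cuda", 0)

print(x_gpu.device_name())  # "cuda"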

DefTruth commented 9 months ago

> I want a Python wheel that supports the TensorRT EP and OrtValue.from_dlpack/to_dlpack, so that an OrtValue can be converted directly to/from a torch Tensor on the CUDA device while running the model on the TensorRT EP, but these functions are currently only available with the --enable_training flag. By the way, I have tried running the session with run_with_iobinding, but the performance was very unstable, ranging from 415 ms to 500 ms in my case (I get a stable average of 415 ms when I run the session with the run_with_ort_values API). So I want to use the run_with_ort_values API, and I need to pass CUDA torch Tensors to OrtValue as inputs directly, without a D2H+H2D copy.
>
> https://github.com/microsoft/onnxruntime/blob/b25980c011f658709404f1b850302f94fc9541f4/orttraining/orttraining/python/training/ortmodule/_utils.py#L50
>
> Or, how can I fix the instability when using run_with_iobinding?
>
> @jywu-msft

I finally solved this problem by modifying some of ORT's source code and adding dlpack to the build sources manually.