triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Docker build of Triton Server r24.07 on Ubuntu 22.04/Arm fails #7513

Open goetzrieger opened 1 month ago

goetzrieger commented 1 month ago

Description: I'm trying to build a custom CPU-only Triton server for edge use, to limit the image size.

Errors out with:

cd /tmp/tritonbuild/tritonserver/build/_deps/repo-core-build/triton-core/test && /usr/bin/cmake -E cmake_link_script CMakeFiles/repo_agent_test.dir/link.txt --verbose=1
/usr/bin/c++ -O3 -DNDEBUG CMakeFiles/repo_agent_test.dir/repo_agent_test.cc.o CMakeFiles/repo_agent_test.dir/__/repo_agent.cc.o CMakeFiles/repo_agent_test.dir/__/status.cc.o CMakeFiles/repo_agent_test.dir/__/filesystem/api.cc.o CMakeFiles/repo_agent_test.dir/__/model_config_utils.cc.o "../_deps/repo-common-build/protobuf/CMakeFiles/proto-library.dir/grpc_service.pb.cc.o" "../_deps/repo-common-build/protobuf/CMakeFiles/proto-library.dir/health.pb.cc.o" "../_deps/repo-common-build/protobuf/CMakeFiles/proto-library.dir/model_config.pb.cc.o" -o repo_agent_test  ../_deps/repo-common-build/src/libtritoncommonerror.a ../_deps/repo-common-build/libtritoncommonmodelconfig.a ../_deps/repo-common-build/src/libtritoncommonlogging.a ../lib/libgtest.a ../lib/libgtest_main.a /tmp/tritonbuild/tritonserver/build/third-party/protobuf/lib/libprotobuf.a /tmp/tritonbuild/tritonserver/build/third-party/protobuf/lib/libprotobuf.a ../lib/libgtest.a 
gmake[5]: Leaving directory '/tmp/tritonbuild/tritonserver/build/_deps/repo-core-build/triton-core'
[ 51%] Built target repo_agent_test
gmake[4]: Leaving directory '/tmp/tritonbuild/tritonserver/build/_deps/repo-core-build/triton-core'
gmake[3]: *** [Makefile:136: all] Error 2
gmake[3]: Leaving directory '/tmp/tritonbuild/tritonserver/build/_deps/repo-core-build/triton-core'
gmake[2]: *** [_deps/repo-core-build/CMakeFiles/triton-core.dir/build.make:86: _deps/repo-core-build/triton-core/src/triton-core-stamp/triton-core-build] Error 2
gmake[2]: Leaving directory '/tmp/tritonbuild/tritonserver/build'
gmake[1]: *** [CMakeFiles/Makefile2:726: _deps/repo-core-build/CMakeFiles/triton-core.dir/all] Error 2
gmake[1]: Leaving directory '/tmp/tritonbuild/tritonserver/build'
gmake: *** [Makefile:136: all] Error 2

error: build failed

I'm by no means a developer. Is there anything obviously wrong in what I'm doing?

Thanks!

nv-kmcgill53 commented 1 month ago

Hi @goetzrieger, since the build is multi-threaded, the actual error may be higher up in the output, usually on a line marked with ***. You can search for those markers or for Error: (ideally case-insensitively, since compiler errors are printed as "error:"). If it's too far up in the terminal output, you can redirect the build output to a file and grep through it. Something along the lines of:

# '> out.txt' redirects stdout to out.txt
# '2>&1' redirects stderr to wherever stdout points, i.e. out.txt as well
$ ./build.py --backend onnxruntime -v > out.txt 2>&1

goetzrieger commented 1 month ago

Thanks Kyle! I did that and had a better look at the output... but TBH this is not my area of expertise so I can't really find anything even if there are some more Errors.

I don't know if this makes sense, but I'm sharing my output here. It would be great if you could have a quick look.

As I said, we plan to use Triton on an edge device, specifically a Raspi 4 powered robot for event workshops, and the full image is just too big at 11 GB or so.

out.txt

nv-kmcgill53 commented 1 month ago

Looking through your logs, it appears there is an issue converting a uint64_t* to a size_t*.

/tmp/tritonbuild/tritonserver/build/_deps/repo-core-src/src/cache_entry.cc:177:34: error: cannot convert 'uint64_t*' {aka 'long long unsigned int*'} to 'size_t*' {aka 'unsigned int*'}
  177 |         output, base + position, &packed_output_byte_size));
      |                                  ^~~~~~~~~~~~~~~~~~~~~~~~
      |                                  |
      |                                  uint64_t* {aka long long unsigned int*}

(Sorry for adding the control chars. I don't really want to spend time scrubbing the logs.) This looks to be because your system defines size_t as unsigned int, while we explicitly declare packed_output_byte_size as uint64_t, which is unsigned long long on your system. A pointer to one cannot be passed where a pointer to the other is expected. I know this is not the answer you want to hear, but this is a "system dependent" error.

@rmccorm4 I think this stems from the fact that we are not consistent with our use of size types in our code base. Probably some tech debt that needs to be addressed. Do you have any further suggestions?
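For illustration, the mismatch described above can be reproduced and worked around in a few lines of C++. This is a minimal sketch under stated assumptions, not the Triton code: QueryByteSize and PackedByteSize are hypothetical names standing in for whatever call receives the size_t* at cache_entry.cc:177, and the value 128 is arbitrary. The idea is to hand the callee a real size_t and widen the result afterwards, which compiles whether or not size_t and uint64_t are the same type.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for an API that reports a byte count through a size_t*.
constexpr bool QueryByteSize(std::size_t* byte_size) {
  *byte_size = 128;  // arbitrary value, for illustration only
  return true;
}

// Portable call pattern: pass a genuine size_t, then widen into the uint64_t.
constexpr std::uint64_t PackedByteSize() {
  std::uint64_t packed_output_byte_size = 0;
  // QueryByteSize(&packed_output_byte_size);
  // ^ fails to compile wherever size_t and uint64_t are distinct types,
  //   e.g. a target where size_t is 'unsigned int'.
  std::size_t tmp = 0;
  if (QueryByteSize(&tmp)) {
    packed_output_byte_size = tmp;  // widening copy, safe on every platform
  }
  return packed_output_byte_size;
}
```

Calling PackedByteSize() returns 128 in this sketch. Whether a change of this shape is the right fix for the real code is for the maintainers to decide.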

goetzrieger commented 1 month ago

I'm quite happy you are looking into this in the first place.

What I don't really get is that the full 11 GB Triton Arm image I can grab from the Nvidia registry works fine on the Raspi. And if I understand correctly, the whole build happens in a container with all deps anyway. So is this error due to trying to build on the Raspi?

Should I try to build on some other Arm system like an AWS instance?

nv-kmcgill53 commented 1 month ago

I was curious about what is going on with the compiler and I think there is something funny going on with the way size_t is getting sized (no pun intended). Unfortunately I couldn't reproduce the error in my minimal godbolt example, although you can observe that y has a width of 4 bytes and x has a width of 8 bytes. I guess this also shows off the implementation-defined width of the standard types. Perhaps you can play around with my example and get something that reproduces? I'm not sure which compiler to use, so I used gcc 13 and gcc 11. I added MSVC just for fun to compare :)
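One possible reason a snippet like this does not reproduce on a 64-bit compiler target: there, size_t and uint64_t are usually the same underlying type (unsigned long with GCC on x86-64 Linux), so a uint64_t* is already a size_t*. In the log above, size_t is unsigned int and uint64_t is unsigned long long, which are distinct types. The sketch below reports which case a given compiler is in. The names are illustrative and not taken from the godbolt example.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// True when size_t and uint64_t are literally the same type on this target.
constexpr bool kSameType = std::is_same<std::size_t, std::uint64_t>::value;

// True when the two types merely have the same width in bytes.
constexpr bool kSameWidth = sizeof(std::size_t) == sizeof(std::uint64_t);

// True when a uint64_t* can be passed where a size_t* parameter is expected.
constexpr bool PointerArgCompiles() {
  return std::is_convertible<std::uint64_t*, std::size_t*>::value;
}
```

On a target where PointerArgCompiles() is false, the conversion error from the build log should appear. Compiling for a 32-bit target (for example with GCC's -m32 on x86, assuming the 32-bit libraries are installed) is one way to try that case without an Arm board.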

goetzrieger commented 1 month ago

I'm afraid this is way past my skills with cpp etc... ;)

I'm wondering how and where the Nvidia Arm Docker image that works on the Raspi was built?