triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Building from source fails with tensorrt_llm backend #7382

Open arya-samsung opened 4 weeks ago

arya-samsung commented 4 weeks ago

Description

While building from source, the build fails when the tensorrt_llm backend is chosen.

Triton Information

What version of Triton are you using? r24.04

Are you using the Triton container or did you build it yourself? Building from source

To Reproduce

Steps to reproduce the behavior:

1) Checkout the r24.04 branch of server
2) Run: ./build.py -v --backend=python --enable-logging --endpoint=http --enable-tracing --enable-stats --enable-gpu --backend=tensorrtllm

This gives the error:

    CMake Error at tensorrt_llm/CMakeLists.txt:107 (message): The batch manager library is truncated or incomplete. This is usually caused by using Git LFS (Large File Storage) incorrectly. Please try running command git lfs install && git lfs pull.

So we tried adding:

    self.cmd(
        f"cd {subdir} && git submodule init && git submodule update --merge && git lfs install && git lfs pull && cd ..",
        check_exitcode=True,
    )

after the git clone step here: https://github.com/triton-inference-server/server/blob/bf430f8589c82c57cc28e64be456c63a65ce7664/build.py#L325

But this did not help.
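One way to sanity-check whether the library actually came down through LFS is to look at the file itself: an LFS pointer is a tiny text stub, while the real archive is a large binary. A rough sketch (the path and file name below are guesses and may differ between TensorRT-LLM versions):

    # Rough check: a Git LFS pointer file starts with the line
    # "version https://git-lfs.github.com/spec/v1"; the real static library
    # is a large binary archive. The path and file name are assumptions.
    import os

    lib_path = ("/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/"
                "batch_manager/x86_64-linux-gnu/libtensorrt_llm_batch_manager_static.a")

    with open(lib_path, "rb") as f:
        head = f.read(64)

    if head.startswith(b"version https://git-lfs.github.com/spec/v1"):
        print("still an LFS pointer file - the binary was never fetched")
    else:
        print(f"looks like a real archive ({os.path.getsize(lib_path)} bytes)")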

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

N/A

Expected behavior

The build should have completed successfully, with no errors, and the Docker image should have been ready.

Additional details: the build was attempted using the steps given here: https://github.com/triton-inference-server/tensorrtllm_backend/tree/main#option-1-build-via-the-buildpy-script-in-server-repo

But this failed with the following error:

    cp: cannot stat '/tmp/tritonbuild/tensorrtllm/build/triton_tensorrtllm_worker': No such file or directory
    error: build failed

SeibertronSS commented 4 weeks ago

This is probably caused by your Batch Manager static files being incomplete.

arya-samsung commented 3 weeks ago

> This is probably caused by your Batch Manager static files being incomplete.

https://github.com/NVIDIA/TensorRT-LLM/tree/main/cpp/tensorrt_llm/batch_manager/aarch64-linux-gnu - is this the right one?

thanks for the lead, will check on this :)

arya-samsung commented 2 weeks ago

After fixing the batch manager files issue, got this error:

    Installed /tmp/tritonbuild/tensorrtllm/tensorrt_llm/3rdparty/cutlass/python
    Processing dependencies for cutlass-library==3.4.1
    Finished processing dependencies for cutlass-library==3.4.1
    -- MANUALLY APPENDING FLAG TO COMPILE FOR SM_90a.
    -- CMAKE_SYSTEM_PROCESSOR: x86_64
    -- Operating System: ubuntu, 22.04
    -- Performing Test HAS_FLTO
    -- Performing Test HAS_FLTO - Success
    -- Found pybind11: /usr/local/lib/python3.10/dist-packages/pybind11/include (found version "2.13.1")
    CMake Error at tensorrt_llm/plugins/CMakeLists.txt:108 (set_target_properties):
      set_target_properties called with incorrect number of arguments.

    -- Found Python: /usr/bin/python3.10 (found version "3.10.12") found components: Interpreter
    -- ========================= Importing and creating target nvonnxparser ==========================
    -- Looking for library nvonnxparser
    -- Library that was found /usr/lib/x86_64-linux-gnu/libnvonnxparser.so
    -- ==========================================================================================
    -- Configuring incomplete, errors occurred!
    Traceback (most recent call last):
      File "/tmp/tritonbuild/tensorrtllm/build/../tensorrt_llm/scripts/build_wheel.py", line 332, in <module>
        main(**vars(args))
      File "/tmp/tritonbuild/tensorrtllm/build/../tensorrt_llm/scripts/build_wheel.py", line 162, in main
        build_run(
      File "/usr/lib/python3.10/subprocess.py", line 526, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command 'cmake -DCMAKE_BUILD_TYPE="Release" -DBUILD_PYT="ON" -DBUILD_PYBIND="ON" -DNVTX_DISABLE="ON" -DTRT_LIB_DIR=/usr/local/tensorrt/targets/x86_64-linux-gnu/lib -DTRT_INCLUDE_DIR=/usr/local/tensorrt/include -S "/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp"' returned non-zero exit status 1.
    error: build failed

letmerecall commented 1 week ago

Were you able to figure it out @arya-samsung? Facing a similar issue with branch r24.06.

arya-samsung commented 1 week ago

Nope :( still facing it. Will update here if a solution is found; do let me know too in case you find a workaround/solution.

EmileDqy commented 6 days ago

Hi,

I encountered the same issue when I followed Option 1 of Build the Docker Container.

I did manage to fix the issue:

1) Open ./build.py
2) Modify this line by replacing triton_tensorrtllm_worker with trtllmExecutorWorker

Previous version:

    cmake_script.cp(
        os.path.join(tensorrtllm_be_dir, "build", "triton_tensorrtllm_worker"),
        cmake_destination_dir,
    )

Fixed version:

    cmake_script.cp(
        os.path.join(tensorrtllm_be_dir, "build", "trtllmExecutorWorker"),
        cmake_destination_dir,
    )

3) Run the script and it should work
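Before editing build.py you can also check which worker binary the backend build actually produced, so the cp() call points at something that exists. A quick sketch (the build directory is taken from the cp error above; adjust if yours differs):

    # List the tensorrtllm backend build output; the directory comes from the
    # "cannot stat" error earlier in this thread. Adjust the path if needed.
    import glob

    build_dir = "/tmp/tritonbuild/tensorrtllm/build"
    for path in sorted(glob.glob(f"{build_dir}/*")):
        print(path)  # should show trtllmExecutorWorker rather than triton_tensorrtllm_worker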

Credits to #7194 for the fix. I don't know why this commit is not in the r24.05 (and r24.04, it seems) branch, as the last commit on that branch dates back to May 29th, whereas this pull request was merged on May 8th.

Cheers

arya-samsung commented 10 hours ago

After making the change from #7194 and trying to build again, got the following error:

    /usr/bin/ld: libtriton_tensorrtllm_common.so: undefined reference to `tensorrt_llm::batch_manager::GptManager::GptManager(std::filesystem::__cxx11::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, std::function<std::__cxx11::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int)>, std::function<void (unsigned long, std::__cxx11::list<tensorrt_llm::batch_manager::NamedTensor, std::allocator<tensorrt_llm::batch_manager::NamedTensor> > const&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>, std::function<std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > ()>, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)>, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::optional<unsigned long>, std::optional<int>, bool)'
    /usr/bin/ld: libtriton_tensorrtllm_common.so: undefined reference to `tensorrt_llm::batch_manager::NamedTensor::NamedTensor(nvinfer1::DataType, std::vector<long, std::allocator<long> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void const*)'
    collect2: error: ld returned 1 exit status
    gmake[2]: Leaving directory '/tmp/tritonbuild/tensorrtllm/build'
    gmake[2]: *** [CMakeFiles/triton-tensorrt-llm-worker.dir/build.make:109: triton_tensorrtllm_worker] Error 1
    gmake[1]: *** [CMakeFiles/Makefile2:335: CMakeFiles/triton-tensorrt-llm-worker.dir/all] Error 2