triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Running LLama2 RT LLM model fails #6813

Closed amir1m closed 6 months ago

amir1m commented 9 months ago

Description
While running Llama2 (fine-tuned on custom data) as per the tutorial Popular_Models_Guide/Llama2/trtllm_guide.md, I get the following error:

```
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=1 --model_repo=/opt/tritonserver/inflight_batcher_llm/
root@Amir-Dev-GPU:/opt/tritonserver#
I0119 11:58:39.014107 135 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f8ea8000000' with size 268435456
I0119 11:58:39.024138 135 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0119 11:58:39.024150 135 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
W0119 11:58:39.146633 135 server.cc:251] failed to enable peer access for some device pairs
[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:335] Error parsing text-format inference.ModelConfig: 29:17: Expected integer, got: $
E0119 11:58:39.147422 135 model_repository_manager.cc:1325] Poll failed for model directory 'ensemble': failed to read text proto from /opt/tritonserver/inflight_batcher_llm/ensemble/config.pbtxt
[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:335] Error parsing text-format inference.ModelConfig: 29:17: Expected integer, got: $
E0119 11:58:39.147513 135 model_repository_manager.cc:1325] Poll failed for model directory 'postprocessing': failed to read text proto from /opt/tritonserver/inflight_batcher_llm/postprocessing/config.pbtxt
[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:335] Error parsing text-format inference.ModelConfig: 29:17: Expected integer, got: $
E0119 11:58:39.147562 135 model_repository_manager.cc:1325] Poll failed for model directory 'preprocessing': failed to read text proto from /opt/tritonserver/inflight_batcher_llm/preprocessing/config.pbtxt
[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:335] Error parsing text-format inference.ModelConfig: 29:17: Expected integer, got: $
E0119 11:58:39.147626 135 model_repository_manager.cc:1325] Poll failed for model directory 'tensorrt_llm': failed to read text proto from /opt/tritonserver/inflight_batcher_llm/tensorrt_llm/config.pbtxt
[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:335] Error parsing text-format inference.ModelConfig: 29:17: Expected integer, got: $
E0119 11:58:39.147674 135 model_repository_manager.cc:1325] Poll failed for model directory 'tensorrt_llm_bls': failed to read text proto from /opt/tritonserver/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt

I0119 11:58:39.147695 135 server.cc:606]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0119 11:58:39.147708 135 server.cc:633]
+---------+------+--------+
| Backend | Path | Config |
+---------+------+--------+
+---------+------+--------+

I0119 11:58:39.147716 135 server.cc:676]
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
+-------+---------+--------+

I0119 11:58:39.240317 135 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA A10
I0119 11:58:39.240343 135 metrics.cc:817] Collecting metrics for GPU 1: NVIDIA A10
I0119 11:58:39.249242 135 metrics.cc:710] Collecting CPU metrics
I0119 11:58:39.249371 135 tritonserver.cc:2483]
+----------------------------------+-------+
| Option | Value |
+----------------------------------+-------+
| server_id | triton |
| server_version | 2.41.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /opt/tritonserver/inflight_batcher_llm/ |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+-------+

I0119 11:58:39.249380 135 server.cc:307] Waiting for in-flight requests to complete.
I0119 11:58:39.249383 135 server.cc:323] Timeout 30: Found 0 model versions that have in-flight inferences
I0119 11:58:39.249386 135 server.cc:338] All models are stopped, unloading models
I0119 11:58:39.249388 135 server.cc:345] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

root@Amir-Dev-GPU:/opt/tritonserver#
root@Amir-Dev-GPU:/opt/tritonserver#
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

  Process name: [[8187,1],0]
  Exit code:    1
```
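The parser stops at line 29, column 17 of each config.pbtxt on a literal `$`, which typically means a `${...}` template placeholder from the tutorial was never replaced with a concrete value before the model repository was polled. A hedged diagnostic sketch, using only standard tools and the paths from the command above (these commands inspect the files, they change nothing):

```bash
# Show the exact line the protobuf parser complains about in one of the failing configs.
sed -n '29p' /opt/tritonserver/inflight_batcher_llm/ensemble/config.pbtxt

# List any unsubstituted ${...} placeholders across all five model directories.
grep -rn '\${' /opt/tritonserver/inflight_batcher_llm/*/config.pbtxt
```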

Triton Information
What version of Triton are you using? 2.41.0

Are you using the Triton container or did you build it yourself? Container

To Reproduce

  1. Convert the fine-tuned Llama2 model to TensorRT-LLM engine format using build.py.

  2. Run the Triton container:

     ```
     sudo docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
         -v /home/ubuntu/tensorrtllm_backend:/tensorrtllm_backend -v /home/ubuntu/models:/engines \
         nvcr.io/nvidia/tritonserver:23.12-py3
     ```

  3. Follow the steps given in Popular_Models_Guide/Llama2/trtllm_guide.md to configure the model repository (the placeholder-substitution part of this step is sketched after this list).

  4. Launch launch_triton_server.py, which throws the error:

     ```
     ...
     W0119 11:58:39.146633 135 server.cc:251] failed to enable peer access for some device pairs
     [libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:335] Error parsing text-format inference.ModelConfig: 29:17: Expected integer, got: $
     E0119 11:58:39.147422 135 model_repository_manager.cc:1325] Poll failed for model directory 'ensemble': failed to read text proto from /opt/tritonserver/inflight_batcher_llm/ensemble/config.pbtxt
     ...
     ```
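For step 3, the part that is easiest to miss is substituting the `${...}` placeholders that the template configs ship with. A hedged sketch of that substitution using the fill_template.py helper from tensorrtllm_backend; the key names and the /engines/... paths below are illustrative assumptions and should be checked against the guide for your backend version:

```bash
cd /tensorrtllm_backend
# Replace the ${...} placeholders in each config.pbtxt in place (-i).
# Key names and example values are assumptions based on the tutorial templates.
python3 tools/fill_template.py -i /opt/tritonserver/inflight_batcher_llm/preprocessing/config.pbtxt \
    tokenizer_dir:/engines/llama-2-7b-hf,triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i /opt/tritonserver/inflight_batcher_llm/postprocessing/config.pbtxt \
    tokenizer_dir:/engines/llama-2-7b-hf,triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i /opt/tritonserver/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
    triton_max_batch_size:64,decoupled_mode:False,batching_strategy:inflight_fused_batching,engine_dir:/engines/llama-2-7b-engine
```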

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

Expected behavior
The model should load and be served without error.

nv-kmcgill53 commented 9 months ago

CC: @jbkyang-nvi

jbkyang-nvi commented 9 months ago

Hello, unfortunately the tutorial is out of date because of rapid development on the TensorRT-LLM backend side. https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md is the more current guide.

It looks like you are missing some configurations in your config.pbtxt. Can you check that all the required parameters are filled in? If you can't figure out what is wrong, you can post your config.pbtxt here.
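For context on what to look for: the fields the parser trips on are numeric ones such as `max_batch_size`, which the templates ship as a placeholder (assumed here to be `${triton_max_batch_size}`, per the tutorial templates); after substitution there must be a plain integer in that position. A hedged check:

```bash
# A leftover placeholder looks like:     max_batch_size: ${triton_max_batch_size}
# A correctly filled config looks like:  max_batch_size: 64
grep -n 'max_batch_size' /opt/tritonserver/inflight_batcher_llm/*/config.pbtxt
```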

amir1m commented 9 months ago

Hi @jbkyang-nvi, thanks for your reply! I have followed these steps again; however, I am now getting a CUDA OOM error. I am running on an Ubuntu VM with two A10 GPUs (48 GB), and the model is a fine-tuned Llama-2 7B.

```
E0123 05:11:51.404662 1991 backend_model.cc:635] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] CUDA runtime error in ::cudaMallocAsync(ptr, n, mCudaStream->get()): out of memory (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmBuffers.h:112)
```

Attaching my tensorrt_llm/config.pbtxt config.pbtxt.txt
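For reference, the allocations that fail here are usually sized by the engine-build limits and by how much free GPU memory the in-flight batcher reserves for the KV cache. A hedged sketch of both knobs; the build.py path and flags follow the TensorRT-LLM Llama example of that era, and the kv_cache_free_gpu_mem_fraction key is assumed to be exposed by the config template, so verify both against your checkout:

```bash
# 1) Rebuild the engine with smaller limits so activations and KV cache fit the GPU.
#    Path and flag names are assumptions based on the TensorRT-LLM llama example.
python3 /tensorrtllm_backend/tensorrt_llm/examples/llama/build.py \
    --model_dir /engines/llama-2-7b-hf \
    --output_dir /engines/llama-2-7b-engine \
    --dtype float16 \
    --max_batch_size 8 --max_input_len 2048 --max_output_len 512

# 2) And/or cap the fraction of free GPU memory reserved for the KV cache at runtime.
python3 /tensorrtllm_backend/tools/fill_template.py -i \
    /opt/tritonserver/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
    kv_cache_free_gpu_mem_fraction:0.5
```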

shivamehta commented 8 months ago

Hi, I am getting the same error and am not sure how to solve it:

[screenshots of the error attached]

jbkyang-nvi commented 8 months ago

> Hi, I am getting the same error and am not sure how to solve it: [screenshots of the error attached]

Please follow the changes in https://github.com/triton-inference-server/tutorials/pull/81. I believe @oandreeva-nv might also have a similar fix.

jbkyang-nvi commented 6 months ago

https://github.com/triton-inference-server/tutorials/pull/81 has been merged. This should resolve the issue. Please respond and open a new ticket if the problem persists.

oandreeva-nv commented 6 months ago

There's an additional PR, https://github.com/triton-inference-server/tutorials/pull/91, which uses the latest 24.04 container and simplifies the build process.