modal-labs / modal-examples

Examples of programs built using Modal
https://modal.com/docs
MIT License

Unable to run TensorRT-LLM code: File not found error '/root/model/model_output/rank0.engine' #747

Closed avs20 closed 5 months ago

avs20 commented 6 months ago

Hi, I tried running the TensorRT-LLM example code from my local machine. The changes I made were increasing the max tokens for input and output to 7500 and reading the prompts from a txt file.

I also commented out the web API code, since I was going for batch inference.
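
Concretely, the change to the build configuration looked roughly like this (a sketch; the constant names in trtllm_llama.py and the exact trtllm-build flags may differ slightly from what I show here):

# Sketch of the context-length change; these constants feed the example's
# `trtllm-build` invocation via --max_input_len / --max_output_len.
MAX_INPUT_LEN = 7500    # raised from the example's much smaller default
MAX_OUTPUT_LEN = 7500   # raised from the example's much smaller default
MAX_BATCH_SIZE = 128    # unchanged

SIZE_ARGS = (
    f"--max_batch_size={MAX_BATCH_SIZE} "
    f"--max_input_len={MAX_INPUT_LEN} "
    f"--max_output_len={MAX_OUTPUT_LEN}"
)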

Here is the image id - Building image im-E7GCowSnqMBcKNtQ89JEoo

Here is my entrypoint function:

@app.local_entrypoint()
def main():
    # Example usage
    file_path = r'/Users/t0mkaka/Desktop/seekh/outfile copy.txt'  # Replace with your text file path
    prompts = extract_prompts(file_path)
    print(len(prompts))
    questions = prompts

    model = Model()
    model.generate.remote(questions)
    # if you're calling this service from another Python project,
    # use [`Model.lookup`](https://modal.com/docs/reference/modal.Cls#lookup)
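
For completeness, extract_prompts is a small helper that reads one prompt per line from the text file (sketched below; the exact parsing isn't important to the error):

def extract_prompts(file_path: str) -> list[str]:
    # Read the file and treat each non-empty line as one prompt.
    with open(file_path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]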

Here is the traceback:

/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/31/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[05/18/2024-18:57:35] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/31/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/ln_f/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[05/18/2024-18:57:35] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/ln_f/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/ln_f/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[05/18/2024-18:57:35] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[05/18/2024-18:57:35] [TRT] [W] Unused Input: position_ids
[05/18/2024-18:57:35] [TRT] [W] Detected layernorm nodes in FP16.
[05/18/2024-18:57:35] [TRT] [W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[05/18/2024-18:57:35] [TRT] [W] [RemoveDeadLayers] Input Tensor position_ids is unused or used only at compile-time, but is not being removed.
[05/18/2024-18:57:35] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 2543, GPU 803 (MiB)
[05/18/2024-18:57:35] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 2544, GPU 813 (MiB)
[05/18/2024-18:57:35] [TRT] [W] TensorRT was linked against cuDNN 8.9.6 but loaded cuDNN 8.9.2
[05/18/2024-18:57:35] [TRT] [I] Global timing cache in use. Profiling results in this builder pass will be stored.
[05/18/2024-18:57:45] [TRT] [W] Tactic Device request: 60144MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:45] [TRT] [W] UNSUPPORTED_STATESkipping tactic 0 due to insufficient memory on requested size of 60144 detected for tactic 0x0000000000000000.
[05/18/2024-18:57:45] [TRT] [W] Tactic Device request: 60144MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:45] [TRT] [W] UNSUPPORTED_STATESkipping tactic 0 due to insufficient memory on requested size of 60144 detected for tactic 0x0000000000000000.
[05/18/2024-18:57:46] [TRT] [W] Tactic Device request: 157500MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:46] [TRT] [W] UNSUPPORTED_STATESkipping tactic 0 due to insufficient memory on requested size of 157500 detected for tactic 0x0000000000000000.
[05/18/2024-18:57:46] [TRT] [W] Tactic Device request: 157500MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:46] [TRT] [W] UNSUPPORTED_STATESkipping tactic 1 due to insufficient memory on requested size of 157500 detected for tactic 0x0000000000000001.
[05/18/2024-18:57:46] [TRT] [W] Tactic Device request: 157500MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:46] [TRT] [W] UNSUPPORTED_STATESkipping tactic 2 due to insufficient memory on requested size of 157500 detected for tactic 0x0000000000000002.
[05/18/2024-18:57:46] [TRT] [W] Tactic Device request: 157500MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:46] [TRT] [W] UNSUPPORTED_STATESkipping tactic 3 due to insufficient memory on requested size of 157500 detected for tactic 0x0000000000000003.
[05/18/2024-18:57:47] [TRT] [W] Tactic Device request: 157500MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:47] [TRT] [W] UNSUPPORTED_STATESkipping tactic 4 due to insufficient memory on requested size of 157500 detected for tactic 0x0000000000000004.
[05/18/2024-18:57:47] [TRT] [W] Tactic Device request: 157500MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:47] [TRT] [W] UNSUPPORTED_STATESkipping tactic 5 due to insufficient memory on requested size of 157500 detected for tactic 0x0000000000000005.
[05/18/2024-18:57:47] [TRT] [W] Tactic Device request: 157500MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:47] [TRT] [W] UNSUPPORTED_STATESkipping tactic 6 due to insufficient memory on requested size of 157500 detected for tactic 0x0000000000000006.
[05/18/2024-18:57:47] [TRT] [W] Tactic Device request: 157500MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:47] [TRT] [W] UNSUPPORTED_STATESkipping tactic 7 due to insufficient memory on requested size of 157500 detected for tactic 0x0000000000000007.
[05/18/2024-18:57:48] [TRT] [W] Tactic Device request: 157500MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:48] [TRT] [W] UNSUPPORTED_STATESkipping tactic 8 due to insufficient memory on requested size of 157500 detected for tactic 0x0000000000000008.
[05/18/2024-18:57:48] [TRT] [W] Tactic Device request: 157500MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:48] [TRT] [W] UNSUPPORTED_STATESkipping tactic 9 due to insufficient memory on requested size of 157500 detected for tactic 0x0000000000000009.
[05/18/2024-18:57:48] [TRT] [W] Tactic Device request: 157500MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:48] [TRT] [W] UNSUPPORTED_STATESkipping tactic 10 due to insufficient memory on requested size of 157500 detected for tactic 0x000000000000001c.
[05/18/2024-18:57:48] [TRT] [W] Tactic Device request: 78750MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:48] [TRT] [W] UNSUPPORTED_STATESkipping tactic 0 due to insufficient memory on requested size of 78750 detected for tactic 0x0000000000000000.
[05/18/2024-18:57:48] [TRT] [W] Tactic Device request: 78750MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:48] [TRT] [W] UNSUPPORTED_STATESkipping tactic 1 due to insufficient memory on requested size of 78750 detected for tactic 0x0000000000000001.
[05/18/2024-18:57:49] [TRT] [W] Tactic Device request: 78750MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:49] [TRT] [W] UNSUPPORTED_STATESkipping tactic 2 due to insufficient memory on requested size of 78750 detected for tactic 0x0000000000000002.
[05/18/2024-18:57:49] [TRT] [W] Tactic Device request: 78750MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:49] [TRT] [W] UNSUPPORTED_STATESkipping tactic 3 due to insufficient memory on requested size of 78750 detected for tactic 0x0000000000000003.
[05/18/2024-18:57:49] [TRT] [W] Tactic Device request: 78750MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:49] [TRT] [W] UNSUPPORTED_STATESkipping tactic 4 due to insufficient memory on requested size of 78750 detected for tactic 0x0000000000000004.
[05/18/2024-18:57:49] [TRT] [W] Tactic Device request: 78750MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:49] [TRT] [W] UNSUPPORTED_STATESkipping tactic 5 due to insufficient memory on requested size of 78750 detected for tactic 0x0000000000000005.
[05/18/2024-18:57:50] [TRT] [W] Tactic Device request: 78750MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:50] [TRT] [W] UNSUPPORTED_STATESkipping tactic 6 due to insufficient memory on requested size of 78750 detected for tactic 0x0000000000000006.
[05/18/2024-18:57:50] [TRT] [W] Tactic Device request: 78750MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:50] [TRT] [W] UNSUPPORTED_STATESkipping tactic 7 due to insufficient memory on requested size of 78750 detected for tactic 0x0000000000000007.
[05/18/2024-18:57:50] [TRT] [W] Tactic Device request: 78750MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:50] [TRT] [W] UNSUPPORTED_STATESkipping tactic 8 due to insufficient memory on requested size of 78750 detected for tactic 0x0000000000000008.
[05/18/2024-18:57:50] [TRT] [W] Tactic Device request: 78750MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:50] [TRT] [W] UNSUPPORTED_STATESkipping tactic 9 due to insufficient memory on requested size of 78750 detected for tactic 0x0000000000000009.
[05/18/2024-18:57:51] [TRT] [W] Tactic Device request: 78750MB Available: 40339MB. Device memory is insufficient to use tactic.
[05/18/2024-18:57:51] [TRT] [W] UNSUPPORTED_STATESkipping tactic 10 due to insufficient memory on requested size of 78750 detected for tactic 0x000000000000001c.
[05/18/2024-18:57:51] [TRT] [E] 10: Could not find any implementation for node PWN(PWN(PWN(LLaMAForCausalLM/transformer/layers/0/mlp/SIGMOID_0), PWN(LLaMAForCausalLM/transformer/layers/0/mlp/ELEMENTWISE_PROD_0)), PWN(LLaMAForCausalLM/transformer/layers/0/mlp/ELEMENTWISE_PROD_1)).
[05/18/2024-18:57:51] [TRT] [E] 10: [optimizer.cpp::computeCosts::4048] Error Code 10: Internal Error (Could not find any implementation for node PWN(PWN(PWN(LLaMAForCausalLM/transformer/layers/0/mlp/SIGMOID_0), PWN(LLaMAForCausalLM/transformer/layers/0/mlp/ELEMENTWISE_PROD_0)), PWN(LLaMAForCausalLM/transformer/layers/0/mlp/ELEMENTWISE_PROD_1)).)
[05/18/2024-18:57:51] [TRT-LLM] [E] Engine building failed, please check the error log.
[05/18/2024-18:57:51] [TRT] [I] Serialized 59 bytes of code generator cache.
[05/18/2024-18:57:51] [TRT] [I] Serialized 144872 bytes of compilation cache.
[05/18/2024-18:57:51] [TRT] [I] Serialized 0 timing cache entries
[05/18/2024-18:57:51] [TRT-LLM] [I] Timing cache serialized to model.cache
[05/18/2024-18:57:51] [TRT-LLM] [I] Total time of building all engines: 00:00:57
Creating image snapshot...
Finished snapshot; took 7.64s

Built image im-E7GCowSnqMBcKNtQ89JEoo in 82.38s
Building image im-S5Pa5SxFXovfCDwRwAehtz

=> Step 0: FROM base

=> Step 1: ENV TLLM_LOG_LEVEL=INFO
Creating image snapshot...
Finished snapshot; took 5.10s

Built image im-S5Pa5SxFXovfCDwRwAehtz in 8.37s
✓ Created objects.
├── 🔨 Created mount /Users/t0mkaka/Desktop/llama-3/modal-examples/06_gpu_and_ml/llm-serving/trtllm_llama.py
├── 🔨 Created download_model.
└── 🔨 Created Model.generate.
50

==========
== CUDA ==
==========

CUDA Version 12.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

🥶 Cold boot: spinning up TRT-LLM engine
[05/18/2024-18:58:28] PyTorch version 2.2.2 available.
[05/18/2024-18:58:29] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
[TensorRT-LLM][INFO] Set logger level by INFO
[05/18/2024-18:58:30] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024042300
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Runner failed with exception: FileNotFoundError(2, 'No such file or directory')
Traceback (most recent call last):
  File "/pkg/modal/_container_io_manager.py", line 457, in handle_user_exception
    yield
  File "/pkg/modal/_container_entrypoint.py", line 463, in call_lifecycle_functions
    res = func(
  File "/root/trtllm_llama.py", line 287, in load
    self.model = ModelRunner.from_dir(**runner_kwargs)
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner.py", line 624, in from_dir
    engine = Engine.from_dir(engine_dir, rank)
  File "/usr/local/lib/python3.10/site-packages/tensorrt_llm/builder.py", line 591, in from_dir
    with open(os.path.join(engine_dir, f'rank{rank}.engine'), 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/root/model/model_output/rank0.engine'
Stopping app - uncaught exception raised locally: FileNotFoundError(2, 'No such file or directory').
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/t0mkaka/Desktop/llama-3/modal-examples/06_gpu_and_ml/llm-serving/trtllm_llama.py:423 in   │
│ main                                                                                             │
│                                                                                                  │
│   422 │   model = Model()                                                                        │
│ ❱ 423 │   model.generate.remote(questions)                                                       │
│   424 │   # if you're calling this service from another Python project,                          │
│                                                                                                  │
│               ...Remote call to Modal Function (ta-01HY6H39CJ71QT5AMST7DAPRRS)...                │
│                                                                                                  │
│ /root/trtllm_llama.py:287 in load                                                                │
│                                                                                                  │
│ ❱ 287 self.model = ModelRunner.from_dir(**runner_kwargs)                                         │
│                                                                                                  │
│                                                                                                  │
│ /usr/local/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner.py:624 in from_dir     │
│                                                                                                  │
│ ❱ 624 engine = Engine.from_dir(engine_dir, rank)                                                 │
│                                                                                                  │
│                                                                                                  │
│ /usr/local/lib/python3.10/site-packages/tensorrt_llm/builder.py:591 in from_dir                  │
│                                                                                                  │
│ ❱ 591 with open(os.path.join(engine_dir, f'rank{rank}.engine'), 'rb') as f:      

Am I doing anything wrong?

charlesfrye commented 5 months ago

Hi @avs20!

Your engine build is failing because the context sizes are too large to fit in the VRAM of the GPU configuration you are using. You should set the sizes of the input and output contexts very carefully to match your intended workload, because you pay a hefty penalty in VRAM (and hence in cost of inference) to support such long contexts.
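
As a rough back-of-the-envelope, assuming Llama 3 8B's shape (32 layers, 8 KV heads, head dim 128) and an fp16 KV cache:

# Per-token KV cache cost, Llama 3 8B (assumed shape), fp16
layers, kv_heads, head_dim, bytes_per_val = 32, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V
print(per_token / 1024)            # ~128 KiB of cache per token of context
print(15_000 * per_token / 2**30)  # ~1.8 GiB per 15k-token sequence

And that is per sequence, so it scales with batch size; at build time the candidate kernels ("tactics") also need workspace sized to those maximum shapes, which is where the 60-157GB "Tactic Device request" warnings in your log come from.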

If you actually need 15k tokens of total input/output context, you're going to need to distribute the model across multiple GPUs -- my guess is that should fit on 8x80GB A100s/H100s for an 8B model with small enough batch sizes, but you'd have to do the math to be sure. Consider quantizing the weights and KV cache as well.
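
As a very rough sketch of what that might look like (the GPU config is Modal's, the flags follow TensorRT-LLM 0.10's llama convert_checkpoint.py; treat all the names below as assumptions and check them against the example's actual build step before copying):

import modal

# Eight 80GB cards instead of a single 40GB card (assumed Modal GPU config).
GPU_CONFIG = modal.gpu.A100(size="80GB", count=8)

N_GPUS = 8
# Shard the checkpoint with tensor parallelism and quantize the weights and
# KV cache to int8 (assumed convert_checkpoint.py flags).
CONVERSION_ARGS = (
    f"--dtype=float16 --tp_size={N_GPUS} "
    "--use_weight_only --weight_only_precision=int8 --int8_kv_cache"
)

Even with more GPUs, keep the max input/output lengths only as large as your longest real prompt and completion.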

If you want to follow up further, hit me up in the Modal slack (https://modal.com/slack).