janhq / cortex.cpp

Run and customize Local LLMs.
https://cortex.so

Discussion: Cortex.cpp Engine Dependencies Architecture (e.g. CUDA Toolkit) #1046

Closed. namchuai closed this issue 2 weeks ago.

namchuai commented 3 weeks ago

Motivation

  1. Do we package the CUDA toolkit with the engine? If yes, we will have to do the same for llamacpp, tensorrt-llm, and onnx. If no, we will download it separately.

  2. Folder structure (e.g. if the user has llamacpp and tensorrt-llm installed at the same time)?

Resources: Llamacpp release. Currently, we are downloading the toolkit dependency via https://catalog.jan.ai/dist/cuda-dependencies/<version>/<platform>/cuda.tar.gz
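
For illustration, here is a minimal sketch of that separate-download flow, assuming libcurl is available and a tar executable is on PATH; only the URL pattern comes from above, and the destination path and helper names are hypothetical.

    // Hypothetical sketch: fetch cuda.tar.gz for a given version/platform and
    // unpack it into an engine's deps folder. Not cortex.cpp code.
    #include <curl/curl.h>
    #include <cstdio>
    #include <cstdlib>
    #include <string>

    static size_t WriteToFile(void* data, size_t size, size_t nmemb, void* stream) {
      return std::fwrite(data, size, nmemb, static_cast<std::FILE*>(stream));
    }

    bool DownloadCudaDeps(const std::string& version, const std::string& platform) {
      const std::string url = "https://catalog.jan.ai/dist/cuda-dependencies/" +
                              version + "/" + platform + "/cuda.tar.gz";

      std::FILE* out = std::fopen("cuda.tar.gz", "wb");
      if (!out) return false;
      CURL* curl = curl_easy_init();
      if (!curl) { std::fclose(out); return false; }

      curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
      curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
      curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteToFile);
      curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);
      const CURLcode res = curl_easy_perform(curl);
      curl_easy_cleanup(curl);
      std::fclose(out);

      // Unpack into the engine-specific deps folder (hypothetical path).
      return res == CURLE_OK &&
             std::system("tar -xzf cuda.tar.gz -C engines/cortex.llamacpp/deps") == 0;
    }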

cc @vansangpfiev @nguyenhoangthuan99 @dan-homebrew

Update sub-tasks:

vansangpfiev commented 3 weeks ago

From my perspective, we should download the CUDA toolkit separately. We support multiple engines: cortex.llamacpp and cortex.tensorrt-llm, and both need the CUDA toolkit to run. CUDA is backward compatible, so we only need the latest CUDA toolkit version supported by the installed nvidia-driver version. For example:

Edit: I just checked the CUDA compatibility matrix, and it is not correct that CUDA is always backward compatible. (screenshot: CUDA / driver compatibility matrix)

Related ticket: https://github.com/janhq/cortex/issues/1047

Edit 2: The above image shows forward compatibility between the CUDA version and the nvidia-driver version. (screenshot)

From CUDA 11 onwards, applications compiled with a CUDA Toolkit release from within a CUDA major release family can run, with limited feature-set, on systems having at least the minimum required driver version.

So yes, CUDA is backward compatible within a CUDA major release. Reference: https://docs.nvidia.com/deploy/cuda-compatibility/#minor-version-compatibility
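
As a concrete illustration (a sketch only, not cortex.cpp code), an engine could check at startup whether the installed driver is new enough for the toolkit it was built against, using the standard CUDA runtime API; the major-version rule encoded here is an assumption based on the minor-version-compatibility note above.

    // Sketch: compare the highest CUDA version supported by the installed
    // driver with the runtime version this binary was built against.
    // Versions are encoded as major*1000 + minor*10 (e.g. 12040 = 12.4).
    #include <cuda_runtime_api.h>
    #include <cstdio>

    bool DriverSupportsBundledToolkit() {
      int driver = 0, runtime = 0;
      cudaDriverGetVersion(&driver);    // reports 0 if no NVIDIA driver is installed
      cudaRuntimeGetVersion(&runtime);

      std::printf("driver supports CUDA %d.%d, bundled runtime is %d.%d\n",
                  driver / 1000, (driver % 1000) / 10,
                  runtime / 1000, (runtime % 1000) / 10);

      // Minor-version compatibility: within one major release a newer runtime
      // can run on an older driver, so only a missing driver or an older
      // driver major version is rejected here (forward-compat packages aside).
      return driver != 0 && driver / 1000 >= runtime / 1000;
    }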

nguyenhoangthuan99 commented 3 weeks ago

I also think we need to download the CUDA toolkit separately. Both tensorrt-llm and llamacpp require CUDA. In addition, the tensorrt-llm package (~1 GB) does not include the CUDA toolkit libraries (cublas, cusparse, ..., which are very heavy, ~400 MB), so if we decide to pack everything into one package for both tensorrt-llm and llamacpp, the size will increase. (screenshot)

namchuai commented 3 weeks ago

I'm referring to this table to check the compatibility between the driver and the toolkit: https://docs.nvidia.com/deeplearning/cudnn/latest/reference/support-matrix.html#gpu-cuda-toolkit-and-cuda-driver-requirements

dan-homebrew commented 3 weeks ago

Can I verify my understanding of the issue:

Decision: For Nvidia GPU users, the different engines have CUDA dependencies that are large 200-400mb downloads. The options are:

  1. Per-engine CUDA dependencies (i.e. install separately)
  2. Download 1 CUDA Toolkit for all engines

My initial thoughts

Option 1 (per-engine dependencies) will be disk-space inefficient. However, the alternative seems to be dependency hell, which I think is even worse.

Folder Structure

/cortex
    /engines
        /llama.cpp-extension
            /deps                               # CUDA dlls
        /tensorrt-llm-extension
            /deps                               # CUDA dlls

That said, I am open to all ideas, especially @vansangpfiev's.

vansangpfiev commented 3 weeks ago

If disk-space inefficiency is acceptable, I think we can go with option 1. Please note that we will have some blockers for this option:

namchuai commented 3 weeks ago

Thanks @vansangpfiev and @dan-homebrew

I'm confirming that we agree on the following. Question 1: Packaging the CUDA toolkit dependencies into the corresponding engine. Caveats:

Question 2: Storing CUDA dependencies under corresponding engines.

/cortex
    /engines
        /cortex.llamacpp
            /deps                               # CUDA dlls
        /cortex.tensorrt-llm
            /deps                               # CUDA dlls

Caveats:

Additional thought, @vansangpfiev: when we change the CI for the engines, could we associate a file which contains the engine version and info about its dependencies? This will help the engine list command in the future. wdyt? cc @nguyenhoangthuan99

0xSage commented 2 weeks ago
  1. What if llamacpp vs tensorrtllm dependencies start to conflict?

  2. Do we care about engine portability? And does doing a dynamic library search path on Windows affect portability?

  3. How will we do maintenance and updates? i.e.

    • cortex update requires dependency bumps
    • cortex update doesn't require dependency bump (easier)
  4. Is this a dumb idea: store CUDA dependencies in a central location, such as a separate deps directory at the project root, and then use symbolic links or environment variables to point to the engine-specific dependencies.

/.cortex
    /deps
        /cuda
            cuda-11.5 or whatever versioning
    /engines
        /cortex.llamacpp
            /bin
        /cortex.tensorrt-llm
            /bin
  5. Are there dependency mgmt tools we can use to manage this better?
namchuai commented 2 weeks ago

@0xSage, here are my thoughts. Please correct me if I'm wrong @nguyenhoangthuan99 @vansangpfiev

  1. What if llamacpp vs tensorrtllm dependencies start to conflict?
    • Yeah, that's why we are separating the dependencies for cortex.llamacpp and cortex.tensorrt-llm:
      /engines
          /cortex.llamacpp
              /deps
          /cortex.tensorrt-llm
              /deps
  2. Do we care about engine portability. And does doing a dynamic library search path on windows affect portability.
    • Hmm, I might not really get what you mean by portability. Regarding the dynamic library search path: on Windows, the program will by default search for DLLs in the current path (IIRC, the same path as the executable). Since we are about to put the dependencies under cortex.llamacpp/deps and cortex.tensorrt-llm/deps, we need to tell the OS where to look for the DLLs (see the sketch after this list).
  3. How will we do maintenance and updates? I think this is a good question. We haven't decided on this yet. WDYT? @vansangpfiev @dan-homebrew @nguyenhoangthuan99 @hiento09
  4. Separating out the CUDA dependencies? I think this is a good idea, but separate DLLs might cost us some effort to handle properly. I'm not sure. WDYT? @vansangpfiev @dan-homebrew @nguyenhoangthuan99
  5. Are there dependency mgmt tools we can use to manage this better? I think not. Currently, I think we only have cortexcpp, and the default behavior of install is to replace.
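
On point 2, here is a minimal sketch of telling the Windows loader about an engine's private deps folder via the standard Win32 AddDllDirectory/LoadLibraryEx mechanism; the paths and the engine DLL name are assumptions, not the actual cortex.cpp layout.

    // Hypothetical sketch for Windows: register the engine's deps folder so the
    // engine's CUDA imports (cublas, cudart, ...) resolve from there instead of
    // the executable's directory.
    #ifdef _WIN32
    #include <windows.h>

    HMODULE LoadEngineWithDeps() {
      // LOAD_LIBRARY_SEARCH_DEFAULT_DIRS covers the application directory,
      // System32, and any directories added via AddDllDirectory.
      SetDefaultDllDirectories(LOAD_LIBRARY_SEARCH_DEFAULT_DIRS);

      // Register the per-engine CUDA dependency folder (hypothetical path).
      AddDllDirectory(L"C:\\cortex\\engines\\cortex.llamacpp\\deps");

      // Load the engine DLL (name is a placeholder); its static dependencies
      // are now searched in the registered directories as well.
      return LoadLibraryExW(L"C:\\cortex\\engines\\cortex.llamacpp\\engine.dll",
                            nullptr, LOAD_LIBRARY_SEARCH_DEFAULT_DIRS);
    }
    #endif
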
vansangpfiev commented 2 weeks ago

For 3, I think we can do maintenance and updates by versioning: generate a file (for example version.txt) for each release, which has metadata for the engine version and the CUDA version. We will update the CUDA dependencies if needed. For 4, I think it is easier for us to locate all CUDA dependencies in the same folder as the engine, because then we don't need to check which CUDA version is in use for which engine version.
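
For example, the per-release metadata file could look something like this; the field names and values are hypothetical, just to illustrate the idea.

    # engines/cortex.llamacpp/version.txt (hypothetical contents)
    engine: cortex.llamacpp
    engine_version: 0.1.34
    cuda_version: 12.4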

dan-homebrew commented 2 weeks ago

@vansangpfiev @namchuai @0xSage Quick responses:

Per-Engine Dependencies

Is this a dumb idea: store CUDA dependencies in a central location, such as a separate deps directory at the project root, and then use symbolic links or environment variables to point to the engine-specific dependencies.

I also agree with @vansangpfiev: let's co-locate all CUDA dependencies with the engine folder.

Simple > Complex, especially since model files are >4gb.

Updating Engines

For 3, I think we can do the maintenance and updates by versioning: generate a file (for example version.txt) for each release, which has metadata for engine version and cuda version. We will update cuda dependencies if needed.

I also think we need to think through the CLI and API commands:

cortex engines update tensorrt-llm
PUT <API URL>? 

Naming

I wonder whether it is better for us to have clearer naming for Cortex engines:

This articulates the concept of Cortex engines more clearly. Hopefully, with a clear API, the community can also step in to help build backends.

We would need to reason through cortex.python separately.