janhq / cortex.cpp

Run and customize Local LLMs.
https://cortex.so

Discussion: Cortex.cpp Engine Dependencies Architecture (e.g. CUDA Toolkit) #1046

Closed. namchuai closed this issue 2 weeks ago.

namchuai commented 3 weeks ago

Motivation

  1. Do we package the CUDA toolkit with the engine? If yes, we will have to do the same for llamacpp, tensorrt-llm, and onnx. If no, we will download it separately.

  2. Folder structure (e.g. if the user has llamacpp and tensorrt-llm installed at the same time)?

Resources: Llamacpp release. Currently, we are downloading the toolkit dependency via https://catalog.jan.ai/dist/cuda-dependencies/<version>/<platform>/cuda.tar.gz
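
For illustration, here is a minimal sketch of that separate-download flow, assuming libcurl is available and a tar executable is on PATH; only the URL pattern comes from above, and the destination path and helper names are hypothetical.

    // Hypothetical sketch: fetch cuda.tar.gz for a given version/platform and
    // unpack it into an engine's deps folder. Not cortex.cpp code.
    #include <curl/curl.h>
    #include <cstdio>
    #include <cstdlib>
    #include <string>

    static size_t WriteToFile(void* data, size_t size, size_t nmemb, void* stream) {
      return std::fwrite(data, size, nmemb, static_cast<std::FILE*>(stream));
    }

    bool DownloadCudaDeps(const std::string& version, const std::string& platform) {
      const std::string url = "https://catalog.jan.ai/dist/cuda-dependencies/" +
                              version + "/" + platform + "/cuda.tar.gz";

      std::FILE* out = std::fopen("cuda.tar.gz", "wb");
      if (!out) return false;
      CURL* curl = curl_easy_init();
      if (!curl) { std::fclose(out); return false; }

      curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
      curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
      curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteToFile);
      curl_easy_setopt(curl, CURLOPT_WRITEDATA, out);
      const CURLcode res = curl_easy_perform(curl);
      curl_easy_cleanup(curl);
      std::fclose(out);

      // Unpack into the engine-specific deps folder (hypothetical path).
      return res == CURLE_OK &&
             std::system("tar -xzf cuda.tar.gz -C engines/cortex.llamacpp/deps") == 0;
    }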

cc @vansangpfiev @nguyenhoangthuan99 @dan-homebrew

Update sub-tasks:

vansangpfiev commented 3 weeks ago

From my perspective, we should download the CUDA toolkit separately. We support multiple engines: cortex.llamacpp and cortex.tensorrt-llm, and both need the CUDA toolkit to run. CUDA is backward compatible, so we only need the latest CUDA toolkit version supported by the installed nvidia-driver version. For example:

Edit: I just checked the CUDA compatibility matrix, and it is not correct that CUDA is always backward compatible. (screenshot: CUDA / driver compatibility matrix)

Related ticket: https://github.com/janhq/cortex/issues/1047

Edit 2: The above image shows forward compatibility between the CUDA version and the nvidia-driver version. (screenshot)

From CUDA 11 onwards, applications compiled with a CUDA Toolkit release from within a CUDA major release family can run, with limited feature-set, on systems having at least the minimum required driver version.

So yes, CUDA is backward compatible within a CUDA major release. Reference: https://docs.nvidia.com/deploy/cuda-compatibility/#minor-version-compatibility
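
As a concrete illustration (a sketch only, not cortex.cpp code), an engine could check at startup whether the installed driver is new enough for the toolkit it was built against, using the standard CUDA runtime API; the major-version rule encoded here is an assumption based on the minor-version-compatibility note above.

    // Sketch: compare the highest CUDA version supported by the installed
    // driver with the runtime version this binary was built against.
    // Versions are encoded as major*1000 + minor*10 (e.g. 12040 = 12.4).
    #include <cuda_runtime_api.h>
    #include <cstdio>

    bool DriverSupportsBundledToolkit() {
      int driver = 0, runtime = 0;
      cudaDriverGetVersion(&driver);    // reports 0 if no NVIDIA driver is installed
      cudaRuntimeGetVersion(&runtime);

      std::printf("driver supports CUDA %d.%d, bundled runtime is %d.%d\n",
                  driver / 1000, (driver % 1000) / 10,
                  runtime / 1000, (runtime % 1000) / 10);

      // Minor-version compatibility: within one major release a newer runtime
      // can run on an older driver, so only a missing driver or an older
      // driver major version is rejected here (forward-compat packages aside).
      return driver != 0 && driver / 1000 >= runtime / 1000;
    }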

nguyenhoangthuan99 commented 3 weeks ago

I also think we need to download the CUDA toolkit separately. Both tensorrt-llm and llamacpp require CUDA. In addition, the tensorrt-llm package (~1 GB) does not include the CUDA toolkit libraries (cublas, cusparse, ..., which are very heavy, ~400 MB), so if we decide to pack everything into one package for both tensorrt-llm and llamacpp, the size will increase. (screenshot)

namchuai commented 3 weeks ago

I'm referring to this table to check the compatibility between the driver and the toolkit: https://docs.nvidia.com/deeplearning/cudnn/latest/reference/support-matrix.html#gpu-cuda-toolkit-and-cuda-driver-requirements

dan-homebrew commented 3 weeks ago

Can I verify my understanding of the issue:

Decision: For Nvidia GPU users, the different engines have CUDA dependencies that are large 200-400mb downloads. The options are:

  1. Per-engine CUDA dependencies (i.e. install separately)
  2. Download 1 CUDA Toolkit for all engines

My initial thoughts

Option 1 (per-engine dependencies) will be disk-space inefficient. However, the alternative seems to be dependency hell, which I think is even worse.

Folder Structure

/cortex
    /engines
        /llama.cpp-extension
            /deps                               # CUDA dlls
        /tensorrt-llm-extension
            /deps                               # CUDA dlls

That said, I am open to all ideas, especially @vansangpfiev's.

vansangpfiev commented 3 weeks ago

If disk-space inefficiency is acceptable, I think we can go with option 1. Please note that we will have some blockers for this option:

namchuai commented 3 weeks ago

Thanks @vansangpfiev and @dan-homebrew

I'm confirming that we agree on the following. Question 1: Packaging the CUDA toolkit dependencies into the corresponding engine. Caveats:

Question 2: Storing CUDA dependencies under corresponding engines.

/cortex
    /engines
        /cortex.llamacpp
            /deps                               # CUDA dlls
        /cortex.tensorrt-llm
            /deps                               # CUDA dlls

Caveats:

Additional thought, @vansangpfiev: when we change the CI for the engines, could we associate a file which contains the engine version and info about its dependencies? This will help the engine list command in the future. wdyt? cc @nguyenhoangthuan99

0xSage commented 2 weeks ago
  1. What if llamacpp vs tensorrtllm dependencies start to conflict?

  2. Do we care about engine portability? And does doing a dynamic library search path on Windows affect portability?

  3. How will we do maintenance and updates? i.e.

    • cortex update requires dependency bumps
    • cortex update doesn't require dependency bump (easier)
  4. Is this a dumb idea: store CUDA dependencies in a central location, such as a separate deps directory at the project root, and then use symbolic links or environment variables to point to the engine-specific dependencies.

/.cortex
    /deps
        /cuda
            cuda-11.5 or whatever versioning
    /engines
        /cortex.llamacpp
            /bin
        /cortex.tensorrt-llm
            /bin
  5. Are there dependency mgmt tools we can use to manage this better?
namchuai commented 2 weeks ago

@0xSage, here are my thoughts. Please correct me if I'm wrong @nguyenhoangthuan99 @vansangpfiev

  1. What if llamacpp vs tensorrtllm dependencies start to conflict?
    • Yeah, that's why we are separating the dependencies for cortex.llamacpp and cortex.tensorrt-llm:
      /engines
          /cortex.llamacpp
              /deps
          /cortex.tensorrt-llm
              /deps
  2. Do we care about engine portability. And does doing a dynamic library search path on windows affect portability.
    • Hmm, I might not really get what you mean by portability. Regarding the dynamic library search path: on Windows, the program will by default search for DLLs in the current path (IIRC, the same path as the executable). Since we are about to put the dependencies under cortex.llamacpp/deps and cortex.tensorrt-llm/deps, we need to tell the OS where to look for the DLLs (see the sketch after this list).
  3. How will we do maintenance and updates? I think this is a good question. We haven't decided on this yet. WDYT? @vansangpfiev @dan-homebrew @nguyenhoangthuan99 @hiento09
  4. Separating out the CUDA dependencies? I think this is a good idea, but separate DLLs might cost us some effort to handle properly. I'm not sure. WDYT? @vansangpfiev @dan-homebrew @nguyenhoangthuan99
  5. Are there dependency mgmt tools we can use to manage this better? I think not. Currently, I think we only have cortexcpp, and the default behavior of install is to replace.
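
On point 2, here is a minimal sketch of telling the Windows loader about an engine's private deps folder via the standard Win32 AddDllDirectory/LoadLibraryEx mechanism; the paths and the engine DLL name are assumptions, not the actual cortex.cpp layout.

    // Hypothetical sketch for Windows: register the engine's deps folder so the
    // engine's CUDA imports (cublas, cudart, ...) resolve from there instead of
    // the executable's directory.
    #ifdef _WIN32
    #include <windows.h>

    HMODULE LoadEngineWithDeps() {
      // LOAD_LIBRARY_SEARCH_DEFAULT_DIRS covers the application directory,
      // System32, and any directories added via AddDllDirectory.
      SetDefaultDllDirectories(LOAD_LIBRARY_SEARCH_DEFAULT_DIRS);

      // Register the per-engine CUDA dependency folder (hypothetical path).
      AddDllDirectory(L"C:\\cortex\\engines\\cortex.llamacpp\\deps");

      // Load the engine DLL (name is a placeholder); its static dependencies
      // are now searched in the registered directories as well.
      return LoadLibraryExW(L"C:\\cortex\\engines\\cortex.llamacpp\\engine.dll",
                            nullptr, LOAD_LIBRARY_SEARCH_DEFAULT_DIRS);
    }
    #endif
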
vansangpfiev commented 2 weeks ago

For 3, I think we can do maintenance and updates by versioning: generate a file (for example version.txt) for each release, which has metadata for the engine version and the CUDA version. We will update the CUDA dependencies if needed. For 4, I think it is easier for us to locate all CUDA dependencies in the same folder as the engine, because then we don't need to check which CUDA version is in use for which engine version.
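
For example, the per-release metadata file could look something like this; the field names and values are hypothetical, just to illustrate the idea.

    # engines/cortex.llamacpp/version.txt (hypothetical contents)
    engine: cortex.llamacpp
    engine_version: 0.1.34
    cuda_version: 12.4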

dan-homebrew commented 2 weeks ago

@vansangpfiev @namchuai @0xSage Quick responses:

Per-Engine Dependencies

Is this a dumb idea: store CUDA dependencies in a central location, such as a separate deps directory at the project root, and then use symbolic links or environment variables to point to the engine-specific dependencies.

I also agree with @vansangpfiev: let's co-locate all CUDA dependencies with the engine folder.

Simple > Complex, especially since model files are >4gb.

Updating Engines

For 3, I think we can do the maintenance and updates by versioning: generate a file (for example version.txt) for each release, which has metadata for engine version and cuda version. We will update cuda dependencies if needed.

I also think we need to think through the CLI and API commands:

cortex engines update tensorrt-llm
PUT <API URL>? 

Naming

I wonder whether it is better for us to have clearer naming for Cortex engines:

This articulates the concept of Cortex engines more clearly. Hopefully, with a clear API, the community can also step in to help build backends.

We would need to reason through cortex.python separately.