mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Feature Request] Naming convention for TVM compiled .so binaries for each model #1021

Closed · denise-k closed this issue 5 months ago

denise-k commented 1 year ago

cc @junrushao

🚀 Feature

The latest version of TVM enables different multi-GPU sharding configurations to reuse the same parameter weight shards. So, if you compile a model via MLC with options for 1-GPU, 2-GPU, and 4-GPU sharding (for example), you could reuse the same params_shard_*.bin files and only switch the compiled .so binary.

Currently, the naming convention for the .so binary is simply the model name. I am proposing to update the naming convention to specify num_shards within the name of the compiled binary, and keep them within one model directory, as shown below:

Current naming convention:

dist/CodeLlama-13b-Instruct-hf-q0f16 
├── CodeLlama-13b-Instruct-hf-q0f16-cuda.so
├── mod_cache_before_build.pkl
└── params
    ├── mlc-chat-config.json
    ├── ndarray-cache.json
    └── params_shard_*.bin

Proposed new naming convention:

dist/CodeLlama-13b-Instruct-hf-q0f16 
├── CodeLlama-13b-Instruct-hf-q0f16-cuda-1-shards.so
├── CodeLlama-13b-Instruct-hf-q0f16-cuda-2-shards.so
├── CodeLlama-13b-Instruct-hf-q0f16-cuda-4-shards.so
├── mod_cache_before_build.pkl
└── params
    ├── mlc-chat-config.json
    ├── ndarray-cache.json
    └── params_shard_*.bin

Motivation

This cleans up the file structure of the compiled model and its params, and allows multiple sharding configurations to share one set of weights.
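The proposed convention can be sketched as a small Python helper. This is only an illustration of the naming scheme described above; `shard_library_name` is a hypothetical function, not part of MLC LLM's actual API.

```python
# Hypothetical sketch of the proposed naming convention: derive the
# per-shard-count library name from the model, quantization, and target.
def shard_library_name(model: str, quantization: str,
                       target: str, num_shards: int) -> str:
    """Build a library name like
    'CodeLlama-13b-Instruct-hf-q0f16-cuda-2-shards.so'."""
    return f"{model}-{quantization}-{target}-{num_shards}-shards.so"


# Example: the 2-GPU variant from the proposed tree above.
print(shard_library_name("CodeLlama-13b-Instruct-hf", "q0f16", "cuda", 2))
```

Under this scheme, every binary for a given model and quantization lives next to the single shared `params` directory, so switching shard counts only swaps the `.so` file.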

junrushao commented 1 year ago

CC: @tqchen @MasterJH5574

junrushao commented 1 year ago

I believe the ultimate goal is to allow users to specify the names they prefer, just like gcc:

gcc main.c -o my_name

MLC LLM compilation should allow such a thing as well:

python3 -m mlc_llm.compile ...  -o xxx-cuda.so

tqchen commented 1 year ago

Once we support sharding beyond 1 GPU, it might indeed make sense to include the shard number in the name by default.

junrushao commented 11 months ago

The new compilation pipeline requires users to specify the name of TVM compiled .so binary explicitly:

python -m mlc_chat.cli.compile --config llama2_70b --quantization q4f16_1 -o /tmp/1.so
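Combining the two ideas in this thread, a CLI could honor an explicit `-o` when given and otherwise fall back to a default name that includes the shard count. The sketch below is purely illustrative; `resolve_output` and its parameters are hypothetical, not MLC LLM's actual implementation.

```python
# Hypothetical sketch: gcc-style output resolution. An explicit -o path
# wins; otherwise build a default name, appending the shard count when
# sharding across more than one GPU (per tqchen's suggestion above).
from pathlib import Path
from typing import Optional


def resolve_output(explicit: Optional[str], model: str, quantization: str,
                   target: str, num_shards: int) -> Path:
    if explicit is not None:
        # User passed -o, use it verbatim.
        return Path(explicit)
    suffix = f"-{num_shards}-shards" if num_shards > 1 else ""
    model_dir = Path("dist") / f"{model}-{quantization}"
    return model_dir / f"{model}-{quantization}-{target}{suffix}.so"


# Explicit -o, as the new pipeline requires:
print(resolve_output("/tmp/1.so", "llama2_70b", "q4f16_1", "cuda", 1))
# Default fallback with 2-GPU sharding:
print(resolve_output(None, "llama2_70b", "q4f16_1", "cuda", 2))
```

This keeps the single-shard default name unchanged while disambiguating multi-shard builds that share one `params` directory.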