Closed denise-k closed 5 months ago
CC: @tqchen @MasterJH5574
I believe the ultimate goal is to allow users to specify the name they prefer, just like gcc:

```
gcc main.c -o my_name
```

MLC LLM compilation should allow the same:

```
python3 -m mlc_llm.compile ... -o xxx-cuda.so
```

When we start sharding beyond 1 GPU, it might indeed make sense to include the shard count in the name by default.
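One possible default-naming scheme along these lines (a hypothetical sketch, not actual MLC LLM code; the function name and `tp` suffix are illustrative assumptions) would fold the shard count into the output filename only when sharding is enabled:

```python
def default_output_name(model: str, target: str, num_shards: int = 1) -> str:
    """Hypothetical helper: build a default .so name from model, target,
    and shard count. Single-shard builds omit the shard suffix."""
    if num_shards > 1:
        return f"{model}-{target}-tp{num_shards}.so"
    return f"{model}-{target}.so"

print(default_output_name("llama2_70b", "cuda", 4))  # llama2_70b-cuda-tp4.so
print(default_output_name("llama2_70b", "cuda"))     # llama2_70b-cuda.so
```

An explicit `-o` flag would simply override this default, mirroring the gcc behavior above.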
The new compilation pipeline requires users to specify the name of the TVM-compiled `.so` binary explicitly:

```
python -m mlc_chat.cli.compile --config llama2_70b --quantization q4f16_1 -o /tmp/1.so
```
cc @junrushao
🚀 Feature
The latest version of TVM enables different multi-GPU sharding configurations to reuse the same param weight shards. So, if you want to compile a model via MLC that has options for 1-GPU, 2-GPU, and 4-GPU shards (for example), you could reuse the same `params*.bin` files and just switch the compiled `.so` binary.

Currently, the naming convention for the `.so` binary is simply the model name. I am proposing to update the naming convention to include `num_shards` in the name of the compiled binary, and to keep all variants within one model directory, as shown below:

Current naming convention:

Proposed new naming convention:
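As an illustration of such a layout (the exact file names here are hypothetical assumptions, not the convention being proposed), the shared params could sit next to one binary per shard configuration:

```
dist/llama2_70b-q4f16_1/
├── params_shard_0.bin                 # weight shards shared by all configs
├── ...
├── llama2_70b-q4f16_1-cuda-tp1.so    # 1-GPU build
├── llama2_70b-q4f16_1-cuda-tp2.so    # 2-GPU build
└── llama2_70b-q4f16_1-cuda-tp4.so    # 4-GPU build
```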
Motivation
This cleans up the file structure of the compiled model + params and allows for better reuse.