mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Question] Proper way to use multiple GPUs #2562

Closed: 0xLienid closed this issue 1 week ago

0xLienid commented 2 months ago

❓ General Questions

What is the proper way to actually utilize multiple GPUs? When I generate the config, compile, and load the MLCEngine with multiple tensor shards, it still errors out if the model size is larger than a single GPU's memory. Also, if I check nvidia-smi, only one GPU is really being utilized.

For example, this was run with 4 tensor shards:

(screenshot: nvidia-smi output)

MasterJH5574 commented 2 months ago

Hi @0xLienid, thanks for the question. There are two ways to get things right:

  1. Run mlc_llm gen_config with --tensor-parallel-shards 4, then run mlc_llm compile directly.
  2. Run mlc_llm compile with --overrides "tensor_parallel_shards=4".

If you follow either of the two ways above, you don't need to specify tensor_parallel_shards when constructing MLCEngine.

It might be easier for us to triage the issue you encountered if you don't mind sharing the log printed when running mlc_llm compile (or your Python script that calls it).
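
In case it helps, here is a minimal sketch of constructing the engine afterwards. The paths are placeholders for your own artifact locations; tensor_parallel_shards is intentionally not passed, since it is read from the metadata baked in by gen_config/compile.

from mlc_llm import MLCEngine

# Sketch only: replace the placeholder paths with your own weight directory
# and compiled library. The shard count comes from the compiled model metadata.
engine = MLCEngine(
    model="./dist/MyModel-q4f16_1-MLC",              # dir with mlc-chat-config.json + weights
    model_lib="./dist/libs/MyModel-q4f16_1-cuda.so", # output of mlc_llm compile
)
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
engine.terminate()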

0xLienid commented 2 months ago

I'll rerun to get the logs in a few, but these are the config generation and compilation calls that led to this GPU usage. For both, parallel_shards is set to 4. When the model is loaded, it also says it's using the multi-GPU loader.

from mlc_llm.interface.gen_config import gen_config

...

gen_config(
    config=config,
    model=model,
    quantization=quantization_obj,
    conv_template="LM",
    context_window_size=None,
    sliding_window_size=None,
    prefill_chunk_size=None,
    attention_sink_size=None,
    tensor_parallel_shards=parallel_shards,
    max_batch_size=1,
    output=quantization_dir,
)

from mlc_llm.interface.compile import compile as compile_mlc

...

compile_mlc(
    config=config_file_compile,
    quantization=quantization_obj,
    model_type=model,
    target=target,
    opt=OptimizationFlags.from_str("O2"),
    build_func=build_func,
    system_lib_prefix="auto",
    output=SAVE_DIR / model_name / quantization / "compilation.so",
    overrides=ModelConfigOverride(
        context_window_size=None,
        sliding_window_size=None,
        prefill_chunk_size=None,
        attention_sink_size=None,
        max_batch_size=1,
        tensor_parallel_shards=parallel_shards,
    ),
    debug_dump=None,
)

MasterJH5574 commented 2 months ago

Just want to share some more pointers that may be helpful: in the compile log, the model metadata is printed out:

...
[2024-06-10 11:27:27] INFO compile.py:145: Exporting the model to TVM Unity compiler
[2024-06-10 11:27:33] INFO compile.py:151: Running optimizations using TVM Unity
[2024-06-10 11:27:33] INFO compile.py:171: Registering metadata: {'model_type': 'qwen2',
'quantization': 'q4f16_1', 'context_window_size': 32768, 'sliding_window_size': -1,
'attention_sink_size': -1, 'prefill_chunk_size': 2048, 'tensor_parallel_shards': 8,     <<<<<<<
'kv_state_kind': 'kv_cache', 'max_batch_size': 80}
[2024-06-10 11:27:36] INFO pipeline.py:52: Running TVM Relax graph-level optimizations
...

The expectation is to see 4 here.
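
If it helps, gen_config also records this value in the generated mlc-chat-config.json, so a quick check from Python looks roughly like this (the path is a placeholder, and depending on the version the field may sit at the top level or under "model_config"):

import json

# Placeholder path to the output directory produced by gen_config.
with open("./dist/MyModel-q4f16_1-MLC/mlc-chat-config.json") as f:
    cfg = json.load(f)

# Expect this to print 4 for your setup.
print(cfg.get("tensor_parallel_shards",
              cfg.get("model_config", {}).get("tensor_parallel_shards")))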

MasterJH5574 commented 2 months ago

Another thing: if your local MLC was installed before Jun 7, you may need to upgrade to the latest nightly, as we fixed some related logic in #2533.

0xLienid commented 2 months ago

I will double check, but this should have been built from source as of yesterday

MasterJH5574 commented 2 months ago

When the model is loaded, it also says it's using the multi-GPU loader.

If it says this and prints the following log:

[xx:xx:xx] /workspace/mlc-llm/cpp/loader/multi_gpu_loader.cc:140: [Worker #0] Loading model to device: cuda:0
[xx:xx:xx] /workspace/mlc-llm/cpp/loader/multi_gpu_loader.cc:140: [Worker #1] Loading model to device: cuda:1
[xx:xx:xx] /workspace/mlc-llm/cpp/loader/multi_gpu_loader.cc:140: [Worker #2] Loading model to device: cuda:2
[xx:xx:xx] /workspace/mlc-llm/cpp/loader/multi_gpu_loader.cc:140: [Worker #3] Loading model to device: cuda:3

then there is nothing wrong with gen_config and compile, and we might want to check the model size instead. The log from loading the model would be much appreciated.

MasterJH5574 commented 2 months ago

I will double check, but this should have been built from source as of yesterday

Got it, then it should be fine as #2533 is already included.

0xLienid commented 2 months ago

When the model is loaded, it also says it's using the multi-GPU loader.

If it says this and prints the following log:

[xx:xx:xx] /workspace/mlc-llm/cpp/loader/multi_gpu_loader.cc:140: [Worker #0] Loading model to device: cuda:0
[xx:xx:xx] /workspace/mlc-llm/cpp/loader/multi_gpu_loader.cc:140: [Worker #1] Loading model to device: cuda:1

Yes, it says this. The model is ~70GB and I have 4 A100s, so it should fit comfortably when sharded across them. For now I've worked around this by increasing the GPU memory share so that it all fits within one GPU, but obviously that's less than ideal.
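
As a rough back-of-the-envelope check of the numbers above (a sketch only; it assumes 80 GB A100s and ignores KV cache and activation sizing):

# Illustrative arithmetic only; assumes 80 GB A100s (40 GB variants also exist).
model_size_gb = 70                 # total weight size mentioned above
num_gpus = 4
gpu_mem_gb = 80

weights_per_gpu = model_size_gb / num_gpus   # ~17.5 GB of weights per GPU
headroom = gpu_mem_gb - weights_per_gpu      # left for KV cache, activations, CUDA context
print(f"~{weights_per_gpu:.1f} GB weights per GPU, ~{headroom:.1f} GB headroom per GPU")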

OK, once my evals are done running I'll rerun the config and compile to get the logs.

0xLienid commented 2 months ago

@MasterJH5574 this is the log: (screenshot of the compile log)

MasterJH5574 commented 2 months ago

Thanks for sharing! It actually looks pretty normal. What does the log look like when loading parameters? How far does the progress bar get?