mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Question] how to serve 72B Qwen1.5 into 4x3090 gpu? #1999

Closed leiwen83 closed 1 month ago

leiwen83 commented 6 months ago

It seems to me that, for now, MLC is trying to load all of the weights onto a single GPU card?

After convert_weight/gen_config/compile, it reports an error when it is about to serve:

AssertionError: Cannot estimate KV cache capacity. The model weight size 40666677248.0 may be larger than GPU memory size 25447170048

If I set MLC_GPU_SIZE_BYTES=103079215104, which is the total memory across the four GPU cards, it reports an error while loading the weights:

  what():  [15:42:40] /data/tmp/test/llm/mlc-llm/3rdparty/tvm/src/runtime/cuda/cuda_device_api.cc:138: InternalError: Check failed: (e == cudaSuccess || e == cudaErrorCudartUnloading) is false: CUDA: out of memory
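
(For reference, a quick check of the byte counts above, my own arithmetic rather than anything from the logs: 103079215104 is exactly 4 x 24 GiB, while the reported weight size and single-GPU memory work out to roughly 37.9 GiB and 23.7 GiB.)

# Illustrative arithmetic only; the values are the ones quoted above.
gib = 1024 ** 3
print(4 * 24 * gib)        # 103079215104 -- the MLC_GPU_SIZE_BYTES value (four 24 GiB cards)
print(40666677248 / gib)   # ~37.9 GiB of model weights reported by the engine
print(25447170048 / gib)   # ~23.7 GiB usable on a single RTX 3090
# Overriding the reported GPU size does not shard the weights across cards,
# which is consistent with the CUDA out-of-memory error above.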
leiwen83 commented 6 months ago

Tried the qwen2 multi-GPU support patch https://github.com/mlc-ai/mlc-llm/pull/1985 with the latest code: https://github.com/mlc-ai/mlc-llm/commit/ae97b8d3763cd9ef9179140027d206622d185d21

But I got the error below when compiling the model.

  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/__main__.py", line 47, in <module>
    main()
  File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/__main__.py", line 24, in main
    cli.main(sys.argv[2:])
  File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/cli/compile.py", line 128, in main
    compile(
  File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/interface/compile.py", line 232, in compile
    _compile(args, model_config)
  File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/interface/compile.py", line 178, in _compile
    args.build_func(
  File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/support/auto_target.py", line 242, in build
    relax.build(
  File "/data/tmp/test/llm/mlc-llm/3rdparty/tvm/python/tvm/relax/vm_build.py", line 341, in build
    return _vmlink(
  File "/data/tmp/test/llm/mlc-llm/3rdparty/tvm/python/tvm/relax/vm_build.py", line 247, in _vmlink
    lib = tvm.build(
  File "/data/tmp/test/llm/mlc-llm/3rdparty/tvm/python/tvm/driver/build_module.py", line 297, in build
    rt_mod_host = _driver_ffi.tir_to_runtime(annotated_mods, target_host)
  File "/data/tmp/test/llm/mlc-llm/3rdparty/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
    raise_last_ffi_error()
  File "/data/tmp/test/llm/mlc-llm/3rdparty/tvm/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm.error.InternalError: Traceback (most recent call last):
  11: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::Module (tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)>::AssignTypedLambda<tvm::__mk_TVM23::{lambda(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)#1}>(tvm::__mk_TVM23::{lambda(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)#1}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::TVMRetValue)
  10: tvm::TIRToRuntime(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target const&)
  9: tvm::SplitMixedModule(tvm::IRModule, tvm::Target const&, tvm::Target const&)
  8: tvm::ApplyPasses(tvm::IRModule, tvm::transform::Sequential)
  7: tvm::transform::Pass::operator()(tvm::IRModule) const
  6: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  5: tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  4: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  3: tvm::transform::ModulePassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
  2: _ZN3tvm7runtime13PackedFuncObj
  1: tvm::runtime::TypedPackedFunc<tvm::IRModule (tvm::IRModule, tvm::transform::PassContext)>::AssignTypedLambda<tvm::tir::transform::MakePackedAPI()::{lambda(tvm::IRModule, tvm::transform::PassContext)#1}>(tvm::tir::transform::MakePackedAPI()::{lambda(tvm::IRModule, tvm::transform::PassContext)#1})::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*) const
  0: tvm::tir::MakePackedAPI(tvm::tir::PrimFunc)
  File "/data/tmp/test/llm/mlc-llm/3rdparty/tvm/src/tir/transforms/make_packed_api.cc", line 374
leiwen83 commented 6 months ago

After reverting to a0484bd53854a508283be47d62b704b2c737259d, with https://github.com/mlc-ai/mlc-llm/pull/1985 applied, I still get a CUDA OOM for the 72B int4 model on 4 GPUs.

leiwen83 commented 6 months ago

@tlopex any idea?

MasterJH5574 commented 6 months ago

@leiwen83 Are you running without quantization? A 72B model doesn't seem to fit on four 3090s with 24GB of vRAM each.
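
(A rough size estimate, my own arithmetic rather than anything stated in the thread, assuming fp16 = 2 bytes per parameter and q4f16_1 at roughly 4.5 bits per parameter including the per-group fp16 scales:)

# Rough weight-size estimate for a 72B model (illustrative only).
params = 72e9
fp16_bytes = params * 2            # ~144 GB unquantized fp16
q4_bytes   = params * 4.5 / 8      # ~40 GB, close to the 40666677248 reported earlier
vram_total = 4 * 24 * 1024 ** 3    # ~103 GB across four RTX 3090s

print(f"{fp16_bytes/1e9:.0f} GB fp16, {q4_bytes/1e9:.0f} GB q4, {vram_total/1e9:.0f} GB total VRAM")
# fp16 does not fit even across all four cards; q4f16_1 can, but only when the
# weights are actually sharded rather than loaded onto one card.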

leiwen83 commented 6 months ago

@leiwen83 Are you running without quantization? A 72B model doesn't seem to fit on four 3090s with 24GB of vRAM each.

I am running with quantization from the HF model. Here are the commands I use to generate the model:

python3 -m mlc_llm convert_weight --quantization q4f16_1 /Qwen1.5-72B/ --output Qwen1.5-72B-Chat-GPTQ-Int4_MLC
python3 -m mlc_llm gen_config --conv-template chatml  --tensor-parallel-shards 4  --quantization q4f16_1 //Qwen1.5-72B/ --output Qwen1.5-72B-Chat-GPTQ-Int4_MLC/
python3 -m mlc_llm compile Qwen1.5-72B-Chat-GPTQ-Int4_MLC/ -o Qwen1.5-72B-Chat-q4f16_1-MLC.so
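
(One quick sanity check before compiling and serving, as a minimal sketch: it assumes gen_config writes an mlc-chat-config.json with a tensor_parallel_shards field into the output directory used above.)

# Sketch: confirm the generated config actually records 4-way tensor parallelism.
import json

with open("Qwen1.5-72B-Chat-GPTQ-Int4_MLC/mlc-chat-config.json") as f:
    cfg = json.load(f)

print(cfg.get("tensor_parallel_shards"))  # expected: 4
print(cfg.get("quantization"))            # expected: q4f16_1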
leiwen83 commented 6 months ago

I see there is a setting named "MLC_INTERNAL_PRESHARD_NUM". Do I need to set it when converting the model and before serving?

MasterJH5574 commented 6 months ago

No, MLC_INTERNAL_PRESHARD_NUM is not related.

I just saw that Qwen TP support was only enabled yesterday in https://github.com/mlc-ai/mlc-llm/pull/1985. Could you check out the latest main branch and try gen_config and compile again? convert_weight does not need to be rerun.

leiwen83 commented 6 months ago

Still getting an error...

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/serve/server/__main__.py", line 56, in <module>
    args: argparse.Namespace = parse_args_and_initialize()
  File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/serve/server/__main__.py", line 46, in parse_args_and_initialize
    engine = async_engine.AsyncThreadedEngine(
  File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/serve/async_engine.py", line 153, in __init__
    kv_cache_config.max_total_sequence_length = _estimate_max_total_sequence_length(
  File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/serve/engine.py", line 228, in _estimate_max_total_sequence_length
    assert max_total_sequence_length > 0, (
AssertionError: Cannot estimate KV cache capacity. The model weight size 11219714048.0 may be larger than GPU memory size 25447170048
tlopex commented 6 months ago

@leiwen83 Sorry for the late reply. Could you try again with the latest main branch? There was an update today.

leiwen83 commented 6 months ago

After pulling the latest commit, it reports an error when converting the model.

# python3 -m mlc_llm convert_weight --quantization q4f16_1 Qwen1.5-72B-Chat --output Qwen1.5-72B-Chat_tvm
[2024-03-28 14:47:03] INFO auto_config.py:115: Found model configuration: Qwen1.5-72B-Chat/config.json
[2024-03-28 14:47:11] INFO auto_device.py:76: Found device: cuda:0
[2024-03-28 14:47:11] INFO auto_device.py:76: Found device: cuda:1
[2024-03-28 14:47:11] INFO auto_device.py:76: Found device: cuda:2
[2024-03-28 14:47:11] INFO auto_device.py:76: Found device: cuda:3
[2024-03-28 14:47:11] INFO auto_device.py:76: Found device: cuda:4
[2024-03-28 14:47:11] INFO auto_device.py:76: Found device: cuda:5
[2024-03-28 14:47:11] INFO auto_device.py:76: Found device: cuda:6
[2024-03-28 14:47:11] INFO auto_device.py:76: Found device: cuda:7
[2024-03-28 14:47:12] INFO auto_device.py:85: Not found device: rocm:0
[2024-03-28 14:47:13] INFO auto_device.py:85: Not found device: metal:0
[2024-03-28 14:47:14] INFO auto_device.py:85: Not found device: vulkan:0
[2024-03-28 14:47:15] INFO auto_device.py:85: Not found device: opencl:0
[2024-03-28 14:47:15] INFO auto_device.py:33: Using device: cuda:0
[2024-03-28 14:47:15] INFO auto_weight.py:70: Finding weights in: Qwen1.5-72B-Chat
[2024-03-28 14:47:15] INFO auto_weight.py:136: Not found Huggingface PyTorch
[2024-03-28 14:47:15] INFO auto_weight.py:143: Found source weight format: huggingface-safetensor. Source configuration: Qwen1.5-72B-Chat/model.safetensors.index.json
[2024-03-28 14:47:15] INFO auto_weight.py:106: Using source weight configuration: Qwen1.5-72B-Chat/model.safetensors.index.json. Use `--source` to override.
[2024-03-28 14:47:15] INFO auto_weight.py:110: Using source weight format: huggingface-safetensor. Use `--source-format` to override.
[2024-03-28 14:47:15] INFO auto_config.py:153: Found model type: qwen2. Use `--model-type` to override.
Weight conversion with arguments:
  --config          Qwen1.5-72B-Chat/config.json
  --quantization    GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7)
  --model-type      qwen2
  --device          cuda:0
  --source          Qwen1.5-72B-Chat/model.safetensors.index.json
  --source-format   huggingface-safetensor
  --output          Qwen1.5-72B-Chat_tvm
[2024-03-28 14:47:15] INFO qwen2_model.py:48: context_window_size not found in config.json. Falling back to max_position_embeddings (32768)
[2024-03-28 14:47:15] INFO qwen2_model.py:65: prefill_chunk_size defaults to context_window_size (32768)
Start storing to cache Qwen1.5-72B-Chat_tvm
[2024-03-28 14:47:35] INFO huggingface_loader.py:184: Loading HF parameters from: Qwen1.5-72B-Chat/model-00038-of-00038.safetensors
[2024-03-28 14:47:40] INFO group_quantization.py:234: Compiling quantize function for key: ((152064, 8192), float16, cuda, axis=1, output_transpose=False)
[2024-03-28 14:47:41] INFO huggingface_loader.py:166: [Quantized] Parameter: "lm_head.q_weight", shape: (152064, 1024), dtype: uint32
  0%|                                                                                                                                                                                      | 0/563 [00:06<?, ?it/s]
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/mlc-llm/python/mlc_llm/__main__.py", line 47, in <module>
    main()
  File "/data/mlc-llm/python/mlc_llm/__main__.py", line 28, in main
    cli.main(sys.argv[2:])
  File "/data/mlc-llm/python/mlc_llm/cli/convert_weight.py", line 87, in main
    convert_weight(
  File "/data/mlc-llm/python/mlc_llm/interface/convert_weight.py", line 182, in convert_weight
    _convert_args(args)
  File "/data/mlc-llm/python/mlc_llm/interface/convert_weight.py", line 146, in _convert_args
    tvmjs.dump_ndarray_cache(
  File "/data/mlc-llm/3rdparty/tvm/python/tvm/contrib/tvmjs.py", line 210, in dump_ndarray_cache
    for k, origin_v in param_generator:
  File "/data/mlc-llm/python/mlc_llm/interface/convert_weight.py", line 130, in _param_generator
    for name, param in loader.load(device=args.device, preshard_funcs=preshard_funcs):
  File "/data/mlc-llm/python/mlc_llm/loader/huggingface_loader.py", line 122, in load
    if name in preshard_funcs:
TypeError: argument of type 'NoneType' is not iterable

If I reuse the previously converted weights but run gen_config/compile again, I still get a similar error:

  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/serve/server/__main__.py", line 56, in <module>
    args: argparse.Namespace = parse_args_and_initialize()
  File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/serve/server/__main__.py", line 46, in parse_args_and_initialize
    engine = async_engine.AsyncThreadedEngine(
  File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/serve/async_engine.py", line 153, in __init__
    kv_cache_config.max_total_sequence_length = _estimate_max_total_sequence_length(
  File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/serve/engine.py", line 228, in _estimate_max_total_sequence_length
    assert max_total_sequence_length > 0, (
AssertionError: Cannot estimate KV cache capacity. The model weight size 11219714048.0 may be larger than GPU memory size 25447170048
NSTiwari commented 6 months ago

@leiwen83: I solved this error by adding the following line of code above line 115 in huggingface_loader.py: preshard_funcs = {}
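
(The failure mode and the shape of the workaround, as a simplified, hypothetical sketch rather than the actual huggingface_loader.py code:)

# Simplified illustration of the TypeError and the workaround (hypothetical names).
def load_params(names, preshard_funcs=None):
    preshard_funcs = preshard_funcs or {}   # the added line: default None to an empty dict
    for name in names:
        if name in preshard_funcs:          # with preshard_funcs=None this raises
            pass                            #   TypeError: argument of type 'NoneType' is not iterable
        yield name

for name in load_params(["lm_head.q_weight"]):
    print(name)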

leiwen83 commented 6 months ago

@leiwen83: I solved this error by adding the following line of code above line 115 in huggingface_loader.py: preshard_funcs = {}

Yep, with this change the "argument of type 'NoneType' is not iterable" error is fixed.

But the error below still exists:

AssertionError: Cannot estimate KV cache capacity. The model weight size 11219714048.0 may be larger than GPU memory size 25447170048
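
(Converting the two byte counts in this assertion, just as a sanity check: the weight shard alone is well under the per-GPU budget, so the failing estimate presumably also has to account for KV cache and temporary buffers.)

# Unit conversion of the numbers in the assertion above (illustrative only).
gib = 1024 ** 3
print(11_219_714_048 / gib)   # ~10.4 GiB of weights per GPU
print(25_447_170_048 / gib)   # ~23.7 GiB reported GPU memory on one 3090
# The weights fit on their own, so the estimator is presumably subtracting
# additional memory (e.g. temporary workspace for the 32768-token prefill
# chunk noted in the conversion log) before sizing the KV cache.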

MasterJH5574 commented 4 months ago

Hi folks, sorry for the delayed response here. Last week we put up a patch that should fix this issue: https://github.com/mlc-ai/mlc-llm/pull/2278. Note that you will need to rerun mlc_llm gen_config so that the config file is updated. We would appreciate it if you could try again.

tqchen commented 1 month ago

This should be fixed now.