Closed leiwen83 closed 1 month ago
I tried the qwen2 multi-GPU support patch https://github.com/mlc-ai/mlc-llm/pull/1985 with the latest code: https://github.com/mlc-ai/mlc-llm/commit/ae97b8d3763cd9ef9179140027d206622d185d21
But I got the error below when compiling the model.
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/__main__.py", line 47, in <module>
main()
File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/__main__.py", line 24, in main
cli.main(sys.argv[2:])
File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/cli/compile.py", line 128, in main
compile(
File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/interface/compile.py", line 232, in compile
_compile(args, model_config)
File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/interface/compile.py", line 178, in _compile
args.build_func(
File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/support/auto_target.py", line 242, in build
relax.build(
File "/data/tmp/test/llm/mlc-llm/3rdparty/tvm/python/tvm/relax/vm_build.py", line 341, in build
return _vmlink(
File "/data/tmp/test/llm/mlc-llm/3rdparty/tvm/python/tvm/relax/vm_build.py", line 247, in _vmlink
lib = tvm.build(
File "/data/tmp/test/llm/mlc-llm/3rdparty/tvm/python/tvm/driver/build_module.py", line 297, in build
rt_mod_host = _driver_ffi.tir_to_runtime(annotated_mods, target_host)
File "/data/tmp/test/llm/mlc-llm/3rdparty/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 239, in __call__
raise_last_ffi_error()
File "/data/tmp/test/llm/mlc-llm/3rdparty/tvm/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
tvm.error.InternalError: Traceback (most recent call last):
11: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<tvm::runtime::Module (tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)>::AssignTypedLambda<tvm::__mk_TVM23::{lambda(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)#1}>(tvm::__mk_TVM23::{lambda(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target)#1}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, tvm::runtime::TVMRetValue)
10: tvm::TIRToRuntime(tvm::runtime::Map<tvm::Target, tvm::IRModule, void, void> const&, tvm::Target const&)
9: tvm::SplitMixedModule(tvm::IRModule, tvm::Target const&, tvm::Target const&)
8: tvm::ApplyPasses(tvm::IRModule, tvm::transform::Sequential)
7: tvm::transform::Pass::operator()(tvm::IRModule) const
6: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
5: tvm::transform::SequentialNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
4: tvm::transform::Pass::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
3: tvm::transform::ModulePassNode::operator()(tvm::IRModule, tvm::transform::PassContext const&) const
2: _ZN3tvm7runtime13PackedFuncObj
1: tvm::runtime::TypedPackedFunc<tvm::IRModule (tvm::IRModule, tvm::transform::PassContext)>::AssignTypedLambda<tvm::tir::transform::MakePackedAPI()::{lambda(tvm::IRModule, tvm::transform::PassContext)#1}>(tvm::tir::transform::MakePackedAPI()::{lambda(tvm::IRModule, tvm::transform::PassContext)#1})::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}::operator()(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*) const
0: tvm::tir::MakePackedAPI(tvm::tir::PrimFunc)
File "/data/tmp/test/llm/mlc-llm/3rdparty/tvm/src/tir/transforms/make_packed_api.cc", line 374
After reverting to a0484bd53854a508283be47d62b704b2c737259d, with https://github.com/mlc-ai/mlc-llm/pull/1985 applied, I still get a CUDA OOM for 72B int4 with 4 GPUs.
@tlopex any idea?
@leiwen83 Are you running without quantization? 72B model doesn't seem to fit four 3090 with each having 24GB vRAM.
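For reference, a rough back-of-the-envelope check of the point above. This is only a sketch under stated assumptions: the 72e9 parameter count is nominal, and the 4.5 bits per weight used for q4f16_1 assumes 4-bit values plus one float16 scale per group of 32 weights. KV cache, activations and temporary buffers are ignored.

```python
GIB = 1024 ** 3

def weight_bytes(num_params: float, bits_per_param: float) -> float:
    """Approximate size of the weights alone, in bytes."""
    return num_params * bits_per_param / 8

NUM_PARAMS = 72e9              # nominal "72B"; the exact count is an assumption
TOTAL_VRAM = 4 * 24 * GIB      # four GPUs with 24 GB each, as in the thread

fp16_bytes = weight_bytes(NUM_PARAMS, 16)
# Assumed q4f16_1 cost: 4-bit weight + one float16 scale shared by 32 weights.
q4_bytes = weight_bytes(NUM_PARAMS, 4 + 16 / 32)

print(f"float16 weights : {fp16_bytes / GIB:.1f} GiB")
print(f"q4f16_1 weights : {q4_bytes / GIB:.1f} GiB")
print(f"total VRAM      : {TOTAL_VRAM / GIB:.1f} GiB")
print("fits unquantized:", fp16_bytes < TOTAL_VRAM)
print("fits quantized  :", q4_bytes < TOTAL_VRAM)
```

Under these assumptions the float16 weights alone exceed the combined memory of the four cards, while the 4-bit weights leave room to spare, provided they are actually sharded across the GPUs.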
I am running with quantization, starting from the HF model. Here are the commands I use for model generation:
python3 -m mlc_llm convert_weight --quantization q4f16_1 /Qwen1.5-72B/ --output Qwen1.5-72B-Chat-GPTQ-Int4_MLC
python3 -m mlc_llm gen_config --conv-template chatml --tensor-parallel-shards 4 --quantization q4f16_1 //Qwen1.5-72B/ --output Qwen1.5-72B-Chat-GPTQ-Int4_MLC/
python3 -m mlc_llm compile Qwen1.5-72B-Chat-GPTQ-Int4_MLC/ -o Qwen1.5-72B-Chat-q4f16_1-MLC.so
I found a setting named "MLC_INTERNAL_PRESHARD_NUM". Do I need to set it when converting the model and before serving?
No, MLC_INTERNAL_PRESHARD_NUM is not related.
I just saw that Qwen TP support was enabled only yesterday in https://github.com/mlc-ai/mlc-llm/pull/1985. Could you check out the latest main branch and try gen_config and compile again? convert_weight is not needed the second time.
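The rerun described above (gen_config and compile only, skipping convert_weight) could be scripted roughly as follows. This is a sketch that reuses the exact flags and paths from the commands posted earlier in the thread. The helper names `rebuild_commands` and `run` are made up for illustration, and the script defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess

# Paths copied from the commands posted earlier in the thread.
MODEL_SRC = "/Qwen1.5-72B/"
MLC_DIR = "Qwen1.5-72B-Chat-GPTQ-Int4_MLC/"
MODEL_LIB = "Qwen1.5-72B-Chat-q4f16_1-MLC.so"

def rebuild_commands():
    """Return the two steps to repeat; convert_weight is deliberately left out."""
    gen_config = [
        "python3", "-m", "mlc_llm", "gen_config",
        "--conv-template", "chatml",
        "--tensor-parallel-shards", "4",
        "--quantization", "q4f16_1",
        MODEL_SRC, "--output", MLC_DIR,
    ]
    compile_model = ["python3", "-m", "mlc_llm", "compile", MLC_DIR, "-o", MODEL_LIB]
    return [gen_config, compile_model]

def run(dry_run: bool = True) -> None:
    """Print each command, and execute it only when dry_run is False."""
    for cmd in rebuild_commands():
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run()
```

Calling `run(dry_run=False)` would execute both steps in order and stop at the first failure.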
Still get error...
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/serve/server/__main__.py", line 56, in <module>
args: argparse.Namespace = parse_args_and_initialize()
File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/serve/server/__main__.py", line 46, in parse_args_and_initialize
engine = async_engine.AsyncThreadedEngine(
File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/serve/async_engine.py", line 153, in __init__
kv_cache_config.max_total_sequence_length = _estimate_max_total_sequence_length(
File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/serve/engine.py", line 228, in _estimate_max_total_sequence_length
assert max_total_sequence_length > 0, (
AssertionError: Cannot estimate KV cache capacity. The model weight size 11219714048.0 may be larger than GPU memory size 25447170048
@leiwen83 Sorry for the late reply. Maybe you can try again with the latest main branch, since there was an update today.
After pulling the latest commit, it reports an error when converting the model.
# python3 -m mlc_llm convert_weight --quantization q4f16_1 Qwen1.5-72B-Chat --output Qwen1.5-72B-Chat_tvm
[2024-03-28 14:47:03] INFO auto_config.py:115: Found model configuration: Qwen1.5-72B-Chat/config.json
[2024-03-28 14:47:11] INFO auto_device.py:76: Found device: cuda:0
[2024-03-28 14:47:11] INFO auto_device.py:76: Found device: cuda:1
[2024-03-28 14:47:11] INFO auto_device.py:76: Found device: cuda:2
[2024-03-28 14:47:11] INFO auto_device.py:76: Found device: cuda:3
[2024-03-28 14:47:11] INFO auto_device.py:76: Found device: cuda:4
[2024-03-28 14:47:11] INFO auto_device.py:76: Found device: cuda:5
[2024-03-28 14:47:11] INFO auto_device.py:76: Found device: cuda:6
[2024-03-28 14:47:11] INFO auto_device.py:76: Found device: cuda:7
[2024-03-28 14:47:12] INFO auto_device.py:85: Not found device: rocm:0
[2024-03-28 14:47:13] INFO auto_device.py:85: Not found device: metal:0
[2024-03-28 14:47:14] INFO auto_device.py:85: Not found device: vulkan:0
[2024-03-28 14:47:15] INFO auto_device.py:85: Not found device: opencl:0
[2024-03-28 14:47:15] INFO auto_device.py:33: Using device: cuda:0
[2024-03-28 14:47:15] INFO auto_weight.py:70: Finding weights in: Qwen1.5-72B-Chat
[2024-03-28 14:47:15] INFO auto_weight.py:136: Not found Huggingface PyTorch
[2024-03-28 14:47:15] INFO auto_weight.py:143: Found source weight format: huggingface-safetensor. Source configuration: Qwen1.5-72B-Chat/model.safetensors.index.json
[2024-03-28 14:47:15] INFO auto_weight.py:106: Using source weight configuration: Qwen1.5-72B-Chat/model.safetensors.index.json. Use `--source` to override.
[2024-03-28 14:47:15] INFO auto_weight.py:110: Using source weight format: huggingface-safetensor. Use `--source-format` to override.
[2024-03-28 14:47:15] INFO auto_config.py:153: Found model type: qwen2. Use `--model-type` to override.
Weight conversion with arguments:
--config Qwen1.5-72B-Chat/config.json
--quantization GroupQuantize(name='q4f16_1', kind='group-quant', group_size=32, quantize_dtype='int4', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=8, num_storage_per_group=4, max_int_value=7)
--model-type qwen2
--device cuda:0
--source Qwen1.5-72B-Chat/model.safetensors.index.json
--source-format huggingface-safetensor
--output Qwen1.5-72B-Chat_tvm
[2024-03-28 14:47:15] INFO qwen2_model.py:48: context_window_size not found in config.json. Falling back to max_position_embeddings (32768)
[2024-03-28 14:47:15] INFO qwen2_model.py:65: prefill_chunk_size defaults to context_window_size (32768)
Start storing to cache Qwen1.5-72B-Chat_tvm
[2024-03-28 14:47:35] INFO huggingface_loader.py:184: Loading HF parameters from: Qwen1.5-72B-Chat/model-00038-of-00038.safetensors
[2024-03-28 14:47:40] INFO group_quantization.py:234: Compiling quantize function for key: ((152064, 8192), float16, cuda, axis=1, output_transpose=False)
[2024-03-28 14:47:41] INFO huggingface_loader.py:166: [Quantized] Parameter: "lm_head.q_weight", shape: (152064, 1024), dtype: uint32
0%| | 0/563 [00:06<?, ?it/s]
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/data/mlc-llm/python/mlc_llm/__main__.py", line 47, in <module>
main()
File "/data/mlc-llm/python/mlc_llm/__main__.py", line 28, in main
cli.main(sys.argv[2:])
File "/data/mlc-llm/python/mlc_llm/cli/convert_weight.py", line 87, in main
convert_weight(
File "/data/mlc-llm/python/mlc_llm/interface/convert_weight.py", line 182, in convert_weight
_convert_args(args)
File "/data/mlc-llm/python/mlc_llm/interface/convert_weight.py", line 146, in _convert_args
tvmjs.dump_ndarray_cache(
File "/data/mlc-llm/3rdparty/tvm/python/tvm/contrib/tvmjs.py", line 210, in dump_ndarray_cache
for k, origin_v in param_generator:
File "/data/mlc-llm/python/mlc_llm/interface/convert_weight.py", line 130, in _param_generator
for name, param in loader.load(device=args.device, preshard_funcs=preshard_funcs):
File "/data/mlc-llm/python/mlc_llm/loader/huggingface_loader.py", line 122, in load
if name in preshard_funcs:
TypeError: argument of type 'NoneType' is not iterable
If I use the previously converted weights but run gen_config/compile again, I still get a similar error:
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/serve/server/__main__.py", line 56, in <module>
args: argparse.Namespace = parse_args_and_initialize()
File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/serve/server/__main__.py", line 46, in parse_args_and_initialize
engine = async_engine.AsyncThreadedEngine(
File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/serve/async_engine.py", line 153, in __init__
kv_cache_config.max_total_sequence_length = _estimate_max_total_sequence_length(
File "/data/tmp/test/llm/mlc-llm/python/mlc_llm/serve/engine.py", line 228, in _estimate_max_total_sequence_length
assert max_total_sequence_length > 0, (
AssertionError: Cannot estimate KV cache capacity. The model weight size 11219714048.0 may be larger than GPU memory size 25447170048
@leiwen83: I solved this error by adding the following line of code above line 115 in huggingface_loader.py:
preshard_funcs = {}
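The workaround above unconditionally resets the argument. A slightly narrower variant is to default it only when it is None. The sketch below illustrates that pattern on a stand-in function. It is not the actual code of huggingface_loader.py, and the `load` signature and the shape of the preshard callables are assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional

def load(
    names: Iterable[str],
    preshard_funcs: Optional[Dict[str, Callable[[str], List[str]]]] = None,
) -> Iterator[str]:
    """Hypothetical stand-in for a loader that optionally pre-shards parameters."""
    # Guard: `name in None` raises
    # "TypeError: argument of type 'NoneType' is not iterable".
    if preshard_funcs is None:
        preshard_funcs = {}
    for name in names:
        if name in preshard_funcs:
            # Expand one parameter into its shards.
            yield from preshard_funcs[name](name)
        else:
            yield name

print(list(load(["lm_head", "embed"])))
print(list(load(["lm_head", "embed"], {"lm_head": lambda n: [f"{n}_0", f"{n}_1"]})))
```

Unlike the one-line patch, this keeps any pre-shard functions the caller did pass in.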
Yep, with this, the "argument of type 'NoneType' is not iterable" error is fixed.
But the error below still exists.
AssertionError: Cannot estimate KV cache capacity. The model weight size 11219714048.0 may be larger than GPU memory size 25447170048
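The two byte counts in the assertion are easier to compare in GiB. The sketch below only converts the numbers quoted in the message. Reading the weight figure as a per-shard size is an assumption, not something the message states.

```python
GIB = 1024 ** 3

# Values copied verbatim from the AssertionError message.
weight_size_bytes = 11219714048
gpu_memory_bytes = 25447170048

weight_gib = weight_size_bytes / GIB
gpu_gib = gpu_memory_bytes / GIB
headroom_bytes = gpu_memory_bytes - weight_size_bytes

print(f"model weight size : {weight_gib:.2f} GiB")
print(f"GPU memory size   : {gpu_gib:.2f} GiB")
print(f"difference        : {headroom_bytes / GIB:.2f} GiB")

# Assumption: if the weight figure is per shard, four shards would total:
print(f"4 x weight size   : {4 * weight_gib:.2f} GiB")
```

The reported weight size is smaller than the reported GPU memory, so the weights alone do not explain the failure. Whatever else the estimate subtracts before sizing the KV cache (the message does not say) appears to use up the remainder.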
Hi folks, sorry for the delayed response here. Last week we sent a patch that can fix this issue: https://github.com/mlc-ai/mlc-llm/pull/2278. Note that you will need to rerun mlc_llm gen_config so that the config file is updated. We would appreciate it if you could try again.
This should be fixed
It seems to me that, for now, MLC is trying to load all the weights onto a single GPU?
After convert_weight/gen_config/compile, it reports an error when ready to serve:
If I set MLC_GPU_SIZE_BYTES=103079215104, which is the total memory of the 4 GPUs, it reports an error when loading the weights:
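As a quick check of where that value comes from, 103079215104 is exactly four times 24 GiB. The sketch below just verifies the arithmetic and compares the per-GPU share with the GPU memory size quoted in the assertion message. It makes no claim about how MLC_GPU_SIZE_BYTES is interpreted internally.

```python
GIB = 1024 ** 3

mlc_gpu_size_bytes = 103079215104   # value used in the thread
num_gpus = 4
reported_gpu_bytes = 25447170048    # "GPU memory size" from the assertion message

per_gpu_bytes = mlc_gpu_size_bytes // num_gpus

print(f"total   : {mlc_gpu_size_bytes / GIB:.0f} GiB")
print(f"per GPU : {per_gpu_bytes / GIB:.0f} GiB")
print(f"reported: {reported_gpu_bytes / GIB:.2f} GiB")
print("override exceeds one GPU:", mlc_gpu_size_bytes > reported_gpu_bytes)
```

The override is roughly four times what a single card reports, so if the value is applied per device rather than to the sum of all devices, it would overstate the memory available on each GPU.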