mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Question] Speculative Decoding Mode #2710

Closed bethalianovike closed 3 months ago

bethalianovike commented 3 months ago

❓ General Questions

How do I enable the eagle and medusa modes for an LLM? I tried running the "convert_weight", "gen_config", and "compile" steps of MLC-LLM with --model-type "eagle" or "medusa" added on the command line, but the convert_weight step fails with the error messages below. Could someone please give me some tips on how to run speculative decoding with MLC-LLM? Thank you in advance!

For "eagle" mode:

File "/home/mlc-llm/python/mlc_llm/interface/convert_weight.py", line 122, in _param_generator
    loader = LOADER[args.source_format](
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mlc-llm/python/mlc_llm/loader/huggingface_loader.py", line 99, in __init__
    check_parameter_usage(extern_param_map, set(self.torch_to_path.keys()))
  File "/home/mlc-llm/python/mlc_llm/loader/utils.py", line 33, in check_parameter_usage
    raise ValueError(

For "medusa" mode:

  File "/home/mlc-llm/python/mlc_llm/interface/convert_weight.py", line 59, in _convert_args
    model_config = args.model.config.from_file(args.config)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mlc-llm/python/mlc_llm/support/config.py", line 71, in from_file
    return cls.from_dict(json.load(in_file))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mlc-llm/python/mlc_llm/support/config.py", line 51, in from_dict
    return cls(**fields, kwargs=kwargs)  # type: ignore[call-arg]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: MedusaConfig.__init__() missing 2 required positional arguments: 'medusa_num_heads' and 'medusa_num_layers'
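
My guess is that --model-type medusa expects a Medusa head checkpoint whose config.json actually defines medusa_num_heads and medusa_num_layers, rather than a base-model config. Something along these lines is what I would try, although the repo name, paths, and conv template here are only my guesses:

mlc_llm convert_weight ./dist/medusa-vicuna-7b-v1.3 --quantization q4f16_1 --model-type medusa -o dist/medusa-vicuna-7b-v1.3-q4f16_1
mlc_llm gen_config ./dist/medusa-vicuna-7b-v1.3 --quantization q4f16_1 --model-type medusa --conv-template vicuna_v1.1 -o dist/medusa-vicuna-7b-v1.3-q4f16_1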
sunzj commented 3 months ago

As for eagle, it seems you need the EAGLE draft model weights from https://huggingface.co/yuhuili/EAGLE-llama2-chat-7B/tree/main, then run something like:

mlc_llm convert_weight ./dist/EAGLE-llama2-chat-7B --quantization q4f16_1 -o dist/EAGLE-llama2-chat-7B-q4f16 --model-type "eagle"
mlc_llm gen_config ./dist/EAGLE-llama2-chat-7B --quantization q4f16_1 -o dist/EAGLE-llama2-chat-7B-q4f16 --model-type eagle --conv-template llama-2
mlc_llm compile ./dist/EAGLE-llama2-chat-7B-q4f16/mlc-chat-config.json --device opencl -o dist/libs/EAGLE-llama2-chat-7B-q4f16.so

However, after generating the library and model, I tried the following command and got no response to my requests. Without speculative mode it works just fine.

mlc_llm serve dist/Llama-2-7b-chat-hf-q4f16_1/params --model-lib dist/libs/Llama-2-7b-chat-hf-q4f16_1.so --mode local --additional-models dist/EAGLE-llama2-chat-7B-q4f16,dist/libs/EAGLE-llama2-chat-7B-q4f16.so --speculative-mode eagle

bethalianovike commented 3 months ago

Thank you @sunzj! Yes, I am also stuck on that step... Have you already tried to run mlc_llm chat on that EAGLE-llama2-chat-7B? When I try, it gives me a tokenizer error message. Do we need to copy the tokenizer files from Llama-2-7b (since https://huggingface.co/yuhuili/EAGLE-llama2-chat-7B/tree/main doesn't include any tokenizer files)?

Error message when running mlc_llm chat:

[2024-08-01 09:15:47] INFO engine_base.py:143: Using library model: ../dist/libs/EAGLE-llama2-chat-7B-q4f16_1-MLC_SLM_gpu_1_cuda.so
[09:15:47] /home/mlc-llm/cpp/tokenizers/tokenizers.cc:202: Warning: Tokenizer info is not detected as tokenizer.json is not found. The default tokenizer info will be used.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/mlc-llm/python/mlc_llm/__main__.py", line 64, in <module>
    main()
  File "/home/mlc-llm/python/mlc_llm/__main__.py", line 45, in main
    cli.main(sys.argv[2:])
  File "/home/mlc-llm/python/mlc_llm/cli/chat.py", line 36, in main
    chat(
  File "/home/mlc-llm/python/mlc_llm/interface/chat.py", line 282, in chat
    JSONFFIEngine(
  File "/home/mlc-llm/python/mlc_llm/json_ffi/engine.py", line 255, in __init__
    self.tokenizer = Tokenizer(model_args[0][0])
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mlc-llm/python/mlc_llm/tokenizers/tokenizers.py", line 64, in __init__
    self.__init_handle_by_constructor__(
  File "/home/mlc-llm/3rdparty/tvm/python/tvm/_ffi/_ctypes/object.py", line 145, in __init_handle_by_constructor__
    handle = __init_by_constructor__(fconstructor, args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mlc-llm/3rdparty/tvm/python/tvm/_ffi/_ctypes/packed_func.py", line 262, in __init_handle_by_constructor__
    raise_last_ffi_error()
  File "/home/mlc-llm/3rdparty/tvm/python/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
  File "/home/mlc-llm/cpp/tokenizers/tokenizers.cc", line 459, in operator()
    return Tokenizer::FromPath(path);
                    ^^^^^^^^^^^^^^^^^^
  File "/home/mlc-llm/cpp/tokenizers/tokenizers.cc", line 191, in mlc::llm::Tokenizer::FromPath(tvm::runtime::String const&, std::optional<mlc::llm::TokenizerInfo>)
    LOG(FATAL) << "Cannot find any tokenizer under: " << _path;
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
tvm._ffi.base.TVMError: Traceback (most recent call last):
  1: operator()
        at /home/mlc-llm/cpp/tokenizers/tokenizers.cc:459
  0: mlc::llm::Tokenizer::FromPath(tvm::runtime::String const&, std::optional<mlc::llm::TokenizerInfo>)
        at /home/mlc-llm/cpp/tokenizers/tokenizers.cc:191
  File "/home/mlc-llm/cpp/tokenizers/tokenizers.cc", line 191
TVMError: Cannot find any tokenizer under: ../dist/EAGLE-llama2-chat-7B-q4f16_1-MLC_SLM_gpu_1
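
If copying is indeed the fix, I assume it would just be the usual Hugging Face tokenizer files from the base chat model dropped into the converted EAGLE directory, something like this (replace /path/to/Llama-2-7b-chat-hf with wherever the base checkpoint lives; the file names are the standard HF ones):

cp /path/to/Llama-2-7b-chat-hf/tokenizer.model \
   /path/to/Llama-2-7b-chat-hf/tokenizer.json \
   /path/to/Llama-2-7b-chat-hf/tokenizer_config.json \
   ../dist/EAGLE-llama2-chat-7B-q4f16_1-MLC_SLM_gpu_1/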
sunzj commented 3 months ago

@bethalianovike I haven't tried running mlc_llm chat. As far as I know, the EAGLE model is only a forward (draft) layer, so it can't be used for chat on its own.

bethalianovike commented 3 months ago

@sunzj Got it, thanks! Actually, looking at the GitHub repo, there is another script we can use to run speculative decoding: https://github.com/mlc-ai/mlc-llm/blob/main/tests/python/serve/test_serve_engine_spec.py. It includes test code for both the small_draft and eagle modes. I succeeded in running the small_draft method with that code.
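
In case it helps, I just invoked it with pytest from the repo root; the model paths and libraries referenced inside that script have to exist locally, so they may need adjusting first:

pytest tests/python/serve/test_serve_engine_spec.py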

sunzj commented 3 months ago

@bethalianovike Hey, try changing the mode to server (--mode server), e.g.:

mlc_llm serve dist/Llama-2-7b-chat-hf-q4f16_1/params --model-lib dist/libs/Llama-2-7b-chat-hf-q4f16_1.so --mode server --additional-models dist/EAGLE-llama2-chat-7B-q4f16,dist/libs/EAGLE-llama2-chat-7B-q4f16.so --speculative-mode eagle

Actually, the server mode isn't the root cause: speculative mode needs max_num_sequence larger than spec_draft_length + 1. The default spec_draft_length is 4, so the minimum max_num_sequence is 6. Try:

mlc_llm serve dist/Llama-2-7b-chat-hf-q4f16_1/params --model-lib dist/libs/Llama-2-7b-chat-hf-q4f16_1.so --mode server --additional-models dist/EAGLE-llama2-chat-7B-q4f16,dist/libs/EAGLE-llama2-chat-7B-q4f16.so --speculative-mode eagle --device opencl --overrides max_num_sequence=6
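
If you would rather shrink the draft length than raise max_num_sequence, I think the same --overrides flag can carry both fields separated by a semicolon (this assumes spec_draft_length is exposed through --overrides in your build, which I haven't double-checked), e.g.:

mlc_llm serve dist/Llama-2-7b-chat-hf-q4f16_1/params --model-lib dist/libs/Llama-2-7b-chat-hf-q4f16_1.so --mode server --additional-models dist/EAGLE-llama2-chat-7B-q4f16,dist/libs/EAGLE-llama2-chat-7B-q4f16.so --speculative-mode eagle --device opencl --overrides "spec_draft_length=3;max_num_sequence=5"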

bethalianovike commented 3 months ago

@sunzj It runs perfectly now, thanks! Can we get the decode time or decode rate for this speculative decoding run?

sunzj commented 3 months ago

@bethalianovike Try curl http://127.0.0.1:8000/metrics. How do your results look? In my tests, eagle does not seem to improve decoding speed, but that could be due to my device: I'm not using an NVIDIA GPU, so it still needs tuning.
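
To get meaningful decode numbers out of /metrics, I first push some traffic through the OpenAI-style endpoint the server exposes, roughly like this (the model field should match the model path you passed to mlc_llm serve):

curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "dist/Llama-2-7b-chat-hf-q4f16_1/params", "max_tokens": 128, "messages": [{"role": "user", "content": "Tell me about speculative decoding."}]}'

Then read /metrics again, with and without --speculative-mode eagle, and compare.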

MrRace commented 3 months ago

@sunzj Can the speculative decoding mode be used on Android?

bethalianovike commented 3 months ago

@sunzj Yes, that works, thanks! On my device (NVIDIA GeForce RTX 4090), the decode time does seem to improve, but the answer without speculative decoding does not match the one with speculative decoding... Is your result like this as well?

Here's my decoding result with the main model Llama-2-7b-chat-hf-q0f16:

sunzj commented 3 months ago

@MrRace I am not sure whether Android can be set to speculative mode. As far as I have verified, local mode can also support speculative decoding; just set max_num_sequence to 6 or larger, e.g.:

mlc_llm serve dist/Llama-2-7b-chat-hf-q4f16_1/params --model-lib dist/libs/Llama-2-7b-chat-hf-q4f16_1.so --mode server --additional-models dist/EAGLE-llama2-chat-7B-q4f16,dist/libs/EAGLE-llama2-chat-7B-q4f16.so --speculative-mode eagle --device opencl --overrides max_num_sequence=6

sunzj commented 3 months ago

@bethalianovike It may not be caused by speculative decoding; an LLM won't necessarily output the same result even if you give it the same prompt. Check the temperature parameter: https://www.iguazio.com/glossary/llm-temperature/
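
For a more apples-to-apples check, you can also pin sampling in the request itself by setting temperature to 0 in the chat completion body. A sketch (the model field should match whatever you passed to mlc_llm serve):

curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "dist/Llama-2-7b-chat-hf-q4f16_1/params", "temperature": 0, "messages": [{"role": "user", "content": "Hello"}]}'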