modelscope / dash-infer

DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including x86 and ARMv9.
Apache License 2.0

Loading a Qwen2.5-3B model with the Qwen2-7B config reports "Only allowed now, your model Qwen2-7B" #42

Open tianyouyangying opened 3 weeks ago

tianyouyangying commented 3 weeks ago

OpenAI API error: Status Code 400, {"object":"error","message":"Only allowed now, your model Qwen2-7B","code":40301}
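The 400 response is produced by the serving layer rejecting a request whose `model` field does not match the model name the worker registered. A minimal sketch of that check (a hypothetical reconstruction for illustration, not the actual FastChat/dash-infer source):

```python
# Hypothetical reconstruction of the server-side model-name check that
# yields error code 40301 when the requested model does not match the
# model_name the worker was configured with.
def check_model(requested: str, served: str):
    """Return an OpenAI-style error dict on mismatch, else None."""
    if requested != served:
        return {
            "object": "error",
            "message": f"Only allowed now, your model {served}",
            "code": 40301,
        }
    return None
```

Here the worker registered as `Qwen2-7B` (from the config's `model_name`), so a request for `Qwen2.5-3B-Instruct` is rejected with exactly the message shown above.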

Deployed with Docker. The relevant part of the `docker run` command:

    --network host \
    -v /data/data/Qwen2.5-3B-Instruct:/workspace/qwen/Qwen2.5-3B-Instruct  \
    -v /data/data/config_qwen_v20_7b.json:/workspace/config_qwen_v20_7b.json \
    dockerpull.com/dashinfer/fschat_ubuntu_x86:v1.2.1 \
    -m /workspace/qwen/Qwen2.5-3B-Instruct \
    /workspace/config_qwen_v20_7b.json

Contents of config_qwen_v20_7b.json:

{
    "model_name": "Qwen2-7B",
    "model_type": "Qwen_v20",
    "model_path": "~/dashinfer_models/",
    "data_type": "float16",
    "device_type": "CPU",
    "device_ids": [
        0
    ],
    "multinode_mode": false,
    "engine_config": {
        "engine_max_length": 2048,
        "engine_max_batch": 8,
        "do_profiling": false,
        "num_threads": 0,
        "matmul_precision": "highest"
    },
    "generation_config": {
        "temperature": 0.7,
        "early_stopping": true,
        "top_k": 20,
        "top_p": 0.8,
        "repetition_penalty": 1.05,
        "presence_penalty": 0.0,
        "min_length": 0,
        "max_length": 2048,
        "no_repeat_ngram_size": 0,
        "eos_token_id": 151643,
        "seed": 1234,
        "stop_words_ids": [
            [
                151643
            ],
            [
                151644
            ],
            [
                151645
            ]
        ]
    },
    "convert_config": {
        "do_dynamic_quantize_convert": false
    },
    "quantization_config": {
        "activation_type": "float16",
        "weight_type": "uint8",
        "SubChannel": true,
        "GroupSize": 512
    }
}

Startup log:

using config file: /workspace/config_qwen_v20_7b.json
2024-10-15 09:58:00 | INFO | controller | args: Namespace(dispatch_method='shortest_queue', host='localhost', port=21001, ssl=False)
2024-10-15 09:58:00 | ERROR | stderr | INFO:     Started server process [16]
2024-10-15 09:58:00 | ERROR | stderr | INFO:     Waiting for application startup.
2024-10-15 09:58:00 | ERROR | stderr | INFO:     Application startup complete.
2024-10-15 09:58:00 | ERROR | stderr | INFO:     Uvicorn running on http://localhost:21001 (Press CTRL+C to quit)
2024-10-15 09:58:00 | INFO | openai_api_server | args: Namespace(allow_credentials=False, allowed_headers=['*'], allowed_methods=['*'], allowed_origins=['*'], api_keys=None, controller_address='http://localhost:21001', host='localhost', port=8000, ssl=False)
2024-10-15 09:58:00 | ERROR | stderr | INFO:     Started server process [17]
2024-10-15 09:58:00 | ERROR | stderr | INFO:     Waiting for application startup.
2024-10-15 09:58:00 | ERROR | stderr | INFO:     Application startup complete.
2024-10-15 09:58:00 | ERROR | stderr | INFO:     Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
2024-10-15 09:58:04 | INFO | model_worker | Loading the model ['Qwen2.5-3B-Instruct'] on worker 01dbdd5b, worker type: dash-infer worker...
2024-10-15 09:58:04 | INFO | stdout | ### convert_config: {'do_dynamic_quantize_convert': False}
2024-10-15 09:58:04 | INFO | stdout | ### engine_config: {'engine_max_length': 2048, 'engine_max_batch': 8, 'do_profiling': False, 'num_threads': 0, 'matmul_precision': 'highest'}
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20241015 09:58:04.585773    18 thread_pool.h:46] ThreadPool created with: 1
I20241015 09:58:04.586000    18 as_engine.cpp:233] AllSpark Init with Version: 1.2.1/(GitSha1:5ceddf95)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
E20241015 09:58:04.916028    18 as_engine.cpp:931] workers is empty
Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|█████     | 1/2 [00:01<00:01,  1.01s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.31it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:01<00:00,  1.25it/s]
2024-10-15 09:58:06 | ERROR | stderr | 
2024-10-15 09:58:08 | INFO | stdout | trans model from huggingface model: /workspace/qwen/Qwen2.5-3B-Instruct
2024-10-15 09:58:08 | INFO | stdout | Dashinfer model will save to  /root/dashinfer_models/
2024-10-15 09:58:08 | INFO | stdout | ### model_config: {'vocab_size': 151936, 'max_position_embeddings': 32768, 'hidden_size': 2048, 'intermediate_size': 11008, 'num_hidden_layers': 36, 'num_attention_heads': 16, 'use_sliding_window': False, 'sliding_window': 32768, 'max_window_layers': 70, 'num_key_value_heads': 2, 'hidden_act': 'silu', 'initializer_range': 0.02, 'rms_norm_eps': 1e-06, 'use_cache': True, 'rope_theta': 1000000.0, 'attention_dropout': 0.0, 'return_dict': True, 'output_hidden_states': False, 'output_attentions': False, 'torchscript': False, 'torch_dtype': torch.bfloat16, 'use_bfloat16': False, 'tf_legacy_loss': False, 'pruned_heads': {}, 'tie_word_embeddings': True, 'chunk_size_feed_forward': 0, 'is_encoder_decoder': False, 'is_decoder': False, 'cross_attention_hidden_size': None, 'add_cross_attention': False, 'tie_encoder_decoder': False, 'max_length': 20, 'min_length': 0, 'do_sample': False, 'early_stopping': False, 'num_beams': 1, 'num_beam_groups': 1, 'diversity_penalty': 0.0, 'temperature': 1.0, 'top_k': 50, 'top_p': 1.0, 'typical_p': 1.0, 'repetition_penalty': 1.0, 'length_penalty': 1.0, 'no_repeat_ngram_size': 0, 'encoder_no_repeat_ngram_size': 0, 'bad_words_ids': None, 'num_return_sequences': 1, 'output_scores': False, 'return_dict_in_generate': False, 'forced_bos_token_id': None, 'forced_eos_token_id': None, 'remove_invalid_values': False, 'exponential_decay_length_penalty': None, 'suppress_tokens': None, 'begin_suppress_tokens': None, 'architectures': ['Qwen2ForCausalLM'], 'finetuning_task': None, 'id2label': {0: 'LABEL_0', 1: 'LABEL_1'}, 'label2id': {'LABEL_0': 0, 'LABEL_1': 1}, 'tokenizer_class': None, 'prefix': None, 'bos_token_id': 151643, 'pad_token_id': None, 'eos_token_id': 151645, 'sep_token_id': None, 'decoder_start_token_id': None, 'task_specific_params': None, 'problem_type': None, '_name_or_path': '/workspace/qwen/Qwen2.5-3B-Instruct', '_commit_hash': None, '_attn_implementation_internal': 'sdpa', 'transformers_version': 
'4.43.1', 'model_type': 'qwen2', 'use_dynamic_ntk': False, 'use_logn_attn': False, 'rotary_emb_base': 1000000.0, 'size_per_head': 128}
2024-10-15 09:58:08 | INFO | stdout | save dimodel to  /root/dashinfer_models/Qwen2-7B_cpu_single_float16.dimodel
2024-10-15 09:58:08 | INFO | stdout | save ditensors to  /root/dashinfer_models/Qwen2-7B_cpu_single_float16.ditensors
2024-10-15 09:58:16 | INFO | stdout | parse weight time:  8.026057004928589
2024-10-15 09:58:16 | INFO | stdout | current allspark version major[ 1 ] minor[ 2 ] patch[ 1 ] commit =  5ceddf95
2024-10-15 09:58:16 | INFO | stdout | calculate md5 of dimodel =  b51d97a3e0a163de5f6123f7ad0fd77e
2024-10-15 09:58:16 | INFO | stdout | torch build meta:      model_name     :  Qwen2-7B_cpu_single_float16
2024-10-15 09:58:16 | INFO | stdout | torch build meta:      model_type     :  Qwen_v20
2024-10-15 09:58:16 | INFO | stdout | torch build meta:      save_dir   :  /root/dashinfer_models/
2024-10-15 09:58:16 | INFO | stdout | torch build meta:      multinode_mode     :  False
2024-10-15 09:58:16 | INFO | stdout | torch build meta:      data_type  :  float16
2024-10-15 09:58:16 | INFO | stdout | torch build meta:      do_dynamic_quantize_convert    :  False
2024-10-15 09:58:16 | INFO | stdout | torch build meta:      use_dynamic_ntk    :  False
2024-10-15 09:58:16 | INFO | stdout | torch build meta:      use_logn_attn  :  False
2024-10-15 09:58:16 | INFO | stdout | torch build meta:      model_sequence_length  :  2048
2024-10-15 09:58:16 | INFO | stdout | torch build meta:      seqlen_extrapolation   :  1.0
2024-10-15 09:58:16 | INFO | stdout | torch build meta:      rotary_base    :  1000000.0
2024-10-15 09:58:16 | INFO | stdout | serialize_model_from_torch: save model = true, time :  8.104979991912842
2024-10-15 09:58:16 | INFO | stdout | convert model from HF finished, build time is 8.105656862258911 seconds
I20241015 09:58:16.746803    18 as_engine.cpp:366] Detect avx512f supported, switch Prefill mode to flash
I20241015 09:58:16.746842    18 as_engine.cpp:384] Build model use following config:
AsModelConfig :
    model_name: Qwen2-7B_cpu_single_float16
    model_path: /root/dashinfer_models/Qwen2-7B_cpu_single_float16.dimodel
    weights_path: /root/dashinfer_models/Qwen2-7B_cpu_single_float16.ditensors
    compute_unit: CPU:0
    num_threads: 12
    matmul_precision: highest
    prefill_mode: AsPrefillFlashV2
    cache_mode: AsCacheDefault
    engine_max_length = 2048
    engine_max_batch = 8

I20241015 09:58:16.746910    18 as_engine.cpp:388] Load model from : /root/dashinfer_models/Qwen2-7B_cpu_single_float16.dimodel
I20241015 09:58:16.747004    18 as_engine.cpp:300] SetDeviceIds: DeviceIDs.size() 1
I20241015 09:58:16.747017    18 as_engine.cpp:307] Start create 1 Device: CPU workers.
I20241015 09:58:16.747486   215 cpu_context.cpp:114] CPUContext::InitMCCL() rank: 0 nRanks: 1
I20241015 09:58:16.827616    18 as_param_check.hpp:342] AsParamGuard check level = CHECKER_NORMAL. Engine version = 1.2 . Weight version = 1.2 . 
I20241015 09:58:16.829321    18 as_engine.cpp:511] Start BuildModel
I20241015 09:58:16.829511   216 as_engine.cpp:521] Start Build model for rank: 0
I20241015 09:58:16.829562   216 weight_manager.cpp:131] Start Loading weight for model RankInfo[0/1]
I20241015 09:58:16.829576   216 weight_manager.cpp:52] Start open model file /root/dashinfer_models/Qwen2-7B_cpu_single_float16.ditensors
I20241015 09:58:16.829613   216 weight_manager.cpp:59] Open model file success. 
I20241015 09:58:16.832871   216 weight_manager.cpp:107] Weight file header parse success...291 weight tensors are going to load. 
I20241015 09:58:21.656690   216 weight_manager.cpp:257] finish weight load for model RankInfo[0/1] time  spend: 4.827 seconds.
I20241015 09:58:21.659478   216 as_engine.cpp:525] Finish Build model for rank: 0
2024-10-15 09:58:21 | INFO | stdout | build model over, build time is 5.124737977981567
I20241015 09:58:21.661041    18 as_engine.cpp:672] StartModel: warming up...
I20241015 09:58:21.661065   217 as_engine.cpp:1612] | AllsparkStat | Req: Running: 0 Pending: 0      Prompt: 0 T/s  Gen: 0 T/s 
2024-10-15 10:02:14 | INFO | stdout | INFO:     127.0.0.1:36730 - "POST /list_models HTTP/1.1" 200 OK
2024-10-15 10:02:14 | INFO | stdout | INFO:     127.0.0.1:37592 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
2024-10-15 10:02:15 | INFO | stdout | INFO:     127.0.0.1:36736 - "POST /list_models HTTP/1.1" 200 OK
2024-10-15 10:02:15 | INFO | stdout | INFO:     127.0.0.1:37592 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
2024-10-15 10:02:16 | INFO | stdout | INFO:     127.0.0.1:36744 - "POST /list_models HTTP/1.1" 200 OK
2024-10-15 10:02:16 | INFO | stdout | INFO:     127.0.0.1:37592 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
2024-10-15 10:02:16 | INFO | stdout | INFO:     127.0.0.1:36746 - "POST /list_models HTTP/1.1" 200 OK
2024-10-15 10:02:16 | INFO | stdout | INFO:     127.0.0.1:37592 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
2024-10-15 10:02:24 | INFO | stdout | INFO:     127.0.0.1:60596 - "POST /list_models HTTP/1.1" 200 OK
2024-10-15 10:02:24 | INFO | stdout | INFO:     127.0.0.1:46342 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
2024-10-15 10:02:25 | INFO | stdout | INFO:     127.0.0.1:60608 - "POST /list_models HTTP/1.1" 200 OK
2024-10-15 10:02:25 | INFO | stdout | INFO:     127.0.0.1:46342 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
2024-10-15 10:02:26 | INFO | stdout | INFO:     127.0.0.1:60624 - "POST /list_models HTTP/1.1" 200 OK
2024-10-15 10:02:26 | INFO | stdout | INFO:     127.0.0.1:46342 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
2024-10-15 10:02:26 | INFO | stdout | INFO:     127.0.0.1:60632 - "POST /list_models HTTP/1.1" 200 OK
2024-10-15 10:02:26 | INFO | stdout | INFO:     127.0.0.1:46342 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request
laiwenzh commented 1 week ago

Your model is named Qwen2.5-3B-Instruct, so you need to change "model_name": "Qwen2-7B" in the config to "Qwen2.5-3B-Instruct".
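For reference, the relevant fields after the fix would look like this (keeping the rest of config_qwen_v20_7b.json unchanged):

```json
{
    "model_name": "Qwen2.5-3B-Instruct",
    "model_type": "Qwen_v20"
}
```

With this change the worker registers under the same name the client sends in the request's `model` field, so the 40301 mismatch error no longer occurs.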