modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

Cannot reproduce the inference example with InternVL2-1B #1649

Closed: yanyanyufei1 closed this issue 3 months ago

yanyanyufei1 commented 3 months ago

Describe the bug
(Screenshot attached in the original report; the full console log and traceback are reproduced under "Additional context" below.)

Your hardware and system info
- GPU: Tesla V100
- OS: Ubuntu 18.04.1 LTS
- CUDA version: 11.7
- torch version: 2.3.0+cu118
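The torch-side numbers above can be regenerated with a short snippet when filing reports (a minimal sketch; assumes a single visible GPU):

```python
# Minimal environment dump: torch build, the CUDA version torch was compiled
# against (which can differ from the system toolkit, as it does here), and
# the visible GPU.
import torch

print(torch.__version__)                   # e.g. 2.3.0+cu118
print(torch.version.cuda)                  # CUDA this torch build targets
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # e.g. Tesla V100
```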

Additional context

(internvl2) q00813667@g5500-v100-14:~/code/internvl-1b$ CUDA_VISIBLE_DEVICES=6 swift infer --model_type internvl2-1b --model_id_or_path /home/qyf/code/internvl-1b/weight/
run sh: python /home/qyf/code/internvl-1b/swift/swift/cli/infer.py --model_type internvl2-1b --model_id_or_path /home/qyf/code/internvl-1b/weight/
[INFO:swift] Successfully registered /home/qyf/code/internvl-1b/swift/swift/llm/data/dataset_info.json
[INFO:swift] Start time of running main: 2024-08-09 09:50:26.600485
[INFO:swift] ckpt_dir: None
[INFO:swift] Due to ckpt_dir being None, load_args_from_ckpt_dir is set to False.
[INFO:swift] Setting template_type: internvl2
[INFO:swift] Setting self.eval_human: True
[INFO:swift] Setting overwrite_generation_config: False
[INFO:swift] args: InferArguments(model_type='internvl2-1b', model_id_or_path='/home/qyf/code/internvl-1b/weight', model_revision='master', sft_type='full', template_type='internvl2', infer_backend='pt', ckpt_dir=None, result_dir=None, load_args_from_ckpt_dir=False, load_dataset_config=False, eval_human=True, seed=42, dtype='AUTO', dataset=[], val_dataset=[], dataset_seed=42, dataset_test_ratio=0.01, show_dataset_sample=10, save_result=True, system=None, tools_prompt='react_en', max_length=None, truncation_strategy='delete', check_dataset_strategy='none', model_name=[None, None], model_author=[None, None], quant_method=None, quantization_bit=0, hqq_axis=0, hqq_dynamic_config_path=None, bnb_4bit_comp_dtype='AUTO', bnb_4bit_quant_type='nf4', bnb_4bit_use_double_quant=True, bnb_4bit_quant_storage=None, max_new_tokens=2048, do_sample=True, temperature=0.3, top_k=20, top_p=0.7, repetition_penalty=1.0, num_beams=1, stop_words=[], rope_scaling=None, use_flash_attn=None, ignore_args_error=False, stream=True, merge_lora=False, merge_device_map='cpu', save_safetensors=True, overwrite_generation_config=False, verbose=None, local_repo_path=None, custom_register_path=None, custom_dataset_info=None, device_map_config_path=None, device_max_memory=[], hub_token=None, gpu_memory_utilization=0.9, tensor_parallel_size=1, max_num_seqs=256, max_model_len=None, disable_custom_all_reduce=True, enforce_eager=False, vllm_enable_lora=False, vllm_max_lora_rank=16, lora_modules=[], image_input_shape=None, image_feature_size=None, tp=1, cache_max_entry_count=0.8, quant_policy=0, vision_batch_size=1, self_cognition_sample=0, train_dataset_sample=-1, val_dataset_sample=None, safe_serialization=None, model_cache_dir=None, merge_lora_and_save=None, custom_train_dataset_path=[], custom_val_dataset_path=[], vllm_lora_modules=None)
[INFO:swift] Global seed set to 42
[INFO:swift] device_count: 1
[INFO:swift] Loading the model using model_dir: /home/qyf/code/internvl-1b/weight
[INFO:swift] Setting torch_dtype: torch.bfloat16
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO:swift] model_kwargs: {'device_map': 'cuda:0'}
FlashAttention is not installed.
Warning: Flash Attention is not available, use_flash_attn is set to False. (message repeated 24×)
[INFO:swift] model.max_model_len: 32768
[INFO:swift] model_config: InternVLChatConfig {
  "_commit_hash": null,
  "_name_or_path": "/home/qyf/code/internvl-1b/weight",
  "architectures": [ "InternVLChatModel" ],
  "auto_map": { "AutoConfig": "configuration_internvl_chat.InternVLChatConfig", "AutoModel": "modeling_internvl_chat.InternVLChatModel", "AutoModelForCausalLM": "modeling_internvl_chat.InternVLChatModel" },
  "downsample_ratio": 0.5,
  "dynamic_image_size": true,
  "force_image_size": 448,
  "hidden_size": 896,
  "llm_config": { "_name_or_path": "Qwen/Qwen2-0.5B-Instruct", "add_cross_attention": false, "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "attn_implementation": "eager", "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": 151643, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": 151645, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "silu", "hidden_size": 896, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "initializer_range": 0.02, "intermediate_size": 4864, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "length_penalty": 1.0, "max_length": 20, "max_position_embeddings": 32768, "max_window_layers": 24, "min_length": 0, "model_type": "qwen2", "no_repeat_ngram_size": 0, "num_attention_heads": 14, "num_beam_groups": 1, "num_beams": 1, "num_hidden_layers": 24, "num_key_value_heads": 2, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "prefix": null, "problem_type": null, "pruned_heads": {}, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "rms_norm_eps": 1e-06, "rope_theta": 1000000.0, "sep_token_id": null, "sliding_window": 32768, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_cache": true, "use_sliding_window": false, "vocab_size": 151655 },
  "max_dynamic_patch": 12,
  "max_position_embeddings": 32768,
  "min_dynamic_patch": 1,
  "model_type": "internvl_chat",
  "ps_version": "v2",
  "select_layer": -1,
  "template": "Hermes-2",
  "torch_dtype": "bfloat16",
  "transformers_version": null,
  "use_backbone_lora": 0,
  "use_llm_lora": 0,
  "use_thumbnail": true,
  "vision_config": { "_name_or_path": "", "add_cross_attention": false, "architectures": [ "InternVisionModel" ], "attention_dropout": 0.0, "bad_words_ids": null, "begin_suppress_tokens": null, "bos_token_id": null, "chunk_size_feed_forward": 0, "cross_attention_hidden_size": null, "decoder_start_token_id": null, "diversity_penalty": 0.0, "do_sample": false, "drop_path_rate": 0.0, "dropout": 0.0, "early_stopping": false, "encoder_no_repeat_ngram_size": 0, "eos_token_id": null, "exponential_decay_length_penalty": null, "finetuning_task": null, "forced_bos_token_id": null, "forced_eos_token_id": null, "hidden_act": "gelu", "hidden_size": 1024, "id2label": { "0": "LABEL_0", "1": "LABEL_1" }, "image_size": 448, "initializer_factor": 1.0, "initializer_range": 0.02, "intermediate_size": 4096, "is_decoder": false, "is_encoder_decoder": false, "label2id": { "LABEL_0": 0, "LABEL_1": 1 }, "layer_norm_eps": 1e-06, "length_penalty": 1.0, "max_length": 20, "min_length": 0, "model_type": "intern_vit_6b", "no_repeat_ngram_size": 0, "norm_type": "layer_norm", "num_attention_heads": 16, "num_beam_groups": 1, "num_beams": 1, "num_channels": 3, "num_hidden_layers": 24, "num_return_sequences": 1, "output_attentions": false, "output_hidden_states": false, "output_scores": false, "pad_token_id": null, "patch_size": 14, "prefix": null, "problem_type": null, "pruned_heads": {}, "qk_normalization": false, "qkv_bias": true, "remove_invalid_values": false, "repetition_penalty": 1.0, "return_dict": true, "return_dict_in_generate": false, "sep_token_id": null, "suppress_tokens": null, "task_specific_params": null, "temperature": 1.0, "tf_legacy_loss": false, "tie_encoder_decoder": false, "tie_word_embeddings": true, "tokenizer_class": null, "top_k": 50, "top_p": 1.0, "torch_dtype": "bfloat16", "torchscript": false, "transformers_version": "4.37.2", "typical_p": 1.0, "use_bfloat16": true, "use_flash_attn": true }
}

[INFO:swift] generation_config: GenerationConfig {
  "do_sample": true,
  "eos_token_id": 151645,
  "max_new_tokens": 2048,
  "pad_token_id": 151643,
  "temperature": 0.3,
  "top_k": 20,
  "top_p": 0.7
}

[INFO:swift] [vision_model.embeddings.class_embedding]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
[INFO:swift] [vision_model.embeddings.position_embedding]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
[INFO:swift] [vision_model.embeddings.patch_embedding.weight]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
[INFO:swift] [vision_model.embeddings.patch_embedding.bias]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
[INFO:swift] [vision_model.encoder.layers.0.ls1]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
[INFO:swift] [vision_model.encoder.layers.0.ls2]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
[INFO:swift] [vision_model.encoder.layers.0.attn.qkv.weight]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
[INFO:swift] [vision_model.encoder.layers.0.attn.qkv.bias]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
[INFO:swift] [vision_model.encoder.layers.0.attn.proj.weight]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
[INFO:swift] [vision_model.encoder.layers.0.attn.proj.bias]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
[INFO:swift] [vision_model.encoder.layers.0.mlp.fc1.weight]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
[INFO:swift] [vision_model.encoder.layers.0.mlp.fc1.bias]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
[INFO:swift] [vision_model.encoder.layers.0.mlp.fc2.weight]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
[INFO:swift] [vision_model.encoder.layers.0.mlp.fc2.bias]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
[INFO:swift] [vision_model.encoder.layers.0.norm1.weight]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
[INFO:swift] [vision_model.encoder.layers.0.norm1.bias]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
[INFO:swift] [vision_model.encoder.layers.0.norm2.weight]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
[INFO:swift] [vision_model.encoder.layers.0.norm2.bias]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
[INFO:swift] [vision_model.encoder.layers.1.ls1]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
[INFO:swift] [vision_model.encoder.layers.1.ls2]: requires_grad=False, dtype=torch.bfloat16, device=cuda:0
[INFO:swift] ...
[INFO:swift] InternVLChatModel(
  (vision_model): InternVisionModel(
    (embeddings): InternVisionEmbeddings(
      (patch_embedding): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14))
    )
    (encoder): InternVisionEncoder(
      (layers): ModuleList(
        (0-23): 24 x InternVisionEncoderLayer(
          (attn): InternAttention(
            (qkv): Linear(in_features=1024, out_features=3072, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (mlp): InternMLP(
            (act): GELUActivation()
            (fc1): Linear(in_features=1024, out_features=4096, bias=True)
            (fc2): Linear(in_features=4096, out_features=1024, bias=True)
          )
          (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (drop_path1): Identity()
          (drop_path2): Identity()
        )
      )
    )
  )
  (language_model): Qwen2ForCausalLM(
    (model): Qwen2Model(
      (embed_tokens): Embedding(151655, 896)
      (layers): ModuleList(
        (0-23): 24 x Qwen2DecoderLayer(
          (self_attn): Qwen2SdpaAttention(
            (q_proj): Linear(in_features=896, out_features=896, bias=True)
            (k_proj): Linear(in_features=896, out_features=128, bias=True)
            (v_proj): Linear(in_features=896, out_features=128, bias=True)
            (o_proj): Linear(in_features=896, out_features=896, bias=False)
            (rotary_emb): Qwen2RotaryEmbedding()
          )
          (mlp): Qwen2MLP(
            (gate_proj): Linear(in_features=896, out_features=4864, bias=False)
            (up_proj): Linear(in_features=896, out_features=4864, bias=False)
            (down_proj): Linear(in_features=4864, out_features=896, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): Qwen2RMSNorm()
          (post_attention_layernorm): Qwen2RMSNorm()
        )
      )
      (norm): Qwen2RMSNorm()
    )
    (lm_head): Linear(in_features=896, out_features=151655, bias=False)
  )
  (mlp1): Sequential(
    (0): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
    (1): Linear(in_features=4096, out_features=896, bias=True)
    (2): GELU(approximate='none')
    (3): Linear(in_features=896, out_features=896, bias=True)
  )
)
[INFO:swift] InternVLChatModel: 938.1590M Params (0.0000M Trainable [0.0000%]), 100.6641M Buffers.
[INFO:swift] system: 你是由上海人工智能实验室联合商汤科技开发的书生多模态大模型,英文名叫InternVL, 是一个有用无害的人工智能助手。
[INFO:swift] Input exit or quit to exit the conversation.
[INFO:swift] Input multi-line to switch to multi-line input mode.
[INFO:swift] Input reset-system to reset the system and clear the history.
[INFO:swift] Input clear to clear the history.
[INFO:swift] Please enter the conversation content first, followed by the path to the multimedia file.
<<< describe the image
Input an image path or URL <<< http://modelscope-open.oss-cn-hangzhou.aliyuncs.com/images/cat.png
Exception in thread Thread-1 (generate):
Traceback (most recent call last):
  File "/home/qyf/software/miniconda3/envs/internvl2/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/home/qyf/software/miniconda3/envs/internvl2/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/qyf/code/internvl-1b/swift/swift/llm/utils/model.py", line 4182, in _new_generate
    return generate(*args, **kwargs)
  File "/home/qyf/software/miniconda3/envs/internvl2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/qyf/.cache/huggingface/modules/transformers_modules/weight/modeling_internvl_chat.py", line 334, in generate
    outputs = self.language_model.generate(
  File "/home/qyf/software/miniconda3/envs/internvl2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/qyf/software/miniconda3/envs/internvl2/lib/python3.10/site-packages/transformers/generation/utils.py", line 1334, in generate
    inputs_tensor, model_input_name, model_kwargs = self._prepare_model_inputs(
  File "/home/qyf/software/miniconda3/envs/internvl2/lib/python3.10/site-packages/transformers/generation/utils.py", line 402, in _prepare_model_inputs
    model_kwargs["input_ids"] = self._maybe_initialize_input_ids_for_generation(
  File "/home/qyf/software/miniconda3/envs/internvl2/lib/python3.10/site-packages/transformers/generation/utils.py", line 435, in _maybe_initialize_input_ids_for_generation
    raise ValueError("bos_token_id has to be defined when no input_ids are provided.")
ValueError: bos_token_id has to be defined when no input_ids are provided.
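The failing check is transformers' `_maybe_initialize_input_ids_for_generation`, which requires a `bos_token_id` whenever `generate()` is called without `input_ids`. A quick way to see what the checkpoint actually declares (a sketch, using the local weight path from this report; a diagnostic probe, not a confirmed fix):

```python
# Sketch: inspect the bos/eos ids the downloaded checkpoint declares. The
# generation_config swift built (dumped above) has eos_token_id and
# pad_token_id but no bos_token_id, which is what the ValueError complains about.
from transformers import AutoTokenizer, GenerationConfig

model_dir = '/home/qyf/code/internvl-1b/weight'  # local path from the report

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
print('tokenizer bos/eos:', tokenizer.bos_token_id, tokenizer.eos_token_id)

# Assumes the checkpoint ships a generation_config.json.
gen_cfg = GenerationConfig.from_pretrained(model_dir)
print('generation_config bos:', gen_cfg.bos_token_id)  # None reproduces this error path
```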

Jintao-Huang commented 3 months ago

Try upgrading ms-swift.
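(The upgrade itself is one line: `pip install -U ms-swift`.)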

yanyanyufei1 commented 3 months ago

The ms-swift version is 2.3.0.dev0. I also tried 2.2.5 and similar versions, and swapped among several transformers versions as well; the same problem persists.

Jintao-Huang commented 3 months ago

Try re-pulling the internvl-1b repo code; that might fix it.

It works fine on my side with ms-swift==2.2.5.
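For reference, the CLI call from the report maps onto the Python API roughly as follows (a sketch against the ms-swift 2.x interface; it starts the same interactive loop as the CLI):

```python
# Rough Python-API equivalent of:
#   CUDA_VISIBLE_DEVICES=6 swift infer --model_type internvl2-1b \
#       --model_id_or_path /home/qyf/code/internvl-1b/weight/
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '6'  # must be set before torch/swift are imported

from swift.llm import InferArguments, infer_main

args = InferArguments(
    model_type='internvl2-1b',
    model_id_or_path='/home/qyf/code/internvl-1b/weight',
)
infer_main(args)
```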

(Screenshot from 2024-08-09 14:41:12 showing a successful run.)

Jintao-Huang commented 3 months ago

Has this been resolved?

yanyanyufei1 commented 3 months ago

One moment; I have a few other things on hand. I can look at this this afternoon.

yanyanyufei1 commented 3 months ago

I re-downloaded internvl2-1b as follows, and it still fails:

from modelscope import snapshot_download
model_dir = snapshot_download('OpenGVLab/InternVL2-1B')

In the same environment, internvl2-2b works.
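One thing that may be worth ruling out, since the traceback loads `modeling_internvl_chat.py` from `~/.cache/huggingface/modules/transformers_modules`: a stale cached copy of the remote code can shadow a fresh download. A sketch for checking what the new snapshot actually contains:

```python
# Sketch: print where the fresh snapshot landed and which remote-code files it
# ships. If the traceback still points at an old transformers_modules copy,
# clearing ~/.cache/huggingface/modules may be worth a try (an assumption,
# not a confirmed fix for this issue).
import os
from modelscope import snapshot_download

model_dir = snapshot_download('OpenGVLab/InternVL2-1B')
print(model_dir)
print(sorted(f for f in os.listdir(model_dir) if f.endswith('.py')))
```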

Betty-J commented 2 months ago

I encountered the same problem. Have you solved it?

Jintao-Huang commented 2 months ago

Which transformers version are you using?

Betty-J commented 2 months ago

> Which transformers version are you using?

transformers 4.37.2. Exactly the same error.

Jintao-Huang commented 2 months ago

I tested transformers==4.44.* here, and it works fine.
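(Upgrading in place is a one-liner: `pip install -U "transformers==4.44.*"`.)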

Jintao-Huang commented 2 months ago

> transformers 4.37.2. Exactly the same error.

I can reproduce the situation.

Betty-J commented 2 months ago

> I tested transformers==4.44.* here, and it works fine.

Thanks, it works.