lyuchenyang / Macaw-LLM

Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration

Performance of the model #22

Closed chatsci closed 1 year ago

chatsci commented 1 year ago

Hello, I tried to load the pre-trained model you provided and run the following example from the AVSD data:

{
    "instruction": "Is the woman already in the room?",
    "input": "",
    "output": "Yes ahe is already in the room",
    "image": null,
    "audio": null,
    "video": "7UPGT.mp4"
},

Basically, to prepare the Whisper, CLIP, and LLaMA models, I used the following:

# Save the Whisper, CLIP, and LLaMA models locally for future use.
from transformers import CLIPModel, LlamaModel, WhisperForConditionalGeneration

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
whisper_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-base")
llama7b_model = LlamaModel.from_pretrained("decapoda-research/llama-7b-hf")

clip_model.save_pretrained('pretrained_models/clip_model/')
whisper_model.save_pretrained('pretrained_models/whisper_model/')
llama7b_model.save_pretrained('pretrained_models/llama7b_model/')

To load the Macaw model you provided, I used the following:

if __name__ == "__main__":
    clip_config = CLIPConfig.from_pretrained('pretrained_models/clip_model/')
    whisper_config = WhisperConfig.from_pretrained('pretrained_models/whisper_model/')
    llm_config = AutoConfig.from_pretrained('pretrained_models/llama7b_model/')
    tokenizer = get_tokenizer("pretrained_models/macaw/", tokenizer_cls=LlamaTokenizer)
    llm_config.vocab_size = len(tokenizer)
    print("llm_config: ", llm_config)

model_config = MM_LLMs_Config(
    n_frames=6, 
    attention_heads=32, 
    image_conv_kernel=48, 
    image_conv_stride=36, 
    video_conv_kernel=36, 
    video_conv_stride=30, 
    audio_conv_kernel=240, 
    audio_conv_stride=220,
    clip_config=clip_config, whisper_config=whisper_config, llm_config=llm_config
)

macaw_model = MM_LLMs.from_pretrained(
    'pretrained_models/macaw/',
    config=model_config,
    # load_in_8bit=True,
    # torch_dtype=torch.float16,
    # device_map=device_map,
)
TOKENIZER = get_tokenizer("pretrained_models/macaw/", tokenizer_cls=LlamaTokenizer)

I run the model with:

macaw_model.eval()
with torch.no_grad():
    generate_ids = macaw_model(data_item)
print("generate_ids: ", generate_ids)
input_texts = TOKENIZER.batch_decode(data_item["input_ids"], skip_special_tokens=True, clean_up_tokenization_spaces=False)
generated_texts = TOKENIZER.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print("input_texts: ", input_texts)
print("generated_texts: ", generated_texts)

Then I tested the above AVSD example. What I get is:

input_texts: ['Below is an instruction that describes a task, with or without input. Write a response that appropriately completes the request.\n\n### Instruction:\nIs the woman already in the room?\n\n### Response:\n\n']
generated_texts: ['\n\n']
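(For reference, the prompt shown in input_texts follows the Alpaca-style template. A minimal sketch of how it can be reconstructed, with the wording copied verbatim from the decoded output above; the helper name is just for illustration:)

# Alpaca-style template, copied verbatim from the decoded input_texts above.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, with or without input. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n\n"
)

def build_prompt(instruction: str) -> str:
    # Fill the template for an instruction-only example (empty "input" field).
    return PROMPT_TEMPLATE.format(instruction=instruction)

prompt = build_prompt("Is the woman already in the room?")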

So you can see, the output is nonsense. I tried some other examples, and I also tried pure text input, but the results are not satisfying. May I ask what might be wrong?

chatsci commented 1 year ago

For this input example I described above:

{ "instruction": "Is the woman already in the room?", "input": "", "output": "Yes ahe is already in the room", "image": null, "audio": null, "video": "7UPGT.mp4" },

To debug, I printed the model input (the data_item variable that is fed into Macaw) and saved it in the following file: https://drive.google.com/file/d/10kJvUA5zvs6PdejTfWu04e102A2ZL13C/view?usp=sharing

Do you think my input is correct, i.e. the same as what your model expects?
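(For context, here is roughly how I assemble data_item. This is only a sketch; the field names and tensor shapes are my assumptions based on the saved file above, not the repo's official preprocessing:)

import torch

# Sketch of building data_item for the AVSD example above.
# Assumptions (mine, not from the repo): CLIP ViT-B/16 takes 224x224 frames,
# n_frames=6 matches the MM_LLMs_Config above, and missing modalities are None.
prompt_ids = TOKENIZER(prompt, return_tensors="pt")
video_frames = torch.zeros(1, 6, 3, 224, 224)  # placeholder for 6 sampled frames

data_item = {
    "input_ids": prompt_ids["input_ids"],
    "attention_mask": prompt_ids["attention_mask"],
    "images": None,   # no image in this example
    "audios": None,   # no audio in this example
    "videos": video_frames,
}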

chatsci commented 1 year ago

It seems I found the problem: I need to remove the pad_token_id and eos_token_id from data_item["input_ids"]. Thanks.
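(For anyone hitting the same issue, a minimal sketch of what I do now before calling the model; the isin-based filtering and the batch-size-1 assumption are my own:)

import torch

# Strip pad and eos tokens from the prompt before generation (sketch only;
# assumes batch size 1, and discards None ids in case pad_token_id is unset).
bad_ids = {TOKENIZER.pad_token_id, TOKENIZER.eos_token_id} - {None}
input_ids = data_item["input_ids"]  # shape (1, seq_len)
keep = ~torch.isin(input_ids, torch.tensor(sorted(bad_ids)))
data_item["input_ids"] = input_ids[keep].unsqueeze(0)
if data_item.get("attention_mask") is not None:
    data_item["attention_mask"] = torch.ones_like(data_item["input_ids"])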

WHUCSstetman commented 7 months ago

Hello, I'm also trying to load a pre-trained model, but I couldn't find any relevant information about this in the README file. Could you please share your experience or provide guidance on how to load the pre-trained model? Thanks a lot!