microsoft / Pengi

An Audio Language model for Audio Tasks
https://arxiv.org/abs/2305.11834
MIT License

Getting generic text output rather than anything related to the audio #13

Closed. TKELKAR123 closed this issue 4 months ago.

TKELKAR123 commented 7 months ago

I keep getting generic output that has nothing to do with the audio:

[([' man <|endoftext|>The following is a list of the most common phrases used in the English language. <|endoftext|>The following is a list of the most common phrases used in the English language. <|endoftext|>The following is a list of the most common phrases used in the English language.\n\nThe following is a list of the most common phrases used in the English language.\n\nThe following is a list of the most common phrases used in the English language.\n\nThe following is a list of the', ' man <|endoftext|>The following is a list of the most popular songs from the popular music video game, The Legend of Zelda: Breath of the Wild. <|endoftext|>The following is a list of the most popular songs from the popular video game, The Legend of Zelda: Breath of the Wild. <|endoftext|>The following is a list of the most popular songs from the popular video game, The Legend of Zelda: Breath of the Wild.\n\nContents show]\n\nThe Legend of Zelda: Breath of', ' man <|endoftext|>The following is a list of the most popular songs in the world. <|endoftext|>The following is a list of the most popular songs in the world. <|endoftext|>The following is a list of the most popular songs in the world.\n\nThe following is a list of the most popular songs in the world.\n\nThe following is a list of the most popular songs in the world.\n\nThe following is a list of the most popular songs in the world.\n\n\n'], tensor([-0.0208, -0.0332, -0.0334]))]

This is my input:

from wrapper import PengiWrapper as Pengi

print("Starting script...")

pengi = Pengi(config="base") #base or base_no_text_enc

print("Pengi initialized...")

transcribe_audio = pengi.generate(audio_paths=["output.wav"],
                                  text_prompts=["Transcribe the audio."],
                                  add_texts=["Alphanumeric sequence"],
                                  max_len=100,
                                  beam_size=3,
                                  temperature=0.1,
                                  stop_token="<|ENDOFTEXT|>",
                                  )

print("Audio transcribed...")

# Print the result
print(transcribe_audio)

The audio file itself plays fine. Could it be in the wrong format for Pengi?
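
One detail visible in the snippet above: the returned strings contain the lowercase marker `<|endoftext|>`, while the call passes `stop_token="<|ENDOFTEXT|>"` in uppercase, so generation may not be stopping where intended. Whether that mismatch explains the generic continuations is an assumption and is not confirmed anywhere in this thread. As a minimal sketch, the text after the first marker can be trimmed in post-processing. The helper names `truncate_at_stop` and `clean_results` are hypothetical and not part of the Pengi wrapper; the code assumes the result has the shape printed above, a list of `(list_of_strings, scores)` pairs.

```python
import re
from typing import Any, List, Sequence, Tuple

# Marker as it appears in the printed output above.
STOP_MARKER = "<|endoftext|>"


def truncate_at_stop(text: str, marker: str = STOP_MARKER) -> str:
    """Return the text before the first marker, with whitespace stripped.

    The match is case-insensitive, so "<|ENDOFTEXT|>" is handled as well.
    If the marker is absent, the whole (stripped) text is returned.
    """
    match = re.search(re.escape(marker), text, flags=re.IGNORECASE)
    if match is None:
        return text.strip()
    return text[: match.start()].strip()


def clean_results(
    results: Sequence[Tuple[Sequence[str], Any]],
) -> List[Tuple[List[str], Any]]:
    """Apply truncate_at_stop to every beam; scores are passed through untouched.

    Assumes each element is a (beams, scores) pair, as in the output above.
    """
    return [([truncate_at_stop(b) for b in beams], scores) for beams, scores in results]


if __name__ == "__main__":
    # Shortened stand-in for the output posted above (scores as plain floats).
    sample = [([" man <|endoftext|>The following is a list of the"], [-0.0208])]
    print(clean_results(sample))
```

Trimming only hides the repeated continuation. It does not make the caption itself more accurate, so the remaining text still has to be checked against the audio.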

AFMSB commented 6 months ago

How did you get it to run? I am stuck at the model checkpoint loading step (I am using a MacBook Pro with an M1 Pro chip). I already tried pip and am now using conda.

RuntimeError: Error(s) in loading state_dict for PENGI: Unexpected key(s) in state_dict: "caption_encoder.base.embeddings.position_ids", "caption_decoder.gpt.transformer.h.0.attn.bias", "caption_decoder.gpt.transformer.h.0.attn.masked_bias", "caption (it keeps going...)

soham97 commented 6 months ago

@AFMSB It looks similar to this issue: https://github.com/microsoft/Pengi/issues/11
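
The unexpected keys quoted in the error (for example `position_ids` and `attn.bias`) are extra entries in the checkpoint that the freshly built model does not declare. The linked issue is the authoritative place for the actual fix. As a hedged illustration of the general idea only, a checkpoint's keys can be compared against the keys the model expects, and the extras set aside for inspection before loading. The function `split_state_dict` is hypothetical and is not part of Pengi or PyTorch; in a PyTorch setting the expected keys would typically come from `model.state_dict().keys()`.

```python
from typing import Any, Dict, Iterable, List, Tuple


def split_state_dict(
    state_dict: Dict[str, Any],
    expected_keys: Iterable[str],
) -> Tuple[Dict[str, Any], List[str]]:
    """Split a checkpoint mapping into entries the model expects and extras.

    Returns (kept, unexpected): `kept` holds only the expected entries and
    `unexpected` is the sorted list of keys that were left out.
    """
    expected = set(expected_keys)
    kept = {key: value for key, value in state_dict.items() if key in expected}
    unexpected = sorted(key for key in state_dict if key not in expected)
    return kept, unexpected


if __name__ == "__main__":
    # Toy stand-in for a checkpoint; real values would be tensors.
    checkpoint = {
        "layer.weight": 1.0,
        "layer.bias": 0.0,
        "embeddings.position_ids": [0, 1, 2],  # extra entry, as in the error
    }
    kept, unexpected = split_state_dict(checkpoint, ["layer.weight", "layer.bias"])
    print("kept:", sorted(kept))
    print("unexpected:", unexpected)
```

PyTorch's `Module.load_state_dict` also accepts `strict=False`, which skips non-matching keys and reports them in its return value. Either way, dropping keys silently can hide a real mismatch, so it is worth reading the list of unexpected keys before ignoring them.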