whisper identified the wrong language

huggingface / transformers

whisper identified the wrong language #23009

Closed: LYPinASR closed this issue 1 year ago

LYPinASR commented 1 year ago

Feature request

When I follow the long-form transcription example for whisper-large with Korean audio, the result is English. After fine-tuning the whisper-large model on some Korean data, however, the checkpoint outputs Korean. I also tested other model sizes, and all of them output English. I am confused by this. What should I do to get Korean output from the original model? Thank you!

Motivation

Test whisper in Korean.

Your contribution

Test whisper in Korean.

sgugger commented 1 year ago

Hi there. Questions like this are better suited to the forums or to a discussion on the model page, as we keep issues for bugs and feature requests only.

chenht2021 commented 1 year ago

If you use the pipeline, you should add an option like generate_kwargs = {"task":"transcribe", "language":"<|fr|>"}

ref1: https://colab.research.google.com/drive/1rS1L4YSJqKUH_3YxIQHBI982zso23wor#scrollTo=dPD20IkEDsbG
ref2: https://github.com/huggingface/transformers/issues/22331

However, I think the default task should be "transcribe", not "translate". I maintain that this is a bug.
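The suggestion above can be sketched as follows. This is a minimal sketch, not a confirmed fix: the helper names `build_generate_kwargs` and `transcribe`, the Korean language code, the checkpoint name `openai/whisper-large`, and the 30-second chunk length are assumptions chosen for illustration. Whether the option takes effect may depend on the installed transformers version.

```python
def build_generate_kwargs(language_code: str, task: str = "transcribe") -> dict:
    """Build the generate_kwargs dict in the form suggested above.

    The language is written in Whisper's special-token form, e.g. "<|ko|>".
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")
    return {"task": task, "language": f"<|{language_code}|>"}


def transcribe(audio_path: str, language_code: str = "ko",
               model_name: str = "openai/whisper-large") -> str:
    """Run chunked long-form transcription with an explicit language and task."""
    # Imported lazily so the helper above can be used without transformers.
    from transformers import pipeline

    pipe = pipeline(
        "automatic-speech-recognition",
        model=model_name,
        chunk_length_s=30,  # assumed chunk length for long-form audio
    )
    result = pipe(audio_path, generate_kwargs=build_generate_kwargs(language_code))
    return result["text"]
```

Calling `transcribe("sample.wav")` (a hypothetical file name) would download the checkpoint and return the recognized text.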

LYPinASR commented 1 year ago

I have solved the problem.
Step 1: Upgrade transformers. This did not fix it.
Step 2: Add an option like generate_kwargs = {"task":"transcribe", "language":"<|fr|>"}. This did not fix it.
Step 3: Add a line like pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="ko", task="transcribe"). This fixed it.

However, I still don't understand why the original model outputs English while the fine-tuned model outputs Korean.
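Step 3 can be wrapped in a small helper, sketched below. The function name `force_language` is hypothetical, and `pipe` is assumed to be an automatic-speech-recognition pipeline wrapping a Whisper checkpoint. This mirrors the approach reported in this thread; other transformers releases may handle language and task selection differently.

```python
def force_language(pipe, language: str = "ko", task: str = "transcribe"):
    """Pin the language and task tokens on an existing Whisper ASR pipeline.

    Returns the forced decoder ids that were written to the model config.
    """
    # get_decoder_prompt_ids returns (position, token_id) pairs for the
    # requested language and task tokens.
    forced_ids = pipe.tokenizer.get_decoder_prompt_ids(language=language, task=task)
    # Writing them to the config makes generation start from this prompt.
    pipe.model.config.forced_decoder_ids = forced_ids
    return forced_ids
```

After calling `force_language(pipe)`, subsequent calls to `pipe(...)` use the pinned language and task.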

chenht2021 commented 1 year ago

Maybe you can check your fine-tuned model's config.json or generation_config.json and double-check the default task type. I think it is null or "transcribe".
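One way to run this check on a local checkpoint directory is sketched below. The function name `read_decoding_defaults` and the list of inspected keys are assumptions made for illustration; a given checkpoint may store its defaults under other fields.

```python
import json
from pathlib import Path

# Keys assumed to influence the language/task choice; adjust as needed.
KEYS = ("forced_decoder_ids", "task", "language")


def read_decoding_defaults(checkpoint_dir: str) -> dict:
    """Collect language/task-related fields from a local checkpoint directory."""
    found = {}
    for filename in ("config.json", "generation_config.json"):
        path = Path(checkpoint_dir) / filename
        if not path.is_file():
            continue  # a checkpoint may ship only one of the two files
        data = json.loads(path.read_text(encoding="utf-8"))
        # Missing keys are reported as None.
        found[filename] = {key: data.get(key) for key in KEYS}
    return found
```

Comparing the output for the original and the fine-tuned checkpoint directories shows whether their default language or task settings differ.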
