huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Chinese test data were transcribed as English #21994

Closed ben-8878 closed 1 year ago

ben-8878 commented 1 year ago

After adding the following code to an ASR server, I send Chinese audio data but get an English result. I don't know how to set the language; I tried using "forced_decoder_ids" to set it, but that failed.

from transformers import pipeline

transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-large", device=1)
transcriber.model.config.forced_decoder_ids = transcriber.tokenizer.get_decoder_prompt_ids(language="zh", task="transcribe")
result = transcriber(audio_bytes, chunk_length_s=30)
print(result)

My transformers version is 4.26.1.

sgugger commented 1 year ago

cc @ArthurZucker and @sanchit-gandhi, though this question would be more appropriate for the forums.

ArthurZucker commented 1 year ago

Hey, this is related to the update of the generate() function. The issue is that you are not modifying model.generation_config. If you want to set the language properly, the following will work:

transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-large", device=1)
transcriber.model.generation_config.forced_decoder_ids = transcriber.processor.get_decoder_prompt_ids(language="zh", task="transcribe")
result = transcriber(audio_bytes, chunk_length_s=30)
print(result)

We updated the generation config, which by default should automatically detect the language, but the task is set to translate rather than transcribe. cc @sanchit-gandhi for visibility; this was introduced by #20388.
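
As a quick sanity check (a minimal sketch, reusing the `transcriber` pipeline from the snippet above), you can print the forced ids before and after the override to confirm the generation config picked it up:

# Sketch: confirm the generation config actually carries the override.
print(transcriber.model.generation_config.forced_decoder_ids)  # defaults before the override
transcriber.model.generation_config.forced_decoder_ids = transcriber.tokenizer.get_decoder_prompt_ids(language="zh", task="transcribe")
print(transcriber.model.generation_config.forced_decoder_ids)  # should now force <|zh|> and <|transcribe|>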

sanchit-gandhi commented 1 year ago

Resolved in https://github.com/huggingface/transformers/pull/21965 - Whisper now respects the config.forced_decoder_ids if the language is not set in the args / generation_config

The most up-to-date way of passing the language is to use the args if possible:

result = transcriber(audio_bytes, chunk_length_s=30, generate_kwargs={"language":"zh"})
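
For completeness, a self-contained version of that call (a sketch; `audio_bytes` is the raw audio from the original report):

from transformers import pipeline

transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-large")
result = transcriber(audio_bytes, chunk_length_s=30, generate_kwargs={"language": "zh"})
print(result["text"])
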
ben-8878 commented 1 year ago

Resolved in #21965 - Whisper now respects the config.forced_decoder_ids if the language is not set in the args / generation_config

The most up-to-date way of passing the language is to use the args if possible:

result = transcriber(audio_bytes, chunk_length_s=30, generate_kwargs={"language":"zh"})

@sanchit-gandhi I upgraded transformers to version 4.27.1 and tried it again, but got the following error:

   f"Unsupported language: {self.language}. Language should be one of:"
  File "/home/ybZhang/miniconda3/envs/whister/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1177, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'WhisperForConditionalGeneration' object has no attribute 'language'
ben-8878 commented 1 year ago

@sgugger Another issue: using the pipeline, I get a translation result, not a transcription result. How do I specify the transcription task and the language with the pipeline?

from transformers import pipeline

transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-small")
transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
ArthurZucker commented 1 year ago

I am sorry, but I can't reproduce your errors. The following notebook has examples of setting the task and the language on whisper-small, and both work. Did you run pip install --upgrade transformers? Here is my output (the expected behaviour):

(screenshot of the expected output)
ben-8878 commented 1 year ago

I am sorry, but I can't reproduce your errors. The following notebook has examples of setting the task and the language on whisper-small, and both work. Did you run pip install --upgrade transformers? Here is my output (the expected behaviour).

@ArthurZucker Yes, I ran pip install --upgrade transformers and followed https://colab.research.google.com/drive/1rS1L4YSJqKUH_3YxIQHBI982zso23wor#scrollTo=vqXoVLesTUE6 , but I still get an error:

    self._validate_model_kwargs(model_kwargs.copy())
  File "/home/ybZhang/miniconda3/envs/whister/lib/python3.8/site-packages/transformers/generation/utils.py", line 1090, in _validate_model_kwargs
    raise ValueError(
ValueError: The following `model_kwargs` are not used by the model: ['task', 'language'] (note: typos in the generate arguments will also show up in this list)
ben-8878 commented 1 year ago

@ArthurZucker When 4.26.1 was the latest version, I tried it and it failed. Now that I have updated to 4.27.2, it works.

ben-8878 commented 1 year ago

@ArthurZucker How do I modify the "condition_on_previous_text" parameter? This parameter is provided by whisper and is important for me.

  File "/home/ybZhang/miniconda3/envs/whister/lib/python3.8/site-packages/transformers/models/whisper/modeling_whisper.py", line 1606, in generate
    return super().generate(
  File "/home/ybZhang/miniconda3/envs/whister/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/ybZhang/miniconda3/envs/whister/lib/python3.8/site-packages/transformers/generation/utils.py", line 1213, in generate
    self._validate_model_kwargs(model_kwargs.copy())
  File "/home/ybZhang/miniconda3/envs/whister/lib/python3.8/site-packages/transformers/generation/utils.py", line 1105, in _validate_model_kwargs
    raise ValueError(
ValueError: The following `model_kwargs` are not used by the model: ['condition_on_previous_text'] (note: typos in the generate arguments will also show up in this list)
ArthurZucker commented 1 year ago

This is not yet available in the HuggingFace implementation. The PR is currently in progress; see #21491.
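
In the meantime, a minimal workaround sketch (an assumption on my side, since the reference openai-whisper package exposes this option on transcribe()):

import whisper

# Sketch: the reference openai-whisper package accepts condition_on_previous_text
# directly as a transcribe() option (assumes `pip install openai-whisper`).
model = whisper.load_model("small")
result = model.transcribe("audio.wav", language="zh", task="transcribe", condition_on_previous_text=False)
print(result["text"])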

ben-8878 commented 1 year ago

@ArthurZucker Actually, I still have the problem described above (transformers 4.27.2). When I recognize a WAV file with either the transformers pipeline or the whisper pipeline, everything is normal. But when I use a microphone to record Chinese audio and send the bytes to the transformers pipeline server, the result is abnormal (not a Chinese transcription; for example "You." and other English words). When I send the same microphone bytes to a whisper pipeline server, everything is normal, so I'm confused.

ArthurZucker commented 1 year ago

Can you show me exactly how you are sending data to the transformers pipeline server, so that I can check how you are calling the model?

ben-8878 commented 1 year ago

Can you show me exactly how you are sending data to the transformers pipeline server, so that I can check how you are calling the model?

@ArthurZucker My transformers pipeline server code is as follows; the received byte data comes from a web client recording voice through the browser microphone:

import json
import logging
import os
import sys
import uuid

import whisper

def forward(model, audio_bytes):
    # Run the ASR pipeline on the raw audio bytes, forcing Chinese transcription.
    text = model(audio_bytes, chunk_length_s=30, generate_kwargs={"task": "transcribe", "language": "<|zh|>"})['text']
    return text

async def recognize(websocket, path):
    global model
    global args
    global loop
    global pool
    global vad
    #global seg_model
    rec = None
    phrase_list = None
    sample_rate = args.sample_rate
    client_ip = websocket.remote_address
    last_message = ""
    audio_bytes = b''
    bytesdata = b''
    wavdir = "./audiodata"
    uid = str(uuid.uuid1())
    filename = str(client_ip[0])+"_"+uid
    filepath = os.path.join(wavdir, filename+".wav")
    wfile = open(filepath,"wb+")
    phrase_timeout = 4
    max_timeout = 20
    audio_format = "wav"
    channel = 1
    samplewidth = 16

    logging.info('Connection from %s', websocket.remote_address)
    while True:
        message = await websocket.recv()
        if isinstance(message, str):
           if message == '{"eof":1}':
              if len(audio_bytes):
                 if audio_format != "wav":
                    audio_bytes = bytes2wav(audio_bytes, audio_format, sample_rate, channel, samplewidth)
                 else:
                    pass
                 response = await loop.run_in_executor(pool, forward, model, audio_bytes)
                 response = format_result(response)
                 print("last"+response)
                 await websocket.send(response)
              else:
                 await websocket.send("")
              break
           elif "samplerate" in message  and "format" in message:
              try:
                json_str = json.loads(message)
                sample_rate = json_str["samplerate"]
                audio_format = json_str["format"]
                samplewidth = json_str["samplewidth"]
                await websocket.send("")
              except:
                await websocket.send("wrong format")
           else:
              await websocket.send("")
        else:
           audio_bytes += message
           # Estimate seconds of audio: bytes / 2 bytes per 16-bit sample / sample rate.
           audiotime = len(audio_bytes) / 2 / int(sample_rate)
           if audiotime  >  max_timeout :
              if audio_format != "wav":
                 audio_bytes = bytes2wav(audio_bytes, audio_format, sample_rate, channel, samplewidth)
              else:
                 pass
              response = await loop.run_in_executor(pool, forward, model, audio_bytes)
              response =  format_result(response)
              print("first"+response)
              audio_bytes = b''
              await websocket.send(response)
           else:
              await websocket.send("")
def start():

    global model
    global args
    global loop
    global pool
    global vad
    logging.basicConfig(level=logging.INFO)

    args = type('', (), {})()

    args.interface = os.environ.get('SERVER_INTERFACE', '0.0.0.0')
    args.port = int(os.environ.get('SERVER_PORT', 40000))
    args.model_path = os.environ.get('MODEL_PATH', 'model')
    #args.seg_model_path = os.environ.get('VOSK_MODEL_PATH', 'seg_model')
    args.sample_rate = float(os.environ.get('SAMPLE_RATE', 16000))

    if len(sys.argv) > 1:
       args.model_path = sys.argv[1]
       #args.seg_model_path = sys.argv[2]
    model = whisper.load_model(args.model_path,device="cpu")
ben-8878 commented 1 year ago

I have confirmed that the ffmpeg_read function (which reads the audio bytes) has a problem. I replaced it with the function provided by whisper, and now everything is normal (both WAV file and mic stream).
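
For anyone hitting the same issue, a minimal replacement sketch (an assumption, not the exact function used above): decode the WAV bytes with soundfile and hand the pipeline a dict instead of raw bytes, which bypasses ffmpeg_read entirely:

import io
import soundfile as sf

def wav_bytes_to_input(audio_bytes):
    # Decode WAV bytes into a float32 waveform; the ASR pipeline accepts
    # a {"raw": array, "sampling_rate": int} dict directly.
    data, sample_rate = sf.read(io.BytesIO(audio_bytes), dtype="float32")
    return {"raw": data, "sampling_rate": sample_rate}

result = transcriber(wav_bytes_to_input(audio_bytes), chunk_length_s=30)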

ArthurZucker commented 1 year ago

Okay, sorry if I don't understand completely; I don't see forward2 being called or passed anywhere, right?

ben-8878 commented 1 year ago

Okay, sorry if I don't understand completely; I don't see forward2 being called or passed anywhere, right?

I have updated it; forward2 should be forward.

ArthurZucker commented 1 year ago

Ok, 2 things we need to check:

  1. When calling the pipeline, could you check that pipeline.model.generation_config.forced_decoder_ids is properly updated with the language and the task? (A quick sketch of this check follows below.)
  2. Can you also print the language output by the generation process? (decode_asr, called in the pipeline for whisper, should output the language detected by the model, which could help us understand whether the decoding went well.)
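
A minimal sketch of the first check (variable names are illustrative; it assumes a pipeline like the ones earlier in the thread):

from transformers import pipeline

pipe = pipeline(task="automatic-speech-recognition", model="openai/whisper-medium")

# Check 1: are the language and task actually forced?
forced = pipe.model.generation_config.forced_decoder_ids
print(forced)
if forced:
    # Decode the forced token ids back to text to see which language token is set.
    print(pipe.tokenizer.decode([token_id for _, token_id in forced]))
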
ben-8878 commented 1 year ago

Ok, 2 things we need to check:

  1. When calling the pipeline, could you check that pipeline.model.generation_config.forced_decoder_ids is properly updated with the language and the task?

Yes, I set it as follows:

model = pipeline(task="automatic-speech-recognition", model="openai/whisper-medium", device="cpu")
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="zh", task="transcribe")

  2. Can you also print the language output by the generation process? (decode_asr, called in the pipeline for whisper, should output the language detected by the model, which could help us understand whether the decoding went well.)

Sorry, actually the problem happens when I use the transformers pipeline; when I use the official whisper pipeline, all is OK.

ArthurZucker commented 1 year ago

Okay! After re-reading your issue, I think you said

I have confirmed that the ffmpeg_read function (which reads the audio bytes) has a problem. I replaced it with the function provided by whisper, and now everything is normal (both WAV file and mic stream).

So this means we should probably update our ffmpeg_read function. Is that right?

ben-8878 commented 1 year ago

Okay! After re-reading your issue, I think you said

I have confirmed that the ffmpeg_read function (which reads the audio bytes) has a problem. I replaced it with the function provided by whisper, and now everything is normal (both WAV file and mic stream).

So this means we should probably update our ffmpeg_read function. Is that right?

Yes, transformers' ffmpeg_read was the cause of my problem.

ben-8878 commented 1 year ago

Can we now use the "fp16" and "condition_on_previous_text" parameters?

ArthurZucker commented 1 year ago

fp16, load_in_8bit, and the JAX models if you want faster inference, yes. For conditioning on previous text, the update on that feature is in #21491!

ben-8878 commented 1 year ago

How do I use "fp16" and "load_in_8bit"? Is there sample code?

ArthurZucker commented 1 year ago

For loading in 8-bit you need accelerate and bitsandbytes:

from transformers import WhisperForConditionalGeneration
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small", load_in_8bit=True)

For fp16:

import torch
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small", torch_dtype=torch.float16)
ben-8878 commented 1 year ago

I tried it with model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small", torch_dtype=torch.float16) and got this error: RuntimeError: Input type (torch.FloatTensor) and weight type (torch.HalfTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor

ArthurZucker commented 1 year ago

The input (the audio features) should also be cast to half precision.
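
A minimal sketch of what that looks like with the bare model (assuming a 16 kHz numpy waveform in `audio`; names are illustrative):

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small", torch_dtype=torch.float16)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.to(torch.float16)  # match the model's weight dtype
generated_ids = model.generate(input_features)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))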

sanchit-gandhi commented 1 year ago

Note that load_in_8bit will give you a nice memory saving (~30%) but will run slower than fp16. This is likely due to the bitsandbytes 8-bit matmul algorithm, which isn't super optimised for "small" tensors but rather is designed more for super large LMs.