ben-8878 closed this issue 1 year ago.
cc @ArthurZucker and @sanchit-gandhi, though this question would be more appropriate for the forums.
Hey, this is related to the update of the `generate()` function. The issue is that you are not modifying `model.generation_config`. If you want to set the language in a proper manner, the following will work:

```python
from transformers import pipeline

transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-large", device=1)
transcriber.model.generation_config.forced_decoder_ids = transcriber.processor.get_decoder_prompt_ids(language="zh", task="transcribe")

result = transcriber(audio_bytes, chunk_length_s=30)
print(result)
```
We updated the generation config, which by default should automatically detect the language, but it is set to `translate` and not `transcribe`.
cc @sanchit-gandhi for visibility, this was introduced by #20388
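A quick way to inspect what a checkpoint's generation config forces by default (a sketch; the exact contents depend on the checkpoint and your transformers version):

```python
from transformers import GenerationConfig

# Load the generation config shipped with the checkpoint and look at the
# (position, token_id) pairs it forces at the start of decoding.
generation_config = GenerationConfig.from_pretrained("openai/whisper-large")
print(generation_config.forced_decoder_ids)
```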
Resolved in https://github.com/huggingface/transformers/pull/21965 - Whisper now respects the `config.forced_decoder_ids` if the language is not set in the args / `generation_config`. The most up-to-date way of passing the language is to use the args if possible:

```python
result = transcriber(audio_bytes, chunk_length_s=30, generate_kwargs={"language": "zh"})
```
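A fuller, self-contained version of the same call might look like this (a sketch; `audio.wav` is a placeholder for your own file, and a transformers version >= 4.27 is assumed, as discussed below):

```python
from transformers import pipeline

transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-large")
# Pass the language (and, optionally, the task) straight through generate_kwargs.
result = transcriber("audio.wav", chunk_length_s=30, generate_kwargs={"language": "zh", "task": "transcribe"})
print(result["text"])
```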
@sanchit-gandhi I upgraded transformers to version 4.27.1 and tried it again, but I get the following error:

```
  f"Unsupported language: {self.language}. Language should be one of:"
  File "/home/ybZhang/miniconda3/envs/whister/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1177, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'WhisperForConditionalGeneration' object has no attribute 'language'
```
@sgugger another thing is that when using the pipeline I get a translation result, not a transcription result. How do I specify the transcription task and the language with the pipeline?
```python
from transformers import pipeline

transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-small")
transcriber("https://huggingface.co/datasets/Narsil/asr_dummy/resolve/main/mlk.flac")
```
I am sorry but I can't reproduce your errors. The following notebook has examples of setting the task and the language on whisper-small, and both work. Did you run `pip install --upgrade transformers`? Here is my output (so expected behaviour).
@ArthurZucker yes, I have run `pip install --upgrade transformers` and I followed https://colab.research.google.com/drive/1rS1L4YSJqKUH_3YxIQHBI982zso23wor#scrollTo=vqXoVLesTUE6, but I still get an error:

```
  self._validate_model_kwargs(model_kwargs.copy())
  File "/home/ybZhang/miniconda3/envs/whister/lib/python3.8/site-packages/transformers/generation/utils.py", line 1090, in _validate_model_kwargs
    raise ValueError(
ValueError: The following `model_kwargs` are not used by the model: ['task', 'language'] (note: typos in the generate arguments will also show up in this list)
```
@ArthurZucker when transformers 4.26.1 was the latest version I tried it and it failed. Now that I have updated to 4.27.2, it works.
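For anyone hitting the same `model_kwargs` error, a quick way to confirm which version is actually being imported (a sketch; the 4.27 threshold is taken from this thread):

```python
import transformers

# The `language` / `task` generate_kwargs only work on recent versions (4.27+ per this thread).
print(transformers.__version__)
```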
@ArthurZucker how do I modify the parameter `condition_on_previous_text`? This parameter is provided by whisper and it is important for me.
File "/home/ybZhang/miniconda3/envs/whister/lib/python3.8/site-packages/transformers/models/whisper/modeling_whisper.py", line 1606, in generate
return super().generate(
File "/home/ybZhang/miniconda3/envs/whister/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/ybZhang/miniconda3/envs/whister/lib/python3.8/site-packages/transformers/generation/utils.py", line 1213, in generate
self._validate_model_kwargs(model_kwargs.copy())
File "/home/ybZhang/miniconda3/envs/whister/lib/python3.8/site-packages/transformers/generation/utils.py", line 1105, in _validate_model_kwargs
raise ValueError(
ValueError: The following `model_kwargs` are not used by the model: ['condition_on_previous_text'] (note: typos in the generate arguments will also show up in this list)
This is not yet available in the Hugging Face implementation. The PR is currently ongoing, see #21491.
@ArthurZucker actually, I still have some problems as above (transformers 4.27.2). When I use the transformers pipeline or the whisper pipeline to recognize a wav file, everything is normal. But when I use a microphone to record Chinese audio bytes and send the bytes to a transformers pipeline server, the result is abnormal (not a Chinese recognition result, for example "You." and other English words). When I send the same microphone bytes to a whisper pipeline server, the result is normal, so I'm confused.
Can you show me exactly how you are sending to the transformers pipeline server, so that I can check how you are calling the model?
@ArthurZucker my transformers pipeline server code is as follows; the received bytes come from a web client recording voice through the browser microphone:
```python
# Helper functions bytes2wav() and format_result() are defined elsewhere in the
# original server; the globals loop, pool and vad are also initialised elsewhere.
import json
import logging
import os
import sys
import uuid

import whisper


def forward(model, audio_bytes):
    #print(len(audio_bytes))
    text = model(audio_bytes, chunk_length_s=30, generate_kwargs={"task": "transcribe", "language": "<|zh|>"})['text']
    return text


async def recognize(websocket, path):
    global model
    global args
    global loop
    global pool
    global vad
    #global seg_model
    rec = None
    phrase_list = None
    sample_rate = args.sample_rate
    client_ip = websocket.remote_address
    last_message = ""
    audio_bytes = b''
    bytesdata = b''
    wavdir = "./audiodata"
    uid = str(uuid.uuid1())
    filename = str(client_ip[0]) + "_" + uid
    filepath = os.path.join(wavdir, filename + ".wav")
    wfile = open(filepath, "wb+")
    phrase_timeout = 4
    max_timeout = 20
    audio_format = "wav"
    channel = 1
    samplewidth = 16
    logging.info('Connection from %s', websocket.remote_address)
    while True:
        message = await websocket.recv()
        if isinstance(message, str):
            if message == '{"eof":1}':
                if len(audio_bytes):
                    if audio_format != "wav":
                        audio_bytes = bytes2wav(audio_bytes, audio_format, sample_rate, channel, samplewidth)
                    else:
                        pass
                    response = await loop.run_in_executor(pool, forward, model, audio_bytes)
                    response = format_result(response)
                    print("last" + response)
                    await websocket.send(response)
                else:
                    await websocket.send("")
                break
            elif "samplerate" in message and "format" in message:
                try:
                    json_str = json.loads(message)
                    sample_rate = json_str["samplerate"]
                    audio_format = json_str["format"]
                    samplewidth = json_str["samplewidth"]
                    await websocket.send("")
                except:
                    await websocket.send("wrong format")
            else:
                await websocket.send("")
        else:
            audio_bytes += message
            #audiotime = audio_length(audio_bytes, audio_format, sample_rate, channel, samplewidth)
            audiotime = len(audio_bytes) / 2 / int(sample_rate)
            #print(audiotime)
            if audiotime > max_timeout:
                if audio_format != "wav":
                    audio_bytes = bytes2wav(audio_bytes, audio_format, sample_rate, channel, samplewidth)
                else:
                    pass
                response = await loop.run_in_executor(pool, forward, model, audio_bytes)
                response = format_result(response)
                print("first" + response)
                audio_bytes = b''
                await websocket.send(response)
            else:
                await websocket.send("")


def start():
    global model
    global args
    global loop
    global pool
    global vad
    logging.basicConfig(level=logging.INFO)
    args = type('', (), {})()
    args.interface = os.environ.get('SERVER_INTERFACE', '0.0.0.0')
    args.port = int(os.environ.get('SERVER_PORT', 40000))
    args.model_path = os.environ.get('MODEL_PATH', 'model')
    #args.seg_model_path = os.environ.get('VOSK_MODEL_PATH', 'seg_model')
    args.sample_rate = float(os.environ.get('SAMPLE_RATE', 16000))
    if len(sys.argv) > 1:
        args.model_path = sys.argv[1]
        #args.seg_model_path = sys.argv[2]
    model = whisper.load_model(args.model_path, device="cpu")
```
I have confirmed that the `ffmpeg_read` function (which reads audio bytes) has some problem. I replaced it with the function provided by whisper and everything is normal (both wav file and mic stream).
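For reference, a minimal sketch of decoding raw audio bytes with ffmpeg, similar to what `whisper.audio.load_audio` does for files; the exact replacement used above is not shown in the thread, so `decode_audio_bytes` is a hypothetical helper:

```python
import subprocess

import numpy as np


def decode_audio_bytes(audio_bytes, sample_rate=16000):
    # Pipe the raw bytes through ffmpeg, resampling to 16 kHz mono 16-bit PCM.
    process = subprocess.run(
        ["ffmpeg", "-i", "pipe:0", "-f", "s16le", "-ac", "1",
         "-acodec", "pcm_s16le", "-ar", str(sample_rate), "pipe:1"],
        input=audio_bytes,
        capture_output=True,
        check=True,
    )
    # Convert the s16le PCM stream into the float32 array whisper expects.
    return np.frombuffer(process.stdout, np.int16).astype(np.float32) / 32768.0
```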
Okay, sorry if I don't understand completely, but I don't see `forward2` being called or passed anywhere, right?
I updated it; `forward2` should be `forward`.
Ok, 2 things we need to check:

- When calling the pipeline, could you check that `pipeline.model.generation_config.forced_decoder_ids` is properly updated with the `language` and the `task`? (A quick way to run this check is sketched after this list.)
- Can you also print the `language` that should be output by the generation process? (`decode_asr`, called in the pipeline for whisper, should output the language that is detected by the model, which could help us understand if the decoding process went well.)
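For instance, the first check could look like this (a sketch, reusing the `transcriber` pipeline object from earlier in the thread):

```python
# Print the (position, token_id) pairs the pipeline will force at generation time,
# then decode the token ids so the language/task tokens are human readable.
forced = transcriber.model.generation_config.forced_decoder_ids
print(forced)
if forced is not None:
    print(transcriber.tokenizer.convert_ids_to_tokens([tok for _, tok in forced]))
```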
> When calling the pipeline, could you check that `pipeline.model.generation_config.forced_decoder_ids` is properly updated with the `language` and the `task`?
yes, I set it as follows:

```python
model = pipeline(task="automatic-speech-recognition", model="openai/whisper-medium", device="cpu")
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="zh", task="transcribe")
```
> Can you also print the `language` that should be output by the generation process? (`decode_asr`, called in the pipeline for whisper, should output the language that is detected by the model, which could help us understand if the decoding process went well.)
Sorry, actually when I use the transformers pipeline I hit the problem, and when I use the official whisper pipeline everything is ok.
Okay! After re-reading your issue, I think you said:

> I have confirmed that the ffmpeg_read function (which reads audio bytes) has some problem and I replaced it with the function whisper provides; everything is normal (both wav file and mic stream)

So this means we should probably update our `ffmpeg_read` function. Is that right?
yes, transformers' `ffmpeg_read` leads to my problem.
Now can we use the parameters `fp16` and `condition_on_previous_text`?
`fp16`, `load_in_8bit`, and the JAX models if you want faster inference: yes. For conditioning on previous text, the update on that feature is here: #21491!
how to use "fp16, load_in_8_bits", has sample codes?
For load in 8 bit you need `accelerate` and `bitsandbytes`:

```python
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small", load_in_8bit=True)
```

and for `fp16`:

```python
import torch
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small", torch_dtype=torch.float16)
```
I tried it with

```python
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small", torch_dtype=torch.float16)
```

and get the error:

```
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.HalfTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
```
The input should also be cast to half precision (the audio features).
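To make that fix concrete, here is a minimal sketch (assuming a CUDA GPU; `audio_array` is a placeholder for a 16 kHz mono float32 numpy array):

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
# Load the weights in float16 and move them to the GPU.
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-small", torch_dtype=torch.float16
).to("cuda")

inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
# Cast the audio features to float16 as well, so input and weights match.
input_features = inputs.input_features.to("cuda", dtype=torch.float16)

predicted_ids = model.generate(input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))
```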
Note that `load_in_8bit` will give you a nice memory saving (~30%) but will run slower than fp16. This is likely due to the bitsandbytes 8-bit matmul algorithm, which isn't super optimised for "small" tensors but rather is designed more for super large LMs.
When adding the above code to an ASR server, I send Chinese audio data but get an English result. I don't know how to set the language, and when I tried to use `forced_decoder_ids` to set the language, it failed. My transformers version is 4.26.1.