k2-fsa / sherpa

Speech-to-text server framework with next-gen Kaldi
https://k2-fsa.github.io/sherpa
Apache License 2.0

Looking for complete conversion from pretrained huggingface model #611

Open lionsheep24 opened 2 weeks ago

lionsheep24 commented 2 weeks ago

Hello, I have pretrained a model with Hugging Face and attempted to deploy it using the TRTLLM-Triton Server method as documented here. However, I've noticed that the transcription results differ significantly from the original model's performance when using the Transformers pipeline.

Upon further investigation, I compared the mel spectrograms and the decoding results between the TRTLLM implementation and the original pipeline. Both comparisons showed noticeable differences, leading to degraded transcription accuracy in the TRTLLM implementation. In some cases, it even returned a blank string.

Let me share my pipeline implementation:

import numpy as np
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

model_ckpt = "/workspace/models/whisper-large-v2"  # fine-tuned HF checkpoint (path as in the build script below)
torch_dtype = torch.float16
processor = AutoProcessor.from_pretrained(model_ckpt)

batch_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_ckpt,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    device_map="cuda:0",
)
asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=batch_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=False,
    torch_dtype=torch_dtype,
    generate_kwargs={"language": "ko", "num_beams": 1, "do_sample": False},
)
transcription_result = asr_pipeline(audio_array)  # audio_array: np.ndarray, 16 kHz mono

Could anyone help me understand why these discrepancies are occurring and how to resolve them?

Thank you in advance for your assistance

csukuangfj commented 2 weeks ago

@yuekaizhang Could you have a look at this issue?

lionsheep24 commented 2 weeks ago

Let me share my build script for trt-llm.

1. Save the HF model
model = AutoModel.from_pretrained(model_name, use_safetensors=True).half()
model.save_pretrained("/workspace/models/whisper-large-v2")  # save the fp16 HF checkpoint

2. Convert to OpenAI format
python3 convert_from_distil_whisper.py --model_name /workspace/models/whisper-large-v2 --output_dir /workspace/models/whisper-openai --output_name large-v2

3. Convert to TensorRT-LLM checkpoint
python3 convert_checkpoint.py --model_dir /workspace/models/whisper-openai --output_dir /workspace/models/whisper-tensorrt-llm --model_name large-v2

4. Build the TensorRT-LLM engines
trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/encoder --output_dir /workspace/models/whisper-tensorrt-llm/1/encoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_batch_size 4 --gemm_plugin disable --bert_attention_plugin float16 --remove_input_padding disable
trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/decoder --output_dir /workspace/models/whisper-tensorrt-llm/1/decoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_beam_width 1 --max_batch_size 4 --max_output_len 100 --max_input_len 1024 --max_encoder_input_len 1500 --gemm_plugin float16 --bert_attention_plugin float16 --gpt_attention_plugin float16 --remove_input_padding disable
trtllm-build --checkpoint_dir /workspace/models/whisper-tensorrt-llm/encoder --output_dir /workspace/models/1/encoder --paged_kv_cache disable --moe_plugin disable --enable_xqa disable --use_custom_all_reduce disable --max_batch_size 16 --gemm_plugin disable --bert_attention_plugin float16 --remove_input_padding disable
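
For reference, here is the quick sanity check I'd run on the step-2 output before building the engines (a sketch; the large-v2.pt filename and the dims/model_state_dict layout of an OpenAI-style checkpoint are assumptions on my side):

import torch

# Assumed output of step 2: an OpenAI-style Whisper checkpoint, i.e. a .pt containing
# "dims" and "model_state_dict"; the filename follows --output_name large-v2 above.
ckpt = torch.load("/workspace/models/whisper-openai/large-v2.pt", map_location="cpu")
print(ckpt["dims"])                   # large-v2 should report n_mels=80, n_audio_state=1280, ...
print(len(ckpt["model_state_dict"]))  # number of weight tensors that were converted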

yuekaizhang commented 2 weeks ago

@lionsheep24 https://github.com/k2-fsa/sherpa/issues/597#issuecomment-2146719866, check this. You may need to align the prompt, beam_size, and other hyper-parameters to get the same outputs.

There are several successful integrations of Whisper TRT-LLM you may refer to, e.g. https://github.com/Wordcab/wordcab-transcribe/tree/main/src/wordcab_transcribe/engines/tensorrt_llm. Your export steps also look good to me.
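
For example, you can dump the decoder prompt ids the HF side actually uses and compare them with what you feed to TRT-LLM; a rough sketch (the checkpoint path is a placeholder):

from transformers import WhisperProcessor

# Placeholder path; point this at your fine-tuned checkpoint.
processor = WhisperProcessor.from_pretrained("/workspace/models/whisper-large-v2")
prompt_ids = processor.get_decoder_prompt_ids(language="ko", task="transcribe", no_timestamps=True)
print(prompt_ids)  # (position, token_id) pairs that follow <|startoftranscript|>
print(processor.tokenizer.convert_ids_to_tokens([tok for _, tok in prompt_ids]))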

lionsheep24 commented 2 weeks ago

@yuekaizhang I'm using the <|startoftranscript|><|ko|><|transcribe|><|notimestamps|> prompt with a beam_size of 1, and I found differences in the mel spectrograms extracted from the same audio array between the HF way and the OpenAI way.

Do you mean the decoding results should be the same even from different audio features? There were values of -0.74171734 in the HF features where the corresponding OpenAI-way values were 0.

I switched the compute_feature function to the HF WhisperFeatureExtractor, but the tokenizer throws OverflowError: out of range integral type conversion because the decoding result contains a -1 token.

I reviewed the link you shared, but it seems to be similar to the current repo.

I'm not sure how the transcription results can be the same even though the extracted features are different.
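
For completeness, this is roughly how I compared the two feature paths (a sketch; I'm using the openai-whisper package as a stand-in for the feature extractor in the TRT-LLM example, and the 16 kHz / 80-mel settings are assumptions):

import numpy as np
import torch
import whisper  # openai-whisper, standing in for the OpenAI-way feature extractor
from transformers import WhisperFeatureExtractor

audio = np.random.randn(16000 * 5).astype(np.float32)  # stand-in for the real audio array

fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v2")
hf_mel = fe(audio, sampling_rate=16000, return_tensors="pt").input_features[0]  # (80, 3000)

oa_mel = whisper.log_mel_spectrogram(whisper.pad_or_trim(torch.from_numpy(audio)), n_mels=80)  # (80, 3000)

print("max abs diff:", (hf_mel - oa_mel).abs().max().item())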

lionsheep24 commented 1 week ago

Hi all! Any updates here?

I am curious about why the audio features extracted from the same audio array differ when using the Hugging Face library compared to the method provided in this repository.

Additionally, I want to confirm whether it is expected for the values to differ. In my opinion, even if the model is converted, the input audio features should be the same.

When I input the features extracted using the Hugging Face library into the TensorRT-LLM engine, I received a -1 token (which is different from the Hugging Face pipeline result), and this seems to have caused an error during decoding.
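
As a side note, the decoding error itself can be sidestepped by dropping invalid ids before calling the tokenizer (a sketch; the example ids are made up), though that only hides the symptom of the -1 token:

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-large-v2")  # placeholder checkpoint
output_ids = [50258, 50264, 50359, 50363, -1, -1]  # made-up engine output containing -1 markers
valid_ids = [i for i in output_ids if i >= 0]      # HF tokenizers only accept non-negative ids
print(tokenizer.decode(valid_ids, skip_special_tokens=True))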

Feel free to let me know if any additional information is needed!

yuekaizhang commented 6 days ago

> the audio features extracted from the same audio array differ when using the Huggingface library compared to the method provided in this repository.

Theoretically, a minor difference in feature values would not affect the transcription results. We actually support Hugging Face Distil-Whisper in TensorRT-LLM, which was trained with the Hugging Face feature extractor, yet it works with our feature extractor at inference time.

You may try replacing the feature extractor if you think that is the root cause. @lionsheep24
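
If you want to check this directly, you could also run the same HF model on both feature variants and compare the transcripts; a rough sketch (the paths, fp16 dtype, and the openai-whisper stand-in for our feature extractor are assumptions):

import numpy as np
import torch
import whisper  # openai-whisper, standing in for the feature extractor used by the TRT-LLM example
from transformers import AutoModelForSpeechSeq2Seq, WhisperProcessor

model_ckpt = "/workspace/models/whisper-large-v2"  # placeholder: your fine-tuned HF checkpoint
processor = WhisperProcessor.from_pretrained(model_ckpt)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_ckpt, torch_dtype=torch.float16).to("cuda:0")

audio = np.random.randn(16000 * 5).astype(np.float32)  # stand-in for the real audio array

hf_mel = processor.feature_extractor(audio, sampling_rate=16000, return_tensors="pt").input_features
oa_mel = whisper.log_mel_spectrogram(whisper.pad_or_trim(torch.from_numpy(audio)), n_mels=80)[None]

forced = processor.get_decoder_prompt_ids(language="ko", task="transcribe")
for name, mel in [("hf", hf_mel), ("openai", oa_mel)]:
    ids = model.generate(mel.to("cuda:0", torch.float16), forced_decoder_ids=forced, num_beams=1)
    print(name, processor.batch_decode(ids, skip_special_tokens=True)[0])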

lionsheep24 commented 5 days ago

Yeah, I calculated the difference between the features from Hugging Face and from the TensorRT-LLM example, and the absolute difference was up to 0.74. I don't think that is a minor difference.

I tried replacing the feature extractor with the Hugging Face one and feeding its features to TensorRT-LLM, but I got a -1 token from the engine, as I mentioned earlier. @yuekaizhang