huggingface / candle

Minimalist ML framework for Rust
Apache License 2.0
15.33k stars, 896 forks

Error: RIFF file type not "WAVE" Whisper #631

Closed noahgift closed 1 year ago

noahgift commented 1 year ago

Hi,

I have been able to get the CUDA, cuDNN, and CPU Whisper examples to work fine, except when I try to pass in an audio file. Are CLI inputs supported yet?

ubuntu@ip-172-31-21-63:~$ wget https://huggingface.co/datasets/Narsil/candle_demo/blob/main/samples_jfk.wav
ubuntu@ip-172-31-21-63:~/candle$ cargo run --features cuda --example whisper -- --task transcribe --input ../samples_jfk.wav 
   Compiling cudarc v0.9.14
   Compiling candle-examples v0.2.0 (/home/ubuntu/candle/candle-examples)
   Compiling candle-kernels v0.2.0 (/home/ubuntu/candle/candle-kernels)
   Compiling candle-core v0.2.0 (/home/ubuntu/candle/candle-core)
   Compiling candle-nn v0.2.0 (/home/ubuntu/candle/candle-nn)
   Compiling candle-datasets v0.2.0 (/home/ubuntu/candle/candle-datasets)
   Compiling candle-transformers v0.2.0 (/home/ubuntu/candle/candle-transformers)
    Finished dev [unoptimized + debuginfo] target(s) in 14.57s
     Running `target/debug/examples/whisper --task transcribe --input ../samples_jfk.wav`
Error: RIFF file type not "WAVE"
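For reference, the error comes from the WAV container check: a valid WAV file begins with a 12-byte RIFF header, the ASCII tag `RIFF` at bytes 0-3, a 4-byte chunk length, then the form type `WAVE` at bytes 8-11. A quick shell sketch of the same check (not the candle code, just the idea; file names are throwaway temporaries):

```shell
# A valid WAV starts with "RIFF" at byte 0 and "WAVE" at bytes 8-11;
# anything else (e.g. an HTML page saved by wget) fails this check.
check_wav() {
  f="$1"
  if [ "$(head -c 4 "$f")" = "RIFF" ] && \
     [ "$(dd if="$f" bs=1 skip=8 count=4 2>/dev/null)" = "WAVE" ]; then
    echo "looks like a WAV"
  else
    echo "not a WAV"
  fi
}

# Minimal fake files to demonstrate:
printf 'RIFF\x24\x00\x00\x00WAVEfmt ' > /tmp/fake.wav
printf '<!DOCTYPE html><html></html>' > /tmp/fake.html
check_wav /tmp/fake.wav   # -> looks like a WAV
check_wav /tmp/fake.html  # -> not a WAV
```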
LaurentMazare commented 1 year ago

I think your wget is grabbing the html page rather than the wav file itself :) Note that if you don't pass any input argument, the download will happen automatically from the hub.
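Concretely, the `/blob/` path in the wget URL above points at the Hub's HTML viewer page; the raw file lives under `/resolve/` instead. A small sketch of the fix:

```shell
# /blob/ serves the HTML file viewer; /resolve/ serves the raw bytes.
viewer_url="https://huggingface.co/datasets/Narsil/candle_demo/blob/main/samples_jfk.wav"
raw_url=$(printf '%s\n' "$viewer_url" | sed 's|/blob/|/resolve/|')
echo "$raw_url"
# then: wget "$raw_url"
```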

noahgift commented 1 year ago

> I think your wget is grabbing the html page rather than the wav file itself :) Note that if you don't pass any input argument, the download will happen automatically from the hub.

Thanks, this is what I get for living in the shell and not bothering to open the links I download :)

Verified this flag does work:

cargo run --features cuda --example whisper -- --task transcribe --input ../samples_jfk.wav

I manually went to the webpage, then scp'd it:

scp -i ~/Downloads/llmops.pem /Users/noahgift/Downloads/samples_jfk.wav ubuntu@54.81.160.62:~/

➜ ~ du -sh /Users/noahgift/Downloads/samples_jfk.wav
344K    /Users/noahgift/Downloads/samples_jfk.wav

ubuntu@ip-172-31-21-63:~/candle$ cargo run --features cuda --example whisper -- --task transcribe --input ../samples_jfk.wav
    Finished dev [unoptimized + debuginfo] target(s) in 0.14s
     Running `target/debug/examples/whisper --task transcribe --input ../samples_jfk.wav`
loaded wav data: Header { audio_format: 1, channel_count: 1, sampling_rate: 16000, bytes_per_second: 32000, bytes_per_sample: 2, bits_per_sample: 16 }
pcm data loaded 176000
loaded mel: [1, 80, 3000]
0.0s -- 30.0s:  And so my fellow Americans ask not what your country can do for you ask what you can do for your country

The reason I asked is that a previous file I had used with the Python whisper.py seems to act wonky.

ubuntu@ip-172-31-21-63:~/candle$ cargo run --features cuda --example whisper -- --task transcribe --input ../four-score.wav 
    Finished dev [unoptimized + debuginfo] target(s) in 0.14s
     Running `target/debug/examples/whisper --task transcribe --input ../four-score.wav`
loaded wav data: Header { audio_format: 1, channel_count: 2, sampling_rate: 16000, bytes_per_second: 64000, bytes_per_sample: 4, bits_per_sample: 16 }
pcm data loaded 636224
loaded mel: [1, 80, 6000]
no speech detected, skipping 3000 DecodingResult { tokens: [50257, 50358, 50362, 314, 1101, 8066, 467, 329, 257, 1178, 286, 262, 986, 50256], text: " I'm gonna go for a few of the...", avg_logprob: -1.7409689713691172, no_speech_prob: 0.6462810635566711, temperature: 0.0, compression_ratio: NaN }
30.0s -- 60.0s:  I'm going to be sure that I'm going to be sure that I'm going to be sure that [same phrase repeats for the rest of the segment]

I wasn't sure if I needed to set some other defaults, so I was digging into the clap code here.

My original file was here

No worries on my end, I can dig into settings and flags and see if I can figure out what is going on.
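One difference worth noting between the two headers above: the jfk sample that transcribed cleanly is mono (`channel_count: 1`), while four-score.wav is stereo (`channel_count: 2`). If the repetition is related to the stereo input, downmixing to the same 16 kHz mono format is a cheap thing to try. A sketch of the conversion command (filenames are from the thread, the output name is hypothetical, and this assumes ffmpeg is installed):

```shell
# -ac 1 downmixes stereo to mono and -ar 16000 resamples to 16 kHz,
# matching the header of the sample that transcribed cleanly.
input="four-score.wav"
output="four-score-mono.wav"
cmd="ffmpeg -i $input -ac 1 -ar 16000 $output"
echo "$cmd"   # run this once ffmpeg is available
```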

LaurentMazare commented 1 year ago

Agreed that the whisper output is pretty disappointing on your example. The whisper setup is pretty involved and tweaking the flags may help, though in that case I didn't get anywhere. I would just suggest increasing the model size, as it usually helps.

cargo run --example whisper --profile=release-with-debug -- --input ~/Downloads/four-score.wav --model medium.en --task transcribe
loaded wav data: Header { audio_format: 1, channel_count: 2, sampling_rate: 16000, bytes_per_second: 64000, bytes_per_sample: 4, bits_per_sample: 16 }
pcm data loaded 636224
loaded mel: [1, 80, 6000]
0.0s -- 30.0s:  Fast forward and seven years ago our fathers bore forth on this continent a new nation conceived of liberty and dedicated to the proposition that all men are created equal.
30.0s -- 60.0s:  We are engaged in a great civil war. Testing whether that mission or any mission

Besides this, it would be great if you have the OpenAI whisper version at hand to give it a shot; if it performs well, I can have a look at trying to understand the discrepancies.

noahgift commented 1 year ago

> Agreed that the whisper output is pretty disappointing on your example. The whisper setup is pretty involved and tweaking the flags may help, though in that case I didn't get anywhere. I would just suggest increasing the model size, as it usually helps.
>
> cargo run --example whisper --profile=release-with-debug -- --input ~/Downloads/four-score.wav --model medium.en --task transcribe
> loaded wav data: Header { audio_format: 1, channel_count: 2, sampling_rate: 16000, bytes_per_second: 64000, bytes_per_sample: 4, bits_per_sample: 16 }
> pcm data loaded 636224
> loaded mel: [1, 80, 6000]
> 0.0s -- 30.0s:  Fast forward and seven years ago our fathers bore forth on this continent a new nation conceived of liberty and dedicated to the proposition that all men are created equal.
> 30.0s -- 60.0s:  We are engaged in a great civil war. Testing whether that mission or any mission
>
> Besides this, it would be great if you have the OpenAI whisper version at hand to give it a shot; if it performs well, I can have a look at trying to understand the discrepancies.

Awesome! Thanks, this was very helpful. I will leave the ticket open, do some tests, and report back. I have done quite a bit of work on Python MLOps GPU GitHub Codespaces, so it is pretty easy to go back and forth between Python and Rust and test things out.

Also, the new ssh-remote workflow is not horrible on AWS, so I can test on those as well.

LaurentMazare commented 1 year ago

Closing this one as there has been no recent activity; hopefully it's all sorted out.