miguelvalente / whisperer

Go from raw audio files to a text-audio dataset automatically with OpenAI's Whisper.

First feedback #3

Closed Ca-ressemble-a-du-fake closed 1 year ago

Ca-ressemble-a-du-fake commented 1 year ago

Hi Miguel,

I could finally give your dataset generator a try! It works pretty well for me, so thanks again for sharing it!

I have a few questions and remarks:

  1. When I use the stock Whisper large-v2 model, it takes around 12-14 GB of VRAM (I can't remember exactly). Why does your generator use 18 GB? Is there a memory leak, or is it due to the batch size (I wasn't aware Whisper had one!)?
  2. What is the batch size used for?
  3. Changing the dataset name in the config file has no effect. Only the command-line argument is taken into account.
  4. Language could be a parameter in the config (I use fr).
  5. The min and max lengths are in characters, right? Is the loc also in characters, or in seconds? Should I change that "loc"?
  6. Most of the time the cutting is good, but sometimes it is off by a few ms. For example, "ty" is missing from "electricity", or "by 10%" is missing (written in the transcription but cannot be heard). Could this be improved, or is it due to the model? I'm not sure using stable-ts would mitigate this problem, because you pass "without_timestamps" to Whisper's transcribe.
  7. By the way, how can you cut the audio so nicely without resorting to Whisper timestamps?
  8. Oh, and I forgot to mention that poetry got stuck during install. I let it run for a couple of hours and it was still at point 2 or 3. Then I restarted it and the same thing happened. So I used the venv that I use for Whisper, and fortunately it worked.
  9. Last but not least, my audio files come from YouTube, so their names include spaces, punctuation, and so on. The subprocess Popen call does not cope with them. I usually use subprocess check_output instead, with each argument separated by a comma.
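
To illustrate point 9, here is a minimal sketch of passing a file name with spaces and punctuation to a child process as a list of arguments, so no shell quoting is involved. The helper name `run_with_args` and the file name are made up for illustration, and this is not the repository's code. A child Python process stands in for FFmpeg so the example runs anywhere.

```python
import subprocess
import sys


def run_with_args(args):
    """Run a command given as a list of arguments and return its stdout as text.

    Because the command is a list (not a single shell string), each element
    reaches the child process as exactly one argument, spaces included.
    """
    return subprocess.check_output(args, text=True)


# Hypothetical file name with spaces and punctuation, like a YouTube title.
filename = "How to do XYZ (part 1), really!.wav"

# The child process simply prints back the first argument it received.
echoed = run_with_args(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", filename]
)
print(echoed.strip())
```

The same pattern would apply to an FFmpeg call, for example a list such as `["ffmpeg", "-i", filename, "out.wav"]`, assuming `ffmpeg` is on the PATH.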

Looking forward to reading from you!

miguelvalente commented 1 year ago

Hey @Ca-ressemble-a-du-fake , thanks for taking it for a spin, and for the lengthy feedback.

  1. The batch size definitely affects the amount of VRAM used.
  2. The batch size indicates how many audio segments are transcribed by Whisper at a time.
  3. You are correct. The dataset_name is now set by passing it as an argument alongside transcribe.
  4. I'll look into adding the languages supported by Whisper.
  5. The min-max is indeed in characters. The loc is in seconds. I'll try to make this clearer. You can change the loc to shift the center of the distribution.
  6. Since the splitting is automatic, there will always be some cases where it fails. Those can be safely deleted. Beyond that, you can play around with the audio split configurations in config.py to see if you get better results. I'll also add a notebook to visualize the silences in audio files so that you can easily adjust the configs to your case.
  7. I guess you can take a look at the code for that one, hahah.
  8. Can you please provide some screenshots or prints with the error? Also your OS.
  9. Can you give me an example of a name? For testing purposes.
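
As a rough illustration of points 1 and 2, the sketch below groups audio segments into batches of a fixed size. The function name `make_batches` and the segment file names are hypothetical, and this is not taken from the repository. It only shows the general idea that a larger batch size means more segments are handled at once, which is consistent with higher VRAM use.

```python
def make_batches(segments, batch_size):
    """Split a list of segments into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        segments[start:start + batch_size]
        for start in range(0, len(segments), batch_size)
    ]


# Hypothetical segment names, standing in for the split audio files.
segments = [f"segment_{i}.wav" for i in range(7)]

# With a batch size of 3, seven segments yield batches of 3, 3 and 1.
batches = make_batches(segments, 3)
for batch in batches:
    print(batch)
```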

Once again, thanks for the helpful feedback. I'll try to get to it ASAP. I'm also planning to implement diarization soon :)

Ca-ressemble-a-du-fake commented 1 year ago

Thanks for your answers, now I understand your code better!

Regarding 8. (poetry), it's running under the latest Ubuntu 22.04 and the output gets stuck at this stage:

(whisperenv) caraduf@caraduf-gpu:~/Whisperer/whisperer$ poetry install
Installing dependencies from lock file

Package operations: 75 installs, 21 updates, 0 removals

  • Installing platformdirs (2.6.0): Pending...
  • Installing pyrsistent (0.19.2): Pending...
  • Installing traitlets (5.6.0): Pending...

By the way, why not just use the more widespread requirements.txt, or make a pip package?

For 9., you can take any name that contains a space, such as "How to do XYZ". This is described in this question.

miguelvalente commented 1 year ago

Hello again.

:heavy_check_mark: I fixed the FFmpeg bug, any name should be able to work now.

:white_check_mark: I've also updated poetry.lock to see if it works now. I use poetry because it automatically resolves dependency issues and it's pretty good for versioning projects.

:pray: I'd really appreciate it if you could take out some time to help me figure out this poetry error. Could you follow these steps:

According to instructions found here

  1. Delete the project folder and clone the updated version of the repo.
  2. Without initializing an environment, run poetry install
  3. See if it works.
  4. Otherwise, try the following commands to clear poetry's cache:
     poetry cache clear PyPI --all
     poetry cache clear _default_cache --all
     poetry install

miguelvalente commented 1 year ago

Closing due to no feedback

Ca-ressemble-a-du-fake commented 1 year ago

Oh yeah, sorry! At first I didn't want to mess up my environment, then I worked on manually cleaning my datasets, and finally I forgot to follow up 😄

miguelvalente commented 1 year ago

Haha, no problem. It's also worth pointing out that you can diarize across audio files :). Let me know if you end up trying my fix protocol.