m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

Slovak model alignment #82

Open Tortoise17 opened 1 year ago

Tortoise17 commented 1 year ago

Dear Friends. May I ask whether this model can be used for Slovak alignment? https://huggingface.co/infinitejoy/wav2vec2-large-xls-r-300m-slovak The confusing point for me is the labels: where do the labels come from if this model can be used?

Please guide me if you can.

emenems commented 1 year ago

Greetings, my fellow Slovak-language user. To use that model, you just have to add it to the whisperx/alignment.py file (before you install the package locally).

In other (English) words:

  1. git clone https://github.com/m-bain/whisperX.git
  2. Open whisperX/whisperx/alignment.py and add the model to DEFAULT_ALIGN_MODELS_HF: "sk": "infinitejoy/wav2vec2-large-xls-r-300m-slovak"
  3. Install the package: pip install -e whisperX
  4. Since you run the model locally, you will also need ffmpeg: apt update && apt install ffmpeg (use sudo if required)
  5. Just to be safe, also run pip install setuptools-rust
  6. Then run something like this in Python (here on a "dobre_rano*" podcast episode):
import whisperx

device = "cuda" 
audio_file = "dobre_rano_prvym_rozsudkom_sa_nic_nekonci_imrecze_moze_skoncit_aj_za_mrezami_10_2_2023.mp3"

# transcribe with original whisper using the Large model
model = whisperx.load_model("large", device)
result = model.transcribe(audio_file, verbose=False) # quickly get some popcorn, it will take ~10 min

# Print some segments so we can check in a minute whether the model works
for segment in result["segments"]:
    print(segment)

# Now comes the important part. Set the language code to 'sk' as registered in the whisperx/alignment.py file (you would get an error if the language were not known)
model_a, metadata = whisperx.load_align_model(language_code="sk", device=device)

# Run the alignment. This one is rather fast, no time to get additional chips
result_aligned = whisperx.align(result["segments"], model_a, metadata, audio_file, device)

# Format the output similarly to the plain whisper segments
for segment in result_aligned["segments"]:
    print(f"'start': {segment['start']}, 'end': {segment['end']}, 'text': {segment['text']}")
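The step-2 edit is just extending a plain dictionary. Here is a minimal sketch of the lookup that the language code goes through (the table below is abbreviated and `pick_align_model` is an illustrative helper, not the real WhisperX function; the actual table lives in whisperx/alignment.py):

```python
# Sketch of the language-to-model lookup that step 2 extends.
# Entries are abbreviated; pick_align_model is an illustrative helper.
DEFAULT_ALIGN_MODELS_HF = {
    "sk": "infinitejoy/wav2vec2-large-xls-r-300m-slovak",  # the added entry
}

def pick_align_model(language_code: str) -> str:
    # This is why an unknown language raises an error at load time
    if language_code not in DEFAULT_ALIGN_MODELS_HF:
        raise ValueError(f"No default alignment model for '{language_code}'")
    return DEFAULT_ALIGN_MODELS_HF[language_code]

print(pick_align_model("sk"))  # infinitejoy/wav2vec2-large-xls-r-300m-slovak
```

This is also why step 2 must happen before the editable install picks the file up: the mapping is read from that module at runtime.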

I am not suggesting this is a good wav2vec model to use; these are only instructions for how to utilise it in WhisperX. Try it yourself and you will see.
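As for the labels question: a wav2vec2 CTC model's labels are simply the characters in its tokenizer vocabulary (the vocab.json on the model's Hugging Face page), each mapped to one output index of the network. A toy sketch of that mapping (the vocabulary below is made up for illustration, not the actual Slovak model's vocab):

```python
# Toy example of a wav2vec2 CTC vocabulary: character -> output index.
# A real model ships this as vocab.json on the Hugging Face Hub.
toy_vocab = {"<pad>": 0, "<s>": 1, "</s>": 2, "<unk>": 3, "|": 4,
             "a": 5, "á": 6, "b": 7, "c": 8, "č": 9}

# Alignment needs the inverse mapping, so frame-wise argmax indices
# can be decoded back into characters:
id_to_label = {i: ch for ch, i in toy_vocab.items()}
print(id_to_label[9])  # č
```

With the real model you could inspect the actual labels via transformers, e.g. `Wav2Vec2Processor.from_pretrained(...)` and the tokenizer's vocabulary.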

m-bain commented 1 year ago

@Tortoise17 any results? Was this a good wav2vec2 model? I will add it to the defaults if you found it successful.