Issue with Testing a Fine-Tuned Pyannote Audio Model for Speaker Diarization

Winchester37 commented 8 months ago

Tested versions

pyannote.audio 3.1.1

System information

Windows 11 - pyannote.audio 3.1.1

Issue description

I fine-tuned a pyannote.audio model for speaker diarization on a custom dataset, but I am having trouble testing it. Despite following the documentation and adjusting the paths to the model checkpoint and configuration file, I get an error when I run the model on a new audio file.

Here's the training code snippet I used for fine-tuning:


# Training code snippet

`import os
import torch
os.environ["PYANNOTE_DATABASE_CONFIG"] = "/yedek/pyannote/gsmDatasets202/datasets.yaml"

from pyannote.database import registry , FileFinder
registry.load_database("/yedek/pyannote/gsmDatasets202/datasets.yaml")
dataset = registry.get_protocol("DATATEST.SpeakerDiarization.main", {"audio": FileFinder()})

from pyannote.audio.tasks import SpeakerDiarization
from pyannote.audio.models.segmentation import PyanNet

task = SpeakerDiarization(
    dataset,
    duration=5.0,
    max_speakers_per_chunk=2,
    max_speakers_per_frame=2,
    batch_size=128,
    num_workers=8,
    loss="bce"
)

model = PyanNet(task=task)

# this takes approximately 15min to run on Google Colab GPU
torch.set_float32_matmul_precision('high')
from types import MethodType
from torch.optim import Adam
from pytorch_lightning.callbacks import (
    EarlyStopping,
    ModelCheckpoint,
    RichProgressBar,
)

# we use Adam optimizer with 1e-4 learning rate
def configure_optimizers(self):
    return Adam(self.parameters(), lr=1e-4)

model.configure_optimizers = MethodType(configure_optimizers, model)

# we monitor diarization error rate on the validation set
# and use it to keep the best checkpoint and stop early
monitor, direction = task.val_monitor
checkpoint = ModelCheckpoint(
    monitor=monitor,
    mode=direction,
    save_top_k=1,
    every_n_epochs=1,
    save_last=False,
    save_weights_only=False,
    filename="{epoch}",
    verbose=False,
)
early_stopping = EarlyStopping(
    monitor=monitor,
    mode=direction,
    min_delta=0.0,
    patience=10,
    strict=True,
    verbose=False,
)

callbacks = [RichProgressBar(), checkpoint, early_stopping]

# we train for at most 200 epochs (might be shorter in case of early stopping)
from pytorch_lightning import Trainer
trainer = Trainer(accelerator="gpu",
                  callbacks=callbacks,
                  max_epochs=200,
                  gradient_clip_val=0.5)
trainer.fit(model)

finetuned_model = checkpoint.best_model_path

print(finetuned_model)
`
And this is the testing code that leads to errors:

`from pyannote.audio import Model
import json

# Paths to the model checkpoint and the configuration file
MODEL_PATH = "lightning_logs/version_24/checkpoints/epoch=57.ckpt"
CONFIG_PATH = "lightning_logs/version_9/hparams.yaml"
AUDIO_FILE_PATH = "wav2/20240123_112622.mp3"  # Audio file to test

# Load the ready-made pipeline for speaker diarization
pipeline = Model.from_pretrained(MODEL_PATH)

# Run diarization on the audio file
diarization = pipeline(AUDIO_FILE_PATH)

# Collect the diarization results
output = []
for segment, _, speaker in diarization.itertracks(yield_label=True):
    start = round(segment.start, 2)  # Time the speech starts (in seconds)
    end = round(segment.end, 2)  # Time the speech ends (in seconds)
    output.append({"speaker": speaker, "start": start, "end": end})

# Print the results as JSON
print(json.dumps(output, indent=4))`

`Traceback (most recent call last):
  File "C:\Users\serca\PycharmProjects\pyannote\nemoo.py", line 14, in <module>
    diarization = pipeline(AUDIO_FILE_PATH)
  File "C:\Users\serca\PycharmProjects\pyannote\venv2\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\serca\PycharmProjects\pyannote\venv2\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\serca\PycharmProjects\pyannote\venv2\lib\site-packages\pyannote\audio\models\segmentation\PyanNet.py", line 172, in forward
    outputs = self.sincnet(waveforms)
  File "C:\Users\serca\PycharmProjects\pyannote\venv2\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\serca\PycharmProjects\pyannote\venv2\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\serca\PycharmProjects\pyannote\venv2\lib\site-packages\pyannote\audio\models\blocks\sincnet.py", line 81, in forward
    outputs = self.wav_norm1d(waveforms)
  File "C:\Users\serca\PycharmProjects\pyannote\venv2\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\serca\PycharmProjects\pyannote\venv2\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\serca\PycharmProjects\pyannote\venv2\lib\site-packages\torch\nn\modules\instancenorm.py", line 71, in forward
    self._check_input_dim(input)
  File "C:\Users\serca\PycharmProjects\pyannote\venv2\lib\site-packages\torch\nn\modules\instancenorm.py", line 161, in _check_input_dim
    if input.dim() not in (2, 3):
AttributeError: 'str' object has no attribute 'dim'`

I'm looking for guidance on how to properly test the fine-tuned Pyannote Audio model or if there's any specific step I might be missing. Any help or pointers towards resolving this issue would be greatly appreciated.

Thank you in advance for your assistance.

### Minimal reproduction example (MRE)

https://colab.research.google.com/github/pyannote/pyannote-audio/blob/develop/tutorials/MRE_template.ipynb#scrollTo=gVrDtBcusDbK

FrenchKrab commented 8 months ago

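I think you are confusing pyannote's "models" (pyannote.audio.models.....) with pyannote's "pipelines" (pyannote.audio.pipelines.....). The model that you fine-tune or train is the "segmentation" model: it performs the speaker diarization task locally, on windows of duration=5.0 seconds. It is a torch module that expects a waveform tensor, not a file path, which is why calling it on AUDIO_FILE_PATH ends in AttributeError: 'str' object has no attribute 'dim'.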

To obtain the final diarization output for a whole audio file, the outputs of this local segmentation model have to be aggregated across windows. See the paper "pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe" for more details.
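
To make the aggregation idea concrete, here is a toy sketch in plain Python. It is not pyannote's implementation: `sliding_windows`, `aggregate`, and the frame counts are hypothetical names and values chosen for illustration. It only shows how per-window scores can be averaged where windows overlap. The real pipeline also has to reconcile speaker labels across windows (via speaker embeddings and clustering), which this sketch ignores.

```python
# Toy illustration only: NOT pyannote's actual implementation.
# A local model scores fixed-length windows; scores from overlapping
# windows are averaged to obtain one score per frame for the whole file.


def sliding_windows(total_frames, window, step):
    """Return (start, end) frame indices of windows covering the file."""
    if total_frames <= window:
        return [(0, total_frames)]
    starts = list(range(0, total_frames - window + 1, step))
    # Add a final window aligned to the end so the tail is covered.
    if starts[-1] + window < total_frames:
        starts.append(total_frames - window)
    return [(start, start + window) for start in starts]


def aggregate(total_frames, windows, window_scores):
    """Average per-frame scores over all windows that cover each frame."""
    sums = [0.0] * total_frames
    counts = [0] * total_frames
    for (start, end), scores in zip(windows, window_scores):
        for offset, score in enumerate(scores[: end - start]):
            sums[start + offset] += score
            counts[start + offset] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


# Two overlapping 2-frame windows over a 3-frame file.
# Frame 1 is covered by both windows, so its two scores are averaged.
example_windows = [(0, 2), (1, 3)]
example_scores = [[1.0, 0.0], [1.0, 1.0]]
print(aggregate(3, example_windows, example_scores))
```

In this example the middle frame receives scores 0.0 and 1.0 from the two windows, so its aggregated score is their mean.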

There may be examples in a pyannote tutorial notebook, but I can't remember which one, so here is a pretty complete notebook about training a model and testing its pipeline (in particular the "Adapted pipeline output" section).
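
In the meantime, a minimal sketch of that approach could look like the following. It is a sketch under assumptions, not a verified recipe: it assumes pyannote.audio 3.1, access to the gated "pyannote/speaker-diarization-3.1" pipeline on Hugging Face, and that the pretrained pipeline exposes `embedding`, `embedding_exclude_overlap`, and `klustering` as attributes. The helper names `build_finetuned_pipeline` and `to_records` are made up for this example. It also reuses the pretrained hyperparameters as-is, although they would normally be re-tuned on a development set for a new segmentation model.

```python
def build_finetuned_pipeline(
    checkpoint_path,
    pretrained_name="pyannote/speaker-diarization-3.1",  # assumption: gated HF pipeline
    use_auth_token=None,
):
    """Wrap a fine-tuned segmentation checkpoint in a full diarization pipeline."""
    # Imported lazily so the helpers below can be used without pyannote installed.
    from pyannote.audio import Model, Pipeline
    from pyannote.audio.pipelines import SpeakerDiarization

    # The pretrained pipeline provides the embedding model and clustering setup.
    pretrained = Pipeline.from_pretrained(pretrained_name, use_auth_token=use_auth_token)

    # The checkpoint written by ModelCheckpoint is a *model*, not a pipeline.
    finetuned_model = Model.from_pretrained(checkpoint_path)

    # Assumption: these attribute names match pyannote.audio 3.1.
    pipeline = SpeakerDiarization(
        segmentation=finetuned_model,
        embedding=pretrained.embedding,
        embedding_exclude_overlap=pretrained.embedding_exclude_overlap,
        clustering=pretrained.klustering,
    )

    # Reuse the pretrained hyperparameters as a starting point (not re-tuned here).
    pipeline.instantiate(pretrained.parameters(instantiated=True))
    return pipeline


def to_records(diarization):
    """Convert a diarization result into a JSON-friendly list of dicts."""
    records = []
    for segment, _, speaker in diarization.itertracks(yield_label=True):
        records.append(
            {
                "speaker": speaker,
                "start": round(segment.start, 2),  # seconds
                "end": round(segment.end, 2),  # seconds
            }
        )
    return records
```

With the paths from the original post, usage would be along the lines of `pipeline = build_finetuned_pipeline("lightning_logs/version_24/checkpoints/epoch=57.ckpt", use_auth_token=...)`, then `diarization = pipeline("wav2/20240123_112622.mp3")`, and finally `print(json.dumps(to_records(diarization), indent=4))`. The key difference from the failing script is that the object being called on the audio file is a pipeline built around the model, not the model itself.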

