pyannote / pyannote-audio

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
http://pyannote.github.io
MIT License

DER Calculation on the Aishell-4 Dataset Using pyannote.audio Model Returns NaN #1790

Closed sipercai closed 3 days ago

sipercai commented 4 days ago

Tested versions

pyannote.audio 3.3.2
pyannote.core 5.0.0
pyannote.metrics 3.2.1

System information

CentOS Linux 7 - pyannote.audio 3.3.2

Issue description

Title: NaN issue when calculating DER in local environment but not in Colab

Description: When calculating the Diarization Error Rate (DER), I encountered an issue where the result is NaN. Through debugging, I found that in the discrete_diarization_error_rate function, the value of np.sum(reference) might be 0. This leads to a division by zero, which in turn causes the compute_metric function to return NaN. However, when I tried to reproduce this problem in Google Colab, the attempt failed: in the Colab environment I got normal results without encountering the NaN issue. I suspect there might be some difference between my local CentOS 7 environment and the Colab environment that is causing this discrepancy. I'm hoping to get some help figuring out what's going on and how to solve this problem in my local setup.

Here are the details of my local environment and the relevant code snippet:

Local environment:
Operating system: CentOS 7
Python version: tried 3.9, 3.10, and 3.11
Installed libraries and their versions: pip install pyannote.audio==3.3.2

Code details: In the discrete_diarization_error_rate function, the calculation involves the following steps (simplified example):

reference = reference.astype(np.half)
hypothesis = hypothesis.astype(np.half)

# find the optimal speaker mapping between reference and hypothesis
(hypothesis,), _ = permutate(reference[np.newaxis], hypothesis)

# total amount of reference speech -- zero when the reference is empty
total = 1.0 * np.sum(reference)

# difference in speaker activity between hypothesis and reference
detection_error = np.sum(hypothesis, axis=1) - np.sum(reference, axis=1)
false_alarm = np.maximum(0, detection_error)
missed_detection = np.maximum(0, -detection_error)

# active frames assigned to the wrong speaker
confusion = np.sum((hypothesis != reference) * hypothesis, axis=1) - false_alarm

false_alarm = np.sum(false_alarm)
missed_detection = np.sum(missed_detection)
confusion = np.sum(confusion)

# divides by zero (hence NaN) when total == 0
der = (false_alarm + missed_detection + confusion) / total

As you can see, when np.sum(reference) is 0, the final division produces NaN. But this doesn't seem to happen in Colab, and I'm not sure what the difference is. Any help or suggestions would be greatly appreciated.
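For what it's worth, the failure mode described above is easy to demonstrate with plain NumPy: when the reference contains no active speaker, total is 0.0 and the final division yields NaN. A minimal sketch of a defensive guard follows; the safe_der helper is hypothetical, not part of pyannote.audio:

```python
import numpy as np

def safe_der(false_alarm: float, missed_detection: float,
             confusion: float, total: float) -> float:
    """Return DER, guarding against an empty reference (total == 0)."""
    if total == 0.0:
        # no reference speech at all: DER is undefined, so fail loudly
        # instead of silently propagating NaN through the metric
        raise ValueError("reference contains no speech; DER is undefined")
    return (false_alarm + missed_detection + confusion) / total

# an empty reference reproduces the NaN: 0.0 / 0.0 -> nan in NumPy
with np.errstate(invalid="ignore"):
    nan_der = np.float64(0.0) / np.float64(0.0)
print(np.isnan(nan_der))  # True
```

Note that plain Python floats raise ZeroDivisionError here, while NumPy scalars return nan with only a RuntimeWarning, which is why the NaN can propagate unnoticed into compute_metric.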

Minimal reproduction example (MRE)

https://drive.google.com/file/d/1-7-hVke3V9j54XL5sbX198YaVe4QSaQg/view?usp=sharing

clement-pages commented 4 days ago

Hey @sipercai, did you check whether the content of the corresponding RTTM and UEM files is correct? Based on what you say, it seems that there is no active speaker in the reference, maybe because the RTTM file is empty, or because the intersection between the speaker turns and the UEM is empty (for whatever reason). You could also check the content of the reference before computing the metric, just to make sure that the reference is not empty.
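A quick way to run that sanity check without going through pyannote is to parse the annotation files directly and measure how much reference speech actually falls inside the UEM. The sketch below assumes the standard RTTM layout (SPEAKER uri channel start duration ...) and UEM layout (uri channel start end); the file names in the usage comment are placeholders:

```python
def overlap_with_uem(rttm_path: str, uem_path: str) -> float:
    """Total reference speech duration (seconds) that falls inside the UEM."""
    # parse UEM lines: <uri> <channel> <start> <end>
    uem = []
    with open(uem_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 4:
                uem.append((float(parts[2]), float(parts[3])))

    # parse RTTM SPEAKER lines: SPEAKER <uri> <channel> <start> <duration> ...
    total = 0.0
    with open(rttm_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 5 and parts[0] == "SPEAKER":
                start, dur = float(parts[3]), float(parts[4])
                end = start + dur
                # accumulate the intersection with every UEM segment
                for u_start, u_end in uem:
                    total += max(0.0, min(end, u_end) - max(start, u_start))
    return total

# if this prints 0.0, the reference is effectively empty and DER will be NaN
# print(overlap_with_uem("session.rttm", "session.uem"))
```

A result of 0.0 points to exactly the situation described above: either no SPEAKER lines at all, or speaker turns that lie entirely outside the evaluated UEM regions.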

sipercai commented 4 days ago

> Hey @sipercai, Did you check whether the content of corresponding RTTM and UEM file is correct ? Based on what you say, it seems that there is no active speaker in the reference, maybe because RTTM file is empty, or because intersection between speaker turns and uem is empty (for any reason). You could also check the content of reference before computing the metric, just to be sure that the reference is not empty.

Thank you for taking the time to reply. I'm working on figuring out the cause of the errors in the DER calculation and have started looking into the RTTM and UEM files. You can check the sample at this link: https://drive.google.com/file/d/1-7-hVke3V9j54XL5sbX198YaVe4QSaQg/view?usp=sharing. I'm still puzzled about why it produces non-NaN outputs when running in Colab. Also, may I ask for your help: is there a good way to save the hypothesis to a file and then read it back later? This would be a great help for debugging DiscreteDiarizationErrorRate in pyannote.audio.utils.metric. Thank you for your patient explanation.

clement-pages commented 4 days ago

> Moreover, I'd like to ask for your help. Is there a good way to save the hypothesis as a file and then read it back later? This would be of great help for me to debug the DiscreteDiarizationErrorRate in pyannote.audio.utils.metric.

You can do that by following this code snippet:

# instantiate the pipeline
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
  "pyannote/speaker-diarization-3.1",
  use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# run the pipeline on an audio file
diarization = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)

And to load the RTTM file:

from pyannote.database.util import load_rttm

uri = ...  # file URI
diarization = load_rttm("path/to/your/rttm/file")[uri]

sipercai commented 3 days ago

Dear @clement-pages, I wanted to share an update. I have now successfully obtained normal output values when using the metric to evaluate two RTTM results. While evaluating the output of the segmentation model, which is a SlidingWindowFeature, should in theory allow a comparison before and after fine-tuning, I think evaluating the full pipeline is the more accurate approach. I'm not sure whether my understanding here is correct, though. I still haven't fully understood why the segmentation evaluation failed or why the results in Colab were different, but I will look into these details more carefully when I have more time.

For now, I'm closing this issue. Thank you very much for your patient responses and explanations throughout this process; they have been really helpful. Best regards.