huggingface / speechbox

Apache License 2.0

Improving timestamp accuracy in diarize.py by handling None #32

Open Demon-tk opened 10 months ago

Demon-tk commented 10 months ago

Dear Maintainers,

I am submitting this pull request as a proposed solution to issue #29. I found a few edge cases in the diarize.py script where an end timestamp may be None, which causes errors and misalignment between the diarizer and ASR timestamps. I also noticed that the alignment was not always precise, because it did not take the total duration of the inputs into account.

Here's a concise overview of the modifications I have made:

  1. Handling of None End Timestamp: Introduced a safety check so that if the last end timestamp from the ASR output is None, it is replaced by the total duration of the inputs. This acts as a safety net in case the ASR fails, for any reason, to provide an end timestamp for the last chunk.

  2. Alignment Condition: Added a condition so that the search for the ASR end timestamp closest to the diarizer's end timestamp runs only if the first end timestamp is not None. This keeps the alignment step from running on potentially faulty data.
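For reviewers, here is a minimal sketch of the two checks described above. It is not the actual diff: the helper names `fill_missing_last_end` and `closest_end_index` are hypothetical, and the chunk layout (a dict with a `"timestamp"` pair of start and end) is an assumption about the ASR output. The sketch also assumes that, apart from the cases handled here, the remaining end timestamps are populated.

```python
from typing import List, Optional


def fill_missing_last_end(chunks: List[dict], total_duration: float) -> List[dict]:
    """Return a copy of `chunks` in which a None end timestamp on the
    last chunk is replaced by `total_duration` (hypothetical helper)."""
    if not chunks:
        return []
    fixed = [dict(chunk) for chunk in chunks]  # shallow copies; input is not mutated
    start, end = fixed[-1]["timestamp"]
    if end is None:
        # Fall back to the total duration of the inputs.
        fixed[-1]["timestamp"] = (start, total_duration)
    return fixed


def closest_end_index(
    end_timestamps: List[Optional[float]], diarizer_end: float
) -> Optional[int]:
    """Return the index of the ASR end timestamp closest to `diarizer_end`,
    or None when the first end timestamp is None (hypothetical helper)."""
    if not end_timestamps or end_timestamps[0] is None:
        # Skip the search rather than align on potentially faulty data.
        return None
    return min(
        range(len(end_timestamps)),
        key=lambda i: abs(end_timestamps[i] - diarizer_end),
    )


# Usage: the last chunk has no end timestamp and the audio lasts 3.0 seconds.
chunks = [
    {"text": "hi", "timestamp": (0.0, 1.5)},
    {"text": "there", "timestamp": (1.5, None)},
]
fixed = fill_missing_last_end(chunks, total_duration=3.0)
ends = [chunk["timestamp"][1] for chunk in fixed]
idx = closest_end_index(ends, diarizer_end=2.9)
```

In this sketch the caller decides what to do when `closest_end_index` returns None, for example skipping that diarizer segment.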

These changes aim to make the code more robust against corner cases that may cause errors.

Please note that these modifications do not introduce breaking changes or alter existing functionality. They are intended to improve the precision and reliability of the diarize.py script. I hope these changes prove beneficial to the project.

Nate