Speech durations extraction with brouhaha's VAD

ariadnasc commented 3 months ago

Hi! I've been working with the dataspeech package for a few weeks now, and I bumped into a behaviour that could be an issue for specific datasets (like the one I am using) but can easily be fixed - I'm describing it below.

Description of the issue

When running the data extraction pipeline for a dataset, i.e. running main.py, I encountered the following error message:

pyarrow.lib.ArrowInvalid: Float value 2.3625 was truncated converting to int64

I have figured out that the issue occurs when the speech duration output by Brouhaha's VAD model for the first sample of a batch is an integer (which could be 0 if the VAD finds no speech, but could also happen if the duration is, e.g., exactly 1 second). When this happens, the speech_durations list is created as a list of integers (not floats), which causes this error on the next sample: its float value would have to be truncated to fit the integer type.

Proposed solution

As a quick fix that has worked for me, I have wrapped the speech duration (vad_duration) in np.float32() before appending it to the list. I made this change at line 49 of dataspeech/gpu_enrichments/snr_and_reverb.py. With this change, the pipeline runs successfully. The change looks like this:

snr.append(res["snr"][mask].mean())
c50.append(res["c50"][mask].mean())
vad_durations.append(np.float32(vad_duration))
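
For illustration only, here is a minimal standalone sketch of the same idea: coerce every duration to a float before it is stored, so the type of the list no longer depends on the first value. The helper name collect_durations and the sample values are hypothetical and are not part of dataspeech; the built-in float() stands in for np.float32() to keep the sketch dependency-free.

```python
def collect_durations(raw_durations):
    """Return the durations with every value coerced to a Python float."""
    return [float(d) for d in raw_durations]


# Hypothetical VAD outputs: the first and last values are integers.
raw = [0, 2.3625, 1]

# Before coercion the list mixes int and float values.
mixed_types = {type(d).__name__ for d in raw}

# After coercion every element is a float, whatever the first value was.
fixed = collect_durations(raw)
fixed_types = {type(d).__name__ for d in fixed}

print(mixed_types, fixed_types, fixed)
```

Either cast should work here; np.float32() just stores the value in single precision, whereas float() keeps it as a double.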

If you agree with the change I propose, I am happy to create a commit and push it. Otherwise, please advise on how best to prevent this from happening.

Thanks!

IIEleven11 commented 3 months ago

I had the same issue. The solution above worked.

ylacombe commented 2 months ago

Hey @ariadnasc, thanks for opening this issue! Could you open a PR to deal with this? Thanks in advance!

ariadnasc commented 2 months ago

Hey @ylacombe, no worries! I'll submit the PR tomorrow :)

ariadnasc commented 2 months ago

@ylacombe I've just opened the PR - thanks!

ylacombe commented 2 months ago

It's fixed, thanks for your help!

huggingface / dataspeech

Speech durations extraction with brouhaha's VAD #33
