jcaw opened this issue 3 years ago
Thanks for bringing this up, @jcaw. Apologies for a delayed response.
The current vad implementation is based on the behavior of sox, which "Attempts to trim silence and quiet background sounds..." (http://sox.sourceforge.net/sox.html). Sox produces a truncated waveform, not a same-sized waveform filled with zeros.
Since the sox implementation of vad has multi-channel support, a batched input is treated as multiple channels, and the entire batch is truncated to the earliest start of voice activity in any channel.
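For illustration, a toy sketch of this truncation behavior (not torchaudio's actual implementation, which follows sox's algorithm; the threshold and function name here are made up) might look like:

```python
import torch

def earliest_activity_trim(batch: torch.Tensor, threshold: float = 0.1) -> torch.Tensor:
    # Toy sketch of the sox-style behavior described above: the whole batch is
    # treated as channels and cut at the earliest index where ANY row exceeds
    # the amplitude threshold, so every row shares the same cut point.
    active = (batch.abs() > threshold).any(dim=0)  # collapse over rows
    idx = torch.nonzero(active)
    start = int(idx[0]) if idx.numel() else batch.size(1)
    return batch[:, start:]

# Activity starts at t=2 in row 0 and t=5 in row 1; the shared cut is t=2,
# so row 1 keeps three leading "silent" samples.
x = torch.zeros(2, 10)
x[0, 2:] = 0.5
x[1, 5:] = 0.5
print(earliest_activity_trim(x).shape)  # torch.Size([2, 8])
```

This is why two independent recordings in a batch can come out differently than when processed one by one: the cut point is coupled across rows.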
I see how the current behavior can be confusing and might require a more thorough design. For now, we can disable batch consistency testing for vad and update the documentation to explain the current behavior.
Thoughts, @mthrok and @vincentqb?
@astaff
Thanks for the explanation.
I do not think it makes sense to treat channels and samples the same here. For channels, we can reasonably make some strong assumptions, such as that audio tracks in different channels share the same time axis and record the same event, which can justify the implementation of F.vad. For samples, no such assumption applies, especially when training a model and randomly shuffling audio files from various recording situations.
So wrapping F.vad in T.VAD just because other functionals for filtering do so does not make sense. If we want to have VAD for batch processing, we need to think of an expected usage before we start designing it. VAD itself is a big topic in the audio application domain and an area of active research and production. There are novel NN approaches published for VADs.
My opinion is that VAD as a library function should be used more for inference than training. This is because when you train a model, you typically need to have annotated data, and you do not want to mess with the time frame.
On the other hand, in the application layer, you want to have VAD to determine which frame to feed to an ASR engine. For this kind of process, what you want is a detector function that tells whether an input frame contains voice activity or not. So the return type is boolean, and it has to be fast. The VAD from WebRTC is one such popular implementation: https://chromium.googlesource.com/external/webrtc/+/518c683f3e413523a458a94b533274bd7f29992d/webrtc/modules/audio_processing/vad/voice_activity_detector.h IIRC, it was also used in the DeepSpeech example: https://github.com/mozilla/DeepSpeech-examples/blob/r0.9/mic_vad_streaming/mic_vad_streaming.py
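A frame-level detector of this kind can be sketched with a toy energy threshold. This is not WebRTC's actual algorithm; the function name, frame length, and threshold are illustrative. It only shows the boolean-per-frame interface shape:

```python
import math
import torch

def frame_energy_vad(waveform: torch.Tensor, frame_len: int = 160,
                     threshold: float = 1e-3) -> torch.Tensor:
    # Hypothetical minimal energy-based detector in the spirit of frame-level
    # VADs such as WebRTC's: one boolean per non-overlapping frame, cheap
    # enough to run in a streaming loop in front of an ASR engine.
    n = waveform.numel() - waveform.numel() % frame_len  # drop trailing partial frame
    frames = waveform[:n].reshape(-1, frame_len)
    return frames.pow(2).mean(dim=1) > threshold

# 1 s of silence followed by 1 s of a 440 Hz tone at 16 kHz
sr = 16000
t = torch.arange(sr, dtype=torch.float32) / sr
signal = torch.cat([torch.zeros(sr), 0.5 * torch.sin(2 * math.pi * 440 * t)])
flags = frame_energy_vad(signal)  # 200 booleans: first 100 False, last 100 True
```

A caller would feed only the frames where the flag is True to the ASR engine, which is exactly the inference-time usage described above.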
Therefore I propose to remove T.VAD from the library code. Padding the output Tensor is one way to resolve the issue, but since there is no universally agreed way to apply a multi-channel VAD function to multiple samples, we should let users decide how different samples in a mini-batch should be handled.
Filed the task for removing T.VAD from the batch consistency test: https://github.com/pytorch/audio/issues/1449
Considering your points above, @mthrok, here is the suggestion:

- Make it an explicit expectation that VAD takes and returns a tensor of the same size and shape: either T or N x T (T is the number of frames/samples, N the batch size), and the input is a single channel or a batch of single channels.
- VAD returns a mask for voice activity.
- Calling code takes care of re-shaping multi-channel input and applying the returned mask in a way that makes most sense for the task at hand.

Just to re-emphasize some of what has been said here: I agree with the comment that we can disable the batch consistency test and need to make sure the docstring represents what is being done.
As to how VAD should trim when presented with a batch, I'd like to get the opinion of someone who may have used this in the wild. @faroit maybe you have an opinion on this?
A few questions on the suggestion above:
Do you mean to suggest to disable batch + channel support (e.g. N x C x T)?
(I'd note that this interface would be different from apply_effects_tensor (though the latter does not apply to batches) which apply transformations from sox.)
Is there a performance cost when offsetting the application of a mask returned by VAD? I'd expect elementwise multiplication by boolean to be fairly cheap.
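A minimal sketch of the mask-returning interface suggested above, using a toy amplitude threshold as a stand-in for a real detector (function name and threshold are hypothetical):

```python
import torch

def vad_mask(batch: torch.Tensor, threshold: float = 0.1) -> torch.Tensor:
    # Sketch of the proposed interface: input (N, T) or (T,), output a boolean
    # tensor of the same shape, with no coupling between rows of the batch.
    return batch.abs() > threshold

x = torch.zeros(2, 8)
x[0, 1:] = 0.5
x[1, 4:] = 0.5
mask = vad_mask(x)

# The caller decides how to consume the mask:
muted = x * mask                               # elementwise application, cheap
trimmed = [row[m] for row, m in zip(x, mask)]  # or per-sample trimming (ragged)
```

Because each row's mask is independent, batch consistency holds by construction, and the "how to trim a batch" question moves entirely into user code.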
Do you mean to suggest to disable batch + channel support (e.g. N x C x T)?
yes. this way we don't have to be opinionated about multi-channel behavior. the user code can re-shape a 3D tensor that has channels into 2D, apply the transform, and then apply a resulting mask and re-shape the output in a way that makes sense for downstream task: it could be an iterator over samples trimmed to voice activity in either channel, or elementwise multiplication, or something else.
I'd note that this interface would be different from apply_effects_tensor (though the latter does not apply to batches) which apply transformations from sox.
agreed. should we keep backend-specific function signatures aligned with conventions of their respective backends?
Is there a performance cost when offsetting the application of a mask returned by VAD? I'd expect elementwise multiplication by boolean to be fairly cheap.
i would expect that too. offsetting the application of a mask returned by VAD is proposed to let the user implement the behavior that makes the most sense for a given task.
yes. this way we don't have to be opinionated about multi-channel behavior. the user code can re-shape a 3D tensor that has channels into 2D, apply the transform, and then apply a resulting mask and re-shape the output in a way that makes sense for downstream task: it could be an iterator over samples trimmed to voice activity in either channel, or elementwise multiplication, or something else.
i'd note, however, that disabling batching is BC-breaking, since we have silently been opinionated.
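The reshape-then-mask flow quoted above could be sketched as follows, again with a toy threshold standing in for a real VAD (the `vad_mask` helper is hypothetical):

```python
import torch

def vad_mask(x: torch.Tensor, threshold: float = 0.1) -> torch.Tensor:
    # hypothetical per-row mask-returning VAD: (rows, T) -> boolean (rows, T)
    return x.abs() > threshold

n, c, t = 4, 2, 100
audio = torch.randn(n, c, t)

# Fold channels into the batch dimension, apply the 2D transform, unfold.
mask = vad_mask(audio.reshape(n * c, t)).reshape(n, c, t)

# One possible downstream choice: a frame is "active" if either channel is.
either = mask.any(dim=1, keepdim=True)  # (N, 1, T), broadcastable over channels
gated = audio * either
```

Other downstream choices (per-sample trimming, iterating over ragged segments, etc.) fit the same pattern; the library never has to pick one.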
I'd note that this interface would be different from apply_effects_tensor (though the latter does not apply to batches) which apply transformations from sox.
agreed. should we keep backend-specific function signatures aligned with conventions of their respective backends?
we are not bound by constraints and design decisions from other backends, and it is more important to be consistent in how we present the interfaces to the users. we can always show examples, say in the docstring, of how to replicate the behavior with other backends -- or even let tests document the mapping :)
Is there a performance cost when offsetting the application of a mask returned by VAD? I'd expect elementwise multiplication by boolean to be fairly cheap.
i would expect that too. offsetting the application of a mask returned by VAD is proposed to let the user implement the behavior that makes the most sense for a given task.
the options to address the batch consistency behavior that I see are the following, and please feel free to comment if i'm missing any:

1. Remove F.vad and T.vad entirely. Can we confirm whether this duplicates the functionality of apply_effects_tensor? Both are torchscriptable. F.vad works on GPU, but apply_effects_tensor does not, https://github.com/pytorch/audio/issues/1456. Can we confirm performance differences?
2. Leave the behavior as is, but document it.
3. Remove T.vad only. The original design choice for functionals and transforms, though, was that both did the same thing, but one was a functional interface and the other had state.

I see "changing the interface to return a mask" as addressing a different issue, since we still end up having to decide how batching would be supported by the function. It may make going for option 1 (or 3) easier.
there are many "solid science-backed" ways to do VAD (and note that we have two already: the one discussed above and the one in the example). Having multiple algorithms available (and even adding new ones like WebRTC's mentioned above) to the user for VAD adds to the richness of the library. I'd love to hear from our users on which algorithms are desired (opened #1468 for this).
The most important thing is to document what happens and communicate to the user how the code has been and is working, and the rest can wait for strong feedback from the community. I favor option 2 (leaving this as is, but add doc), until we hear such feedback.
@astaff and @mthrok -- I see PR #1513 merged with a mention that the PR should bring this issue to a conclusion, but there are no comments about this here. It looks like option 1 is selected through the PR. However, the warning message is missing alternative steps the user could follow to reproduce the previous behavior. This can be done in the description of the relevant issue to make it easier for the user to follow.
The change in #1513 does not change the behavior; it only adds a warning, and it seems to be closer to no. 2 proposed above.
I am not sure which "2" you two are referring to, as both of you left comments with ordered lists.
My understanding is that #1513 is a temporary measure, as this discussion is not concluded.
I think that if we want to retain the VADs, we should change them to return a mask, as I cannot think of a use case for the current interface. Returning a mask would allow batched/multi-channel input to be handled easily.
I isolated a potential issue with batch consistency and F.vad in #1341:
@mthrok replied:
Opening a dedicated issue so the discussion can continue here.