Open videodanchik opened 2 years ago
Yeah, IIRC I left it like that as @danpovey was thinking it would either not make a difference or maybe help somehow, but I don't think we ever actually ran an experiment to compare with masking speech only.
mm, my suspicion is the network will be ignoring those regions so it won't make a difference. We actually use attention masks (and recently, I am also masking in the convolution module) so those regions are going to be totally ignored, well once we apply the convolution mask.
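For context, a padding mask of the kind mentioned above is usually derived from the per-sequence frame counts. This is only a minimal sketch: plain Python lists stand in for tensors, and `make_pad_mask` is an illustrative helper written for this example, not code quoted from the recipes under discussion.

```python
def make_pad_mask(lengths, max_len=None):
    """Return one row per sequence; True marks a padded (ignored) frame."""
    if max_len is None:
        max_len = max(lengths)
    # Frame t of a sequence with n real frames is padding when t >= n.
    return [[t >= n for t in range(max_len)] for n in lengths]


# Two sequences in a batch: 5 real frames and 3 real frames.
mask = make_pad_mask([5, 3])
for row in mask:
    print(row)
```

A mask like this can be handed to attention (and, as described above, to the convolution module) so that padded frames do not contribute to the output.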
OK, thanks for the reply. So if I understand correctly, since the model ignores the "padded" regions, it makes sense to also ignore them during masking. Just to be clear about what I'm referring to, here is an example of the problem I describe.
Original Fbank from a batch with zero padding:
Same Fbank after SpecAugment:
As you can see, after time warping we end up with a frequency mask overlapping the zero-padded region, which makes the frequency masking useless in these cases. This inconsistency also produces a small bug in the current SpecAugment implementation. The line:
    _max_tot_mask_frames = self.max_frames_mask_fraction * features.size(0)
will produce the same maximum number of masked frames across the whole batch, because features.size(0) is the same for every padded sequence. This leads to an incorrect num_frame_masks and max_mask_frames calculation. All of the above can be fixed by replacing
    # Supervisions provided - we will apply time warping only on the supervised areas.
    for sequence_idx, start_frame, num_frames in supervision_segments:
        end_frame = start_frame + num_frames
        features[sequence_idx, start_frame:end_frame] = self._forward_single(
            features[sequence_idx, start_frame:end_frame], warp=True, mask=False
        )
    # ... and then time-mask the full feature matrices. Note that in this mode,
    # it might happen that masks are applied to different sequences/examples
    # than the time warping.
    for sequence_idx in range(features.size(0)):
        features[sequence_idx] = self._forward_single(
            features[sequence_idx], warp=False, mask=True
        )
with
    for sequence_idx, start_frame, num_frames in supervision_segments:
        end_frame = start_frame + num_frames
        features[sequence_idx, start_frame:end_frame] = self._forward_single(
            features[sequence_idx, start_frame:end_frame], warp=True, mask=True
        )
In this case you will end up with something like
where masking is calculated and applied correctly. @pzelasko @danpovey what do you think about this modification? It should probably be tested though, at least on LibriSpeech.
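To make the mask-budget point concrete, here is a small arithmetic sketch. The fraction and the lengths are made-up illustrative values, and `mask_budget` is a hypothetical stand-in for the `_max_tot_mask_frames` line quoted above, not the library code itself.

```python
# Illustrative values only: the fraction and the lengths are assumptions.
MAX_FRAMES_MASK_FRACTION = 0.15


def mask_budget(num_frames, fraction=MAX_FRAMES_MASK_FRACTION):
    """Upper bound on the total number of time-masked frames."""
    return fraction * num_frames


padded_len = 1000             # every sequence in the batch is padded to this length
true_lens = [1000, 700, 400]  # frames that actually contain speech

# Budget computed from the padded matrix: identical for all sequences.
budget_padded = [mask_budget(padded_len) for _ in true_lens]
# Budget computed per supervision segment: scales with the true length.
budget_segment = [mask_budget(n) for n in true_lens]

for n, bp, bs in zip(true_lens, budget_padded, budget_segment):
    print(f"true_len={n}: padded budget={bp:.0f}, segment budget={bs:.0f}")
```

With the padded length, the shortest sequence gets the same budget as the longest one; with per-segment lengths, the budget shrinks in proportion to the real speech.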
Sounds good to me! Would be great if you'd be able to run a proper comparison like you suggested.
I have a question about the current SpecAugment implementation. According to the lhotse code, time warping is applied only to the "true" feature regions and excludes the "padded" regions, but later on the masking step is applied to the whole feature matrices, including the "padded" regions.
For masking along the time axis it is fine, but for masking along the frequency axis we can end up masking "padded" regions for shorter segments in the batch. Is it intentional?
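As a rough illustration of the alternative, the sketch below applies a frequency mask only inside a supervised segment, so the zero padding is left untouched. It uses plain Python lists instead of tensors; `freq_mask_segment`, the mask value, and the sizes are all assumptions made for this example and are not taken from the lhotse implementation.

```python
import random


def freq_mask_segment(features, start_frame, num_frames, mask_size, mask_value, rng):
    """Mask a random band of feature bins, but only inside the given segment.

    features: list of frames, each a list of feature-bin values (modified in place).
    Returns the first masked bin and the width of the band.
    """
    num_bins = len(features[0])
    width = rng.randint(0, min(mask_size, num_bins))
    first_bin = rng.randint(0, num_bins - width)
    for t in range(start_frame, start_frame + num_frames):
        for f in range(first_bin, first_bin + width):
            features[t][f] = mask_value
    return first_bin, width


rng = random.Random(0)
# 6 frames x 8 bins; the first 4 frames are "speech", the last 2 are zero padding.
feats = [[1.0] * 8 for _ in range(4)] + [[0.0] * 8 for _ in range(2)]
first_bin, width = freq_mask_segment(
    feats, start_frame=0, num_frames=4, mask_size=3, mask_value=0.5, rng=rng
)

# The padded frames are still all zeros, whatever band was drawn.
print(feats[4], feats[5])
```

Masking the full matrix instead would write the mask value into the padded frames as well, which is the overlap described in this question.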