lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Possible SpecAugment issue #648

Open · videodanchik opened this issue 2 years ago

videodanchik commented 2 years ago

I have a question about the current SpecAugment implementation. According to the lhotse code, time warping is applied only to the "true" feature regions and excludes the "padded" regions, i.e.:

# Supervisions provided - we will apply time warping only on the supervised areas.
for sequence_idx, start_frame, num_frames in supervision_segments:
    end_frame = start_frame + num_frames
    features[sequence_idx, start_frame:end_frame] = self._forward_single(
        features[sequence_idx, start_frame:end_frame], warp=True, mask=False
    )
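
For reference, this is the layout of supervision_segments that the loop above unpacks (the numbers below are made up, just to illustrate; one row per supervision):

import torch

# Hypothetical example of the layout unpacked by the loop above:
# each row is (sequence_idx, start_frame, num_frames).
supervision_segments = torch.tensor(
    [
        [0, 0, 1000],  # sequence 0: supervised frames 0..999 (no padding)
        [1, 0, 400],   # sequence 1: supervised frames 0..399, the rest is padding
    ],
    dtype=torch.int64,
)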

but later on, the masking step is applied to the full feature matrices, including the "padded" regions:

# ... and then time-mask the full feature matrices. Note that in this mode,
# it might happen that masks are applied to different sequences/examples
# than the time warping.
for sequence_idx in range(features.size(0)):
    features[sequence_idx] = self._forward_single(
        features[sequence_idx], warp=False, mask=True
    )

For masking along the time axis this is fine, but for masking along the frequency axis we can end up masking the "padded" regions of the shorter segments in the batch. Is this intentional?
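
To sketch the effect I mean with a toy example (this is not Lhotse code, just an illustration; the sizes are made up):

import torch

# Toy batch: 2 sequences padded to 10 frames, 8 mel bins each.
features = torch.randn(2, 10, 8)
features[1, 4:] = 0.0  # sequence 1 has only 4 real frames; the rest is zero padding

# A frequency mask applied to the *whole* matrix, as in the second loop above,
# also covers the padded frames of sequence 1.
freq_start, freq_width = 2, 3
features[1, :, freq_start : freq_start + freq_width] = 0.0

# Over the padded frames (4..9) the mask has no effect: they were already zeros,
# so that part of the masking is wasted.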

pzelasko commented 2 years ago

Yeah, IIRC I left it like that because @danpovey was thinking it would either not make a difference or maybe help somehow, but I don't think we ever actually ran an experiment to compare it with masking speech only.

danpovey commented 2 years ago

Mm, my suspicion is that the network will be ignoring those regions, so it won't make a difference. We actually use attention masks (and recently I am also masking in the convolution module), so those regions are going to be totally ignored, well, once we apply the convolution mask.
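
A rough sketch of the kind of masking I mean (not the actual model code; the shapes and the exact place where the mask is applied are just for illustration):

import torch

# Rough sketch: with a padding mask, whatever SpecAugment did inside the
# padded frames gets zeroed out / ignored downstream anyway.
lengths = torch.tensor([10, 4])                   # true frame counts in the batch
max_len = int(lengths.max())
padding_mask = torch.arange(max_len)[None, :] >= lengths[:, None]  # (B, T), True = padding

x = torch.randn(2, max_len, 8)                      # encoder input (B, T, F)
x = x.masked_fill(padding_mask.unsqueeze(-1), 0.0)  # e.g. in the convolution module
# The attention layers also receive padding_mask, so padded positions are
# excluded from the attention computation.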

videodanchik commented 2 years ago

> Mm, my suspicion is that the network will be ignoring those regions, so it won't make a difference. We actually use attention masks (and recently I am also masking in the convolution module), so those regions are going to be totally ignored, well, once we apply the convolution mask.

Ok, thanks for the reply. So if I understand correctly, since the model ignores the "padded" regions, it makes sense to also ignore them during masking. Just to be clear about what I'm referring to, here is an example of the problem I describe.

Original Fbank from a batch with zero padding:

[image: original Fbank with zero-padded region]

Same Fbank after SpecAugment:

[image: Fbank after SpecAugment]

As you can see, after time warping we ended up with a frequency mask overlapping the zero-padded region, and in fact this leads to useless frequency masking in such cases. Moreover, this inconsistency also produces a small bug in the current SpecAugment implementation. The line:

_max_tot_mask_frames = self.max_frames_mask_fraction * features.size(0)

will produce the same maximum number of masked frames across the whole batch, because features.size(0) is the padded length and is therefore identical for every sequence. This leads to incorrect num_frame_masks and max_mask_frames calculations for the shorter sequences (see the toy calculation after the proposed change below). All of the above can be fixed by replacing

# Supervisions provided - we will apply time warping only on the supervised areas.
for sequence_idx, start_frame, num_frames in supervision_segments:
    end_frame = start_frame + num_frames
    features[sequence_idx, start_frame:end_frame] = self._forward_single(
        features[sequence_idx, start_frame:end_frame], warp=True, mask=False
    )
# ... and then time-mask the full feature matrices. Note that in this mode,
# it might happen that masks are applied to different sequences/examples
# than the time warping.
for sequence_idx in range(features.size(0)):
    features[sequence_idx] = self._forward_single(
        features[sequence_idx], warp=False, mask=True
    )

with

for sequence_idx, start_frame, num_frames in supervision_segments:
    end_frame = start_frame + num_frames
    features[sequence_idx, start_frame:end_frame] = self._forward_single(
        features[sequence_idx, start_frame:end_frame], warp=True, mask=True
    )
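
To make the budget point above concrete, here is a toy calculation (the fraction and lengths are made-up numbers, not Lhotse defaults). Today the budget is derived from the padded length; after the proposed change, _forward_single only sees the supervised frames, so the budget scales with each sequence's true length, and num_frame_masks / max_mask_frames then follow from it:

# Toy numbers, not Lhotse defaults.
max_frames_mask_fraction = 0.15

padded_len = 1000  # features.size(0) today: every sequence is padded to this length
true_len = 400     # supervised frames of a shorter sequence in the same batch

budget_now = max_frames_mask_fraction * padded_len     # 150.0 frames, for every sequence
budget_proposed = max_frames_mask_fraction * true_len  # 60.0 frames for the shorter one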

With this change applied, you end up with something like:

[image: Fbank after the modified SpecAugment]

where the masking is calculated and applied correctly. @pzelasko @danpovey, what do you think about this modification? It should probably be tested first, though, at least on LibriSpeech.

pzelasko commented 2 years ago

Sounds good to me! It would be great if you could run a proper comparison like you suggested.