lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

Augmentation with Recording length change #1164

Open rilshok opened 1 year ago

rilshok commented 1 year ago

I am running into a problem when applying a length-changing augmentation to a Recording. Specifically, the Speed augmentation, which is expected to change the number of samples in the audio, fails when the audio is loaded.

Steps to Reproduce

  1. Import Lhotse and necessary modules.

  2. Define an audio source and create a recording with specific attributes:

    from pathlib import Path
    
    from IPython.display import Audio
    
    import lhotse
    from lhotse.audio import Recording
    from lhotse.audio.source import AudioSource
    
    def play_record(record: Recording):
        # I use this because there is no play_audio method in Recording class
        return Audio(record.load_audio(), rate=record.sampling_rate)
    
    source = AudioSource(
        type="file",
        channels=[0],
        source=str(
            Path(lhotse.__file__).parent.parent
            / "test"
            / "fixtures"
            / "ljspeech"
            / "storage"
            / "LJ002-0020.wav"
        ),
    )
    record = Recording(
        id="LJ002-0020",
        sources=[source],
        sampling_rate=22050,
        num_samples=33949,
        duration=1.5396371882086168,
        transforms=None,
    )
    
    play_record(record)

  3. Apply the `ReverbWithImpulseResponse` augmentation to the Recording object. No problems occur; the augmentation works as expected:

    from lhotse.augmentation import ReverbWithImpulseResponse
    from lhotse.augmentation.utils import FastRandomRIRGenerator
    
    rir = FastRandomRIRGenerator()
    record.transforms = [
        ReverbWithImpulseResponse(rir_generator=rir).to_dict(),
    ]
    play_record(record)

  4. Apply the Speed transformation and catch an exception when trying to apply augmentation:

    from lhotse.augmentation import Speed
    
    record.transforms = [Speed(1.1).to_dict()]
    record.load_audio()

    ValueError: The number of declared samples in the recording diverged from the one obtained
    when loading audio (offset=0.0, duration=None). This could be internal Lhotse's error or
    a faulty transform implementation. Please report this issue in Lhotse and show the following:
     diff=3086,
     audio.shape=(1, 30863),
     recording=Recording(id='LJ002-0020',
     sources=[
        AudioSource(
            type='file',
            channels=[0],
            source='/.../lhotse/test/fixtures/ljspeech/storage/LJ002-0020.wav')],
            sampling_rate=22050,
            num_samples=33949,
            duration=1.5396371882086168,
            channel_ids=[0],
            transforms=[{'name': 'Speed', 'kwargs': {'factor': 1.1}}]
        )
    
    [extra info] When calling:
    Recording.load_audio(
        args=(
            Recording(id='LJ002-0020',
            sources=[AudioSource(type='file', channels=[0], source='/.../lhotse/test/fixtures/ljspeech/storage/LJ002-0020.wav')],
            sampling_rate=22050,
            num_samples=33949,
            duration=1.5396371882086168,
            channel_ids=[0],
            transforms=[{'name': 'Speed', 'kwargs': {'factor': 1.1}}]),
        )
        kwargs={}
    )

Expected Behavior

I expect the audio transformation to be applied successfully, altering the length of the recording as specified by the transformation parameters, and that I can play the transformed audio without errors.

Actual Behavior

I encounter the ValueError mentioned above when attempting to apply the "Speed" transformation or a custom transformation that alters the audio length.

Additional information

Am I applying augmentation to the Recording object correctly? I would like to be able to implement my own lazy augmentation by subclassing the AudioTransform class.

desh2608 commented 1 year ago

Quick note on play_record: you can do record.to_cut().play_audio().

desh2608 commented 1 year ago

The problem here is that you are adding a transform to the Recording which changes its duration and num_samples at load time, but you have not made the corresponding changes in the manifest. If you look at the implementation of perturb_speed here, you can see that it also updates num_samples and duration when it adds the Speed transform.
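As a rough check of the numbers in the error message, the sample count after a speed change can be estimated as the declared count divided by the factor. The sketch below is plain Python, not Lhotse code: `expected_num_samples` is a hypothetical helper, and the simple `round` is an assumption that may differ from Lhotse's exact rounding rule in edge cases.

```python
def expected_num_samples(num_samples: int, factor: float) -> int:
    # Hypothetical helper: speeding up by `factor` shortens the audio by the same ratio.
    return round(num_samples / factor)


declared = 33949        # num_samples stored in the manifest from the report
sampling_rate = 22050   # sampling rate from the report
factor = 1.1            # factor passed to Speed

new_num_samples = expected_num_samples(declared, factor)
new_duration = new_num_samples / sampling_rate
diff = declared - new_num_samples

print(new_num_samples, diff)  # 30863 3086
```

Both values match `audio.shape=(1, 30863)` and `diff=3086` in the report, which suggests the transform itself ran fine and only the manifest was stale. Calling `record.perturb_speed(1.1)` instead of assigning `record.transforms` directly should keep the two in sync.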

rilshok commented 1 year ago

It seems that a good solution would be to redesign the augmentation base class so that each AudioTransform subclass takes over the job of recalculating num_samples.

desh2608 commented 1 year ago

I don't think that's a good solution. The augmentation classes only work on the audio, not the associated metadata, and they should not modify the Recording object itself. That modification should be done from a member function of the Recording class.
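To illustrate the split being described, here is a minimal toy sketch, not Lhotse's actual implementation. `ToySpeed` and `ToyRecording` are hypothetical classes, and the nearest-neighbour resampling is only a stand-in for a real speed perturbation. The transform touches audio only, while the recording's own method updates the metadata and registers the transform together.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class ToySpeed:
    """Audio-only transform (hypothetical): knows nothing about manifests."""
    factor: float

    def __call__(self, samples: List[float]) -> List[float]:
        # Naive nearest-neighbour resampling, purely for illustration.
        n_out = round(len(samples) / self.factor)
        last = len(samples) - 1
        return [samples[min(int(i * self.factor), last)] for i in range(n_out)]


@dataclass(frozen=True)
class ToyRecording:
    """Manifest-like object (hypothetical): owns metadata and lazy transforms."""
    num_samples: int
    sampling_rate: int
    transforms: Tuple[ToySpeed, ...] = ()

    def perturb_speed(self, factor: float) -> "ToyRecording":
        # The recording, not the transform, keeps its metadata consistent.
        return replace(
            self,
            num_samples=round(self.num_samples / factor),
            transforms=self.transforms + (ToySpeed(factor),),
        )

    def load_audio(self, raw: List[float]) -> List[float]:
        audio = raw
        for transform in self.transforms:
            audio = transform(audio)
        if len(audio) != self.num_samples:
            raise ValueError(
                f"declared {self.num_samples} samples, loaded {len(audio)}"
            )
        return audio


raw = [0.0] * 33949
rec = ToyRecording(num_samples=33949, sampling_rate=22050)

# Going through the member function keeps metadata and audio in agreement.
faster = rec.perturb_speed(1.1)
audio = faster.load_audio(raw)

# Attaching the transform directly leaves num_samples stale, as in the report.
stale = replace(rec, transforms=(ToySpeed(1.1),))
```

In this sketch `faster.load_audio(raw)` succeeds, while `stale.load_audio(raw)` raises a `ValueError` for the same reason as the original report: the declared sample count no longer matches what the transform produces.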