Augmentation with Recording length change

rilshok commented 1 year ago

I am encountering an issue when applying a length-changing augmentation to a Recording. Specifically, I'm having trouble with the Speed augmentation, which is expected to change the number of samples in the audio.

Steps to Reproduce

Import Lhotse and necessary modules.

Define an audio source and create a recording with specific attributes:

from pathlib import Path

from IPython.display import Audio

import lhotse
from lhotse.audio import Recording
from lhotse.audio.source import AudioSource

def play_record(record: Recording):
    # I use this because there is no play_audio method in Recording class
    return Audio(record.load_audio(), rate=record.sampling_rate)

source = AudioSource(
    type="file",
    channels=[0],
    source=str(
        Path(lhotse.__file__).parent.parent
        / "test"
        / "fixtures"
        / "ljspeech"
        / "storage"
        / "LJ002-0020.wav"
    ),
)
record = Recording(
    id="LJ002-0020",
    sources=[source],
    sampling_rate=22050,
    num_samples=33949,
    duration=1.5396371882086168,
    transforms=None,
)

play_record(record)

Apply the `ReverbWithImpulseResponse` augmentation to the Recording object. No problems occur; the augmentation works as expected:

from lhotse.augmentation import ReverbWithImpulseResponse
from lhotse.augmentation.utils import FastRandomRIRGenerator

rir = FastRandomRIRGenerator()
record.transforms = [
    ReverbWithImpulseResponse(rir_generator=rir).to_dict(),
]
play_record(record)

Apply the Speed transformation; loading the audio then raises an exception:

from lhotse.augmentation import Speed

record.transforms = [Speed(1.1).to_dict()]
record.load_audio()

ValueError: The number of declared samples in the recording diverged from the one obtained
when loading audio (offset=0.0, duration=None). This could be internal Lhotse's error or
a faulty transform implementation. Please report this issue in Lhotse and show the following:
 diff=3086,
 audio.shape=(1, 30863),
 recording=Recording(id='LJ002-0020',
 sources=[
    AudioSource(
        type='file',
        channels=[0],
        source='/.../lhotse/test/fixtures/ljspeech/storage/LJ002-0020.wav')],
        sampling_rate=22050,
        num_samples=33949,
        duration=1.5396371882086168,
        channel_ids=[0],
        transforms=[{'name': 'Speed', 'kwargs': {'factor': 1.1}}]
    )

[extra info] When calling:
Recording.load_audio(
    args=(
        Recording(id='LJ002-0020',
        sources=[AudioSource(type='file', channels=[0], source='/.../lhotse/test/fixtures/ljspeech/storage/LJ002-0020.wav')],
        sampling_rate=22050,
        num_samples=33949,
        duration=1.5396371882086168,
        channel_ids=[0],
        transforms=[{'name': 'Speed', 'kwargs': {'factor': 1.1}}]),
    )
    kwargs={}
)

Expected Behavior

I expect the audio transformation to be applied successfully, altering the length of the recording as specified by the transformation parameters, and that I can play the transformed audio without errors.

Actual Behavior

I encounter the ValueError mentioned above when attempting to apply the "Speed" transformation or a custom transformation that alters the audio length.

Additional information

There is no problem when using length-preserving transforms such as Volume.
The problem also arises when implementing a custom augmentation by inheriting from the AudioTransform class.

Am I applying augmentation to the Recording object correctly? I would like to be able to implement my own lazy augmentation by inheriting from the AudioTransform class.

desh2608 commented 1 year ago

Quick note on play_record: you can do record.to_cut().play_audio().

desh2608 commented 1 year ago

The problem here is that you are adding a transform to the Recording which changes its duration and num_samples at the time of loading, but you have not made these changes in the manifest. If you look at the implementation of Recording.perturb_speed, you can see that we also update num_samples and duration when using the Speed transform.
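
The numbers in the error message are consistent with this explanation. Below is a minimal sketch of the arithmetic, assuming a speed factor f shortens the audio to roughly num_samples / f samples. The helper `expected_num_samples` is hypothetical and uses plain rounding, which may differ from the exact rounding rule Lhotse applies internally:

```python
def expected_num_samples(num_samples: int, factor: float) -> int:
    """Hypothetical helper: approximate sample count after a speed change.

    Assumes speeding up by `factor` divides the length by `factor`.
    """
    return round(num_samples / factor)


declared = 33949  # num_samples stored in the manifest from the report
loaded = expected_num_samples(declared, 1.1)  # length after Speed(1.1)
diff = declared - loaded  # mismatch reported by load_audio()

print(loaded, diff)  # 30863 3086
```

Both values match `audio.shape=(1, 30863)` and `diff=3086` in the traceback: the audio was transformed, but the manifest still declares the original 33949 samples. As the comment above suggests, calling `record.perturb_speed(1.1)` rather than assigning `record.transforms` directly should keep the two in sync.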

rilshok commented 1 year ago

It seems that a good solution would be to redesign the augmentation base class so that recalculating num_samples is handled by the AudioTransform subclass.

desh2608 commented 1 year ago

I don't think that's a good solution. The augmentation classes only work on the audio, not the associated metadata, and they should not modify the Recording object itself. That modification should be done from a member function of the Recording class.
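
The separation described here can be illustrated with a toy sketch. `ToySpeed` and `ToyRecording` are hypothetical stand-ins written for this example, not Lhotse classes; the nearest-neighbour resampling and the plain rounding are simplifying assumptions:

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class ToySpeed:
    """Audio-only transform: it sees samples and knows nothing about manifests."""

    factor: float

    def __call__(self, samples: List[float]) -> List[float]:
        n_out = round(len(samples) / self.factor)
        last = len(samples) - 1
        # Crude nearest-neighbour resampling, enough to change the length.
        return [samples[min(int(i * self.factor), last)] for i in range(n_out)]


@dataclass(frozen=True)
class ToyRecording:
    """Manifest-like object: metadata plus a list of lazy transforms."""

    samples: List[float]  # stands in for the audio stored on disk
    sampling_rate: int
    num_samples: int
    transforms: Tuple[ToySpeed, ...] = ()

    def perturb_speed(self, factor: float) -> "ToyRecording":
        # The member function registers the transform AND updates the metadata.
        return replace(
            self,
            num_samples=round(self.num_samples / factor),
            transforms=self.transforms + (ToySpeed(factor),),
        )

    def load_audio(self) -> List[float]:
        audio = list(self.samples)
        for transform in self.transforms:
            audio = transform(audio)
        if len(audio) != self.num_samples:
            raise ValueError(f"declared {self.num_samples}, loaded {len(audio)}")
        return audio


rec = ToyRecording(samples=[0.0] * 33949, sampling_rate=22050, num_samples=33949)

# Metadata and transform stay consistent when going through the member function.
fast = rec.perturb_speed(1.1)
assert len(fast.load_audio()) == fast.num_samples

# Attaching the transform directly reproduces the mismatch from the report.
broken = replace(rec, transforms=(ToySpeed(1.1),))
try:
    broken.load_audio()
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

In this arrangement the transform stays a pure function of the audio, and the only place that has to know how a transform affects num_samples is the method on the recording that attaches it.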
