The processing efficiency and sampling rate problem of OPUS files

yangb05 commented 1 year ago

I have recently been processing a large dataset of .wav and .opus files, and found that .wav files are processed nearly 6 times faster than .opus files, specifically when generating recordings and supervisions. After debugging, I found the difference is that .wav files are read with torchaudio while .opus files are read with ffmpeg. The read_opus function in lhotse/audio/backend.py is:

def read_opus(
    path: Pathlike,
    offset: Seconds = 0.0,
    duration: Optional[Seconds] = None,
    force_opus_sampling_rate: Optional[int] = None,
) -> Tuple[np.ndarray, int]:
    """
    Reads OPUS files either using torchaudio or ffmpeg.
    Torchaudio is faster, but if unavailable for some reason,
    we fallback to a slower ffmpeg-based implementation.

    :return: a tuple of audio samples and the sampling rate.
    """
    # TODO: Revisit using torchaudio backend for OPUS
    #       once it's more thoroughly benchmarked against ffmpeg
    #       and has a competitive I/O speed.
    #       See: https://github.com/pytorch/audio/issues/1994
    # try:
    #     return read_opus_torchaudio(
    #         path=path,
    #         offset=offset,
    #         duration=duration,
    #         force_opus_sampling_rate=force_opus_sampling_rate,
    #     )
    # except:
    return read_opus_ffmpeg(
        path=path,
        offset=offset,
        duration=duration,
        force_opus_sampling_rate=force_opus_sampling_rate,
    )

Although the TODO comment says the torchaudio backend is not yet competitive with ffmpeg in I/O speed, in my case torchaudio is faster. When I simply call read_opus_torchaudio in the code above, the speedup appears. My environment: pytorch 1.13 (ffmpeg and torchaudio details were attached as images).
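Since the maintainers and I see opposite results on different systems, a small timing harness may help others check which reader is faster on their machine. This is only a sketch: time_reader is a hypothetical helper, not part of lhotse, and it assumes each reader can be called with a single path argument.

```python
import time
from typing import Callable, Iterable


def time_reader(
    read_fn: Callable[[str], object],
    paths: Iterable[str],
    repeats: int = 3,
) -> float:
    """Return the best wall-clock time in seconds over `repeats` full passes.

    `read_fn` is any callable that takes a path, e.g. read_opus_ffmpeg or
    read_opus_torchaudio from the code above (assumed to be importable).
    """
    paths = list(paths)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for p in paths:
            read_fn(p)
        # Keep the fastest pass to reduce noise from disk caching and scheduling.
        best = min(best, time.perf_counter() - start)
    return best
```

Calling time_reader once per reader on the same list of .opus paths and comparing the two numbers should show whether the 6x gap reproduces on a given system.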

Also, there is another problem when using the read_opus_ffmpeg function:

def read_opus_ffmpeg(
    path: Pathlike,
    offset: Seconds = 0.0,
    duration: Optional[Seconds] = None,
    force_opus_sampling_rate: Optional[int] = None,
) -> Tuple[np.ndarray, int]:
    """
    Reads OPUS files using ffmpeg in a shell subprocess.
    Unlike audioread, correctly supports offsets and durations for reading short chunks.
    Optionally, we can force ffmpeg to resample to the true sampling rate (if we know it up-front).

    :return: a tuple of audio samples and the sampling rate.
    """
    # Construct the ffmpeg command depending on the arguments passed.
    cmd = "ffmpeg -threads 1"
    sampling_rate = 48000
    # Note: we have to add offset and duration options (-ss and -t) BEFORE specifying the input
    #       (-i), otherwise ffmpeg will decode everything and trim afterwards...
    if offset > 0:
        cmd += f" -ss {offset}"
    if duration is not None:
        cmd += f" -t {duration}"
    # Add the input specifier after offset and duration.
    cmd += f" -i {path}"
    # Optionally resample the output.
    if force_opus_sampling_rate is not None:
        cmd += f" -ar {force_opus_sampling_rate}"
        sampling_rate = force_opus_sampling_rate
    # Read audio samples directly as float32.
    cmd += " -f f32le -threads 1 pipe:1"
    # Actual audio reading.
    proc = run(cmd, shell=True, stdout=PIPE, stderr=PIPE)
    raw_audio = proc.stdout
    audio = np.frombuffer(raw_audio, dtype=np.float32)
    # Determine if the recording is mono or stereo and decode accordingly.
    try:
        channel_string = parse_channel_from_ffmpeg_output(proc.stderr)
        if channel_string == "stereo":
            new_audio = np.empty((2, audio.shape[0] // 2), dtype=np.float32)
            new_audio[0, :] = audio[::2]
            new_audio[1, :] = audio[1::2]
            audio = new_audio
        elif channel_string == "mono":
            audio = audio.reshape(1, -1)
        else:
            raise NotImplementedError(
                f"Unknown channel description from ffmpeg: {channel_string}"
            )
    except ValueError as e:
        raise AudioLoadingError(
            f"{e}\nThe ffmpeg command for which the program failed is: '{cmd}', error code: {proc.returncode}"
        )
    return audio, sampling_rate

It assumes that all .opus files have a sampling rate of 48000. That is a problem for less typical datasets; in my case, for example, the rate can be 16000. If force_opus_sampling_rate is not specified, the recorded sampling_rate will be 48000 while the file is actually read at 16000, which affects the subsequent computation of num_samples and features. I think always adding '-ar {sampling_rate}' to cmd will solve the problem, for example:

def read_opus_ffmpeg(
    path: Pathlike,
    offset: Seconds = 0.0,
    duration: Optional[Seconds] = None,
    force_opus_sampling_rate: Optional[int] = None,
) -> Tuple[np.ndarray, int]:
    """
    Reads OPUS files using ffmpeg in a shell subprocess.
    Unlike audioread, correctly supports offsets and durations for reading short chunks.
    Optionally, we can force ffmpeg to resample to the true sampling rate (if we know it up-front).

    :return: a tuple of audio samples and the sampling rate.
    """
    # Construct the ffmpeg command depending on the arguments passed.
    cmd = "ffmpeg -threads 1"
    sampling_rate = 48000
    # Note: we have to add offset and duration options (-ss and -t) BEFORE specifying the input
    #       (-i), otherwise ffmpeg will decode everything and trim afterwards...
    if offset > 0:
        cmd += f" -ss {offset}"
    if duration is not None:
        cmd += f" -t {duration}"
    # Add the input specifier after offset and duration.
    cmd += f" -i {path}"
    # Optionally resample the output.
    if force_opus_sampling_rate is not None:
        sampling_rate = force_opus_sampling_rate
    cmd += f" -ar {sampling_rate}"
    # Read audio samples directly as float32.
    cmd += " -f f32le -threads 1 pipe:1"
    # Actual audio reading.
    proc = run(cmd, shell=True, stdout=PIPE, stderr=PIPE)
    raw_audio = proc.stdout
    audio = np.frombuffer(raw_audio, dtype=np.float32)
    # Determine if the recording is mono or stereo and decode accordingly.
    try:
        channel_string = parse_channel_from_ffmpeg_output(proc.stderr)
        if channel_string == "stereo":
            new_audio = np.empty((2, audio.shape[0] // 2), dtype=np.float32)
            new_audio[0, :] = audio[::2]
            new_audio[1, :] = audio[1::2]
            audio = new_audio
        elif channel_string == "mono":
            audio = audio.reshape(1, -1)
        else:
            raise NotImplementedError(
                f"Unknown channel description from ffmpeg: {channel_string}"
            )
    except ValueError as e:
        raise AudioLoadingError(
            f"{e}\nThe ffmpeg command for which the program failed is: '{cmd}', error code: {proc.returncode}"
        )
    return audio, sampling_rate
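The command-building part of this proposal can be isolated into a pure function, which makes it easy to check that -ar is always present. The sketch below is illustrative only: build_opus_ffmpeg_args is a hypothetical name, and it returns an argument list rather than a shell string, which is a change from the code above.

```python
from typing import List, Optional, Tuple


def build_opus_ffmpeg_args(
    path: str,
    offset: float = 0.0,
    duration: Optional[float] = None,
    force_opus_sampling_rate: Optional[int] = None,
) -> Tuple[List[str], int]:
    """Build ffmpeg arguments so the output rate always matches the returned rate."""
    # Default to 48000 as in the code above, unless the caller forces a rate.
    sampling_rate = 48000 if force_opus_sampling_rate is None else force_opus_sampling_rate
    args = ["ffmpeg", "-threads", "1"]
    # -ss and -t go before -i so that they apply to the input.
    if offset > 0:
        args += ["-ss", str(offset)]
    if duration is not None:
        args += ["-t", str(duration)]
    args += ["-i", str(path)]
    # Always request an explicit output rate, not only when one is forced.
    args += ["-ar", str(sampling_rate)]
    # Raw float32 little-endian samples on stdout.
    args += ["-f", "f32le", "-threads", "1", "pipe:1"]
    return args, sampling_rate
```

Passing the list to subprocess.run without shell=True also avoids quoting problems for paths that contain spaces.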

pzelasko commented 1 year ago

Hmm, I remember disabling it because I found the reverse to be true on some systems. I think the best way forward would be to expose the control over this to the user. I'll aim to make a PR to enable this later; I was recently refactoring some of this code, so it should be easily doable.
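One possible shape for such user control is a small name-to-reader registry. Everything in this sketch is hypothetical (register_opus_reader, read_opus_with, and the backend names are made up for illustration) and it does not reflect how lhotse actually exposes this.

```python
from typing import Any, Callable, Dict

# Hypothetical registry mapping a backend name to a reader callable.
_OPUS_READERS: Dict[str, Callable[..., Any]] = {}


def register_opus_reader(name: str, fn: Callable[..., Any]) -> None:
    """Register a reader (e.g. an ffmpeg- or torchaudio-based one) under a name."""
    _OPUS_READERS[name] = fn


def read_opus_with(backend: str, path: str, **kwargs: Any) -> Any:
    """Dispatch to the reader chosen by the user, failing loudly on unknown names."""
    try:
        reader = _OPUS_READERS[backend]
    except KeyError:
        raise ValueError(
            f"Unknown OPUS backend '{backend}'; available: {sorted(_OPUS_READERS)}"
        ) from None
    return reader(path, **kwargs)
```

With both readers registered, a user could pick whichever one is faster on their system without editing the library code.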

pzelasko commented 1 year ago

Regarding 48kHz vs 16kHz, I'm not sure I got your point. OPUS is always decoded to 48kHz even if the original audio had smaller sampling rate, unless I missed something.

yangb05 commented 1 year ago

> Regarding 48kHz vs 16kHz, I'm not sure I got your point. OPUS is always decoded to 48kHz even if the original audio had smaller sampling rate, unless I missed something.

For example, I have a .opus file in my dataset. If I use torchaudio.info() to get its sampling rate, it shows 16kHz. If I read it with ffmpeg, the reported input sampling rate is also 16kHz. If force_opus_sampling_rate is not passed to read_opus_ffmpeg, the samples are read at 16kHz (the actual rate) while the recording stores a sampling rate of 48kHz (the default). Suppose read_opus_ffmpeg reads 30,000 samples from this .opus file and the recorded sampling rate is 48kHz. When I then resample it to 16kHz in the cut set, the recorded number of samples is reduced from 30,000 to 10,000. Now:

The recorded info: {sampling rate: 16kHz, num_samples: 10000}
The actual info: {sampling rate: 16kHz, num_samples: 30000}

It will cause a mismatch in the subsequent computations.
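The arithmetic behind this mismatch can be written out as below. This is a sketch with the numbers from my example; resampled_num_samples is a hypothetical helper that scales the sample count by the rate ratio, and the library's own rounding may differ.

```python
def resampled_num_samples(num_samples: int, old_sr: int, new_sr: int) -> int:
    """Scale a sample count by the ratio of the new to the old sampling rate."""
    return round(num_samples * new_sr / old_sr)


decoded_samples = 30_000  # samples actually returned by the reader
recorded_sr = 48_000      # rate stored in the recording (the hard-coded default)
true_sr = 16_000          # rate the samples were really decoded at

# What the manifest claims after "resampling" from 48kHz to 16kHz.
recorded_after = resampled_num_samples(decoded_samples, recorded_sr, true_sr)
# What the audio array really contains: it was at 16kHz all along.
actual_after = resampled_num_samples(decoded_samples, true_sr, true_sr)

# The stored duration is also off by the same factor of 3.
recorded_duration = decoded_samples / recorded_sr  # seconds, per the manifest
actual_duration = decoded_samples / true_sr        # seconds, in reality
```

Here recorded_after is 10000 while actual_after is 30000, and the manifest duration is 0.625 s against a real duration of 1.875 s.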

pzelasko commented 1 year ago

If the file has 16kHz, that makes sense. I just never encountered an OPUS file that actually has a sampling rate other than 48kHz, even when I encoded WAV data into OPUS that had a smaller SR...

I think your proposed changes make sense, could you make a PR?

yangb05 commented 1 year ago

OK.

lhotse-speech / lhotse

The processing efficiency and sampling rate problem of OPUS files #1149