lhotse-speech / lhotse

Tools for handling speech data in machine learning projects.
https://lhotse.readthedocs.io/en/latest/
Apache License 2.0

The number of declared samples in the recording diverged from the one obtained when loading audio #886

Open shcxlee opened 1 year ago

shcxlee commented 1 year ago

Hello,

I am trying to use annotate_with_whisper to generate transcriptions for my dataset. However, while running the code, I've encountered an issue where the number of samples declared in the Recording does not match the size of the loaded audio (the audio size is almost exactly twice the num_samples of the Recording): https://github.com/lhotse-speech/lhotse/blob/6a8ce1364bb3e4abe8844a6b4e9875a6dcf45f61/lhotse/workflows/whisper.py#L71

I have tried the following two approaches:

1) Setting num_samples in the Recording to math.ceil(duration * 16000) (where 16000 is the sample rate)

ValueError: The number of declared samples in the recording diverged from the one obtained when loading audio (offset=0.0, duration=None). This could be internal Lhotse's error or a faulty transform implementation. Please report this issue in Lhotse and show the following: diff=-69195094, audio.shape=(1, 138390189), recording=Recording(id='COL509', sources=[AudioSource(type='file', channels=[0], source='LGBTQ_test/COL509/COL509.mp3')], sampling_rate=16000, num_samples=69195095, duration=4324.69340625, transforms=None)
[extra info] When calling: Recording.load_audio(args=(Recording(id='COL509', sources=[AudioSource(type='file', channels=[0], source='LGBTQ_test/COL509/COL509.mp3')], sampling_rate=16000, num_samples=69195095, duration=4324.69340625, transforms=None),) kwargs={})

2) Setting num_samples in the Recording to AudioSource.load_audio().size

ValueError: The number of declared samples in the recording diverged from the one obtained when loading audio (offset=0.0, duration=None). This could be internal Lhotse's error or a faulty transform implementation. Please report this issue in Lhotse and show the following: diff=-69195094, audio.shape=(1, 138390189), recording=Recording(id='COL509', sources=[AudioSource(type='file', channels=[0], source='LGBTQ_test/COL509/COL509.mp3')], sampling_rate=16000, num_samples=138390189, duration=4324.69340625, transforms=None)
[extra info] When calling: Recording.load_audio(args=(Recording(id='COL509', sources=[AudioSource(type='file', channels=[0], source='LGBTQ_test/COL509/COL509.mp3')], sampling_rate=16000, num_samples=138390189, duration=4324.69340625, transforms=None),) kwargs={})

The duration is calculated with lhotse.kaldi.get_duration(path). How can I fix this error?

pzelasko commented 1 year ago

Just to make sure: can you show the corresponding wav.scp line from Kaldi? (Based on what you wrote, I assume you imported it from Kaldi.)

Also, can you load the audio file with torchaudio.load() and show the number of samples and the sampling rate?
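
For example, a quick check along these lines (the file path is taken from the error message above):

    import torchaudio

    waveform, sample_rate = torchaudio.load("LGBTQ_test/COL509/COL509.mp3")
    print(waveform.shape, sample_rate)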

shcxlee commented 1 year ago

I imported lhotse.kaldi, not Kaldi (from lhotse import kaldi).

When I load the audio file with torchaudio.load(), the sampling rate is 32000 and the number of samples is 138390189.

pzelasko commented 1 year ago

Your recording manifest says sampling_rate=16000, but it doesn't have a Resample transform to get there from 32000. How did you create the Recording manifest?
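
If you do want a 16 kHz manifest, a minimal sketch (assuming the file really is 32 kHz) is to declare the true rate and attach the transform explicitly:

    rec = Recording.from_file("LGBTQ_test/COL509/COL509.mp3")  # picks up the true 32 kHz metadata
    rec = rec.resample(16000)  # attaches a Resample transform, so load_audio() yields 16 kHz samples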

shcxlee commented 1 year ago

Originally I set it to 16000 via Recording.sampling_rate=16000, but I've now fixed my code to follow the sampling rate of the audio file, and it runs okay. However, I now get a new error about mismatched tensor sizes:

Traceback (most recent call last):
  File "/dataset/./create_json.py", line 149, in <module>
    for cuts in annotate_with_whisper(rec, "en", "medium", "cuda:3"):
  File "/dataset/whisper_annotate.py", line 47, in annotate_with_whisper
    yield from _annotate_recordings(manifest, language, model_name, device)
  File "/dataset/whisper_annotate.py", line 72, in _annotate_recordings
    result = whisper.transcribe(model=model, audio=audio, language=language)
  File "/env/whisper/lib/python3.9/site-packages/whisper/transcribe.py", line 84, in transcribe
    mel = log_mel_spectrogram(audio)
  File "/env/whisper/lib/python3.9/site-packages/whisper/audio.py", line 119, in log_mel_spectrogram
    mel_spec = filters @ magnitudes
RuntimeError: Expected size for first two dimensions of batch2 tensor to be: [1, 201] but got: [1, 200].

Below is my create_json.py for creating the RecordingSet; whisper_annotate.py is the same Python code as the Whisper workflow I cited in the main post above.

from whisper_annotate import annotate_with_whisper

import argparse
import logging
import os
from pathlib import Path
from xml.etree.ElementTree import parse

import torchaudio

from lhotse import (
    Recording,
    AudioSource,
    RecordingSet,
    kaldi,
)

def prepare_recording(
    archive_root: str, name: str, output_dir: str
) -> RecordingSet:
    """
    Prepare a RecordingSet for a single recording identified by ``name``.

    :param archive_root: Path to the directory that contains one folder per recording.
    :param name: Identifier of the recording; also the folder name and the audio file stem.
    :param output_dir: Path to the directory where manifests can be written.
    :return: A RecordingSet containing a single Recording.
    """
    archive_root = Path(archive_root) / name
    output_dir = Path(output_dir)
    r_segments = [name]

    # Parse the XML metadata that accompanies the audio file.
    tree = parse(f'{archive_root}/{name}_files.xml')
    root = tree.getroot()
    # Length metadata declared in the XML (currently unused).
    audiofile = root.findall(".//*[@name='" + f'{name}.mp3' + "']/length")

    # Use the actual sampling rate of the audio file instead of a hard-coded 16000.
    _, sample_rate = torchaudio.load(f'{archive_root}/{name}.mp3')
    duration = kaldi.get_duration(f'{archive_root}/{name}.mp3')
    sources = [AudioSource(type='file', source=f'{archive_root}/{name}.mp3', channels=[0])]
    # Count the samples actually present in the decoded (single-channel) audio.
    length = sources[0].load_audio().size

    recordings = RecordingSet.from_recordings(
        Recording(
            id=name,
            sources=sources,
            sampling_rate=sample_rate,
            num_samples=length,
            duration=duration,
            transforms=None,
        )
        for name in r_segments
    )

    # recordings.to_file(output_dir / f"{name}_recordings.jsonl.gz")

    return recordings

if __name__ == '__main__':
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "id_file", help="path to the directory containing one folder per recording id", type=str)
    parser.add_argument(
        "save_dir", help="path to the directory to save files", type=str)
    args = parser.parse_args()

    ids_file_path = args.id_file
    # Keep this a Path so the `/` operator below works.
    save_directory = Path(args.save_dir)

    all_ids = os.listdir(ids_file_path)
    logging.info(all_ids)
    for id in all_ids:
        logging.info(f"Parsing {id}...")
        rec = prepare_recording(ids_file_path, id, save_directory)
        for cuts in annotate_with_whisper(rec, "en", "medium", "cuda:3"):
            cuts.to_file(save_directory / f"{id}.jsonl.gz")

https://github.com/lhotse-speech/lhotse/blob/6a8ce1364bb3e4abe8844a6b4e9875a6dcf45f61/lhotse/workflows/whisper.py#L47
https://github.com/lhotse-speech/lhotse/blob/6a8ce1364bb3e4abe8844a6b4e9875a6dcf45f61/lhotse/workflows/whisper.py#L72

pzelasko commented 1 year ago

RuntimeError: Expected size for first two dimensions of batch2 tensor to be: [1, 201] but got: [1, 200].

I think this issue can come up in feature extraction when you have a recording that's too short. You might want to filter out the very short ones and retry.
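
A minimal sketch of such a filter, assuming a RecordingSet rec as in the script above and an arbitrary 0.5 s threshold:

    MIN_DURATION = 0.5  # seconds; arbitrary cutoff, tune for your data
    rec = rec.filter(lambda r: r.duration >= MIN_DURATION)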

pzelasko commented 1 year ago

BTW, you can also create the recording manifests with one line:

    recordings = RecordingSet.from_dir(archive_root, "*.mp3", num_jobs=4)

Vanka0051 commented 11 months ago

Got the same problem. This is caused by torchaudio 0.13.0: the result from torchaudio.info differs from the result of torchaudio.load:

>>> torchaudio.__version__
'0.13.0'
>>> path_or_fileobj = "raw.wav"
>>> test1 = torchaudio.info(path_or_fileobj)
>>> test2 = torchaudio.load(path_or_fileobj)
>>> print(test1.num_channels, test1.num_frames, test1.sample_rate, test1.bits_per_sample, test1.encoding)
1 144000 16000 16 PCM_S
>>> print(test2[0].shape, test2[1])
torch.Size([1, 42284]) 16000

pzelasko commented 11 months ago

Double-check your data file: it's also possible that the RIFF header has incorrect metadata (unfortunately). But if the file is OK, then I recommend updating to torchaudio 2.0+ so that Lhotse can leverage the ffmpeg backend, which is likely free from this issue (you may need to set the env var TORCHAUDIO_USE_BACKEND_DISPATCHER=1).
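
A minimal sketch of opting into the dispatcher from Python (the variable must be set before torchaudio is imported):

    import os

    # Set before importing torchaudio/lhotse so the ffmpeg backend can be picked up.
    os.environ["TORCHAUDIO_USE_BACKEND_DISPATCHER"] = "1"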

cassiotbatista commented 9 months ago

it's also possible that the RIFF header has incorrect metadata (unfortunately)

That's indeed what happened when I was dealing with the People's Speech dataset. It was just a couple of files, but since I couldn't avoid the exception being raised, I ended up having to load each file, count the actual number of samples, compare it with the header info, and then remove the utterance from the cuts file in case of a large mismatch. It took a while, but it worked.
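
A minimal sketch of that check, assuming torchaudio and a hypothetical tolerance of zero samples:

    import torchaudio

    def header_mismatch(path: str, tolerance: int = 0) -> bool:
        """Return True when the header's frame count disagrees with the decoded audio."""
        declared = torchaudio.info(path).num_frames
        decoded = torchaudio.load(path)[0].shape[1]
        return abs(declared - decoded) > tolerance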

IMHO, if possible, it'd be nice to have an option to only warn the user about this possible mismatch in the header (or to automatically ignore the utterance while reading the cuts file) instead of raising a ValueError :)

pzelasko commented 9 months ago

There is such an option -- see collate_audio(..., fault_tolerant=True) (also available in the AudioSamples class). Or, if you're loading individual audio files yourself, look for the suppress_audio_loading_errors context manager.
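
A minimal usage sketch of both, assuming a CutSet named cuts and a Recording named recording; the exact return values of collate_audio(..., fault_tolerant=True) may vary across Lhotse versions:

    from lhotse.dataset.collation import collate_audio
    from lhotse.audio import suppress_audio_loading_errors

    # Batch loading that skips cuts whose audio fails to load;
    # the returned cuts contain only the successfully loaded ones.
    audio, audio_lens, ok_cuts = collate_audio(cuts, fault_tolerant=True)

    # Loading an individual recording while suppressing audio loading errors.
    with suppress_audio_loading_errors():
        samples = recording.load_audio()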