Open shcxlee opened 1 year ago
Just to make sure, can you show the corresponding wav.scp line from Kaldi (based on what you wrote I assume you imported it from Kaldi). Also, can you load the audio file with torchaudio.load() and show the number of samples and the sampling rate?
I imported lhotse.kaldi, not Kaldi (from lhotse import kaldi).
When I load the audio file with torchaudio.load(), the sampling rate is 32000 and the number of samples is 138390189.
Your recording manifest says sampling_rate=16000 but it doesn't have a Resample transform to get there from 32000. How did you create the Recording manifest?
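To see why this shows up as audio that is almost twice as long as num_samples, here is a small arithmetic sketch using the numbers reported in this thread. The helper name expected_num_samples is hypothetical; it only mirrors the math.ceil(duration * 16000) formula from the original post.

```python
import math


def expected_num_samples(duration_s: float, sampling_rate: int) -> int:
    """Number of samples a manifest would declare for a given duration and rate."""
    return math.ceil(duration_s * sampling_rate)


# Values reported by torchaudio.load() in this thread.
actual_rate = 32000
actual_samples = 138390189

# Duration derived from the real file (about 4324.7 seconds).
duration_s = actual_samples / actual_rate

# What the manifest declares if the rate is hard-coded to 16000.
declared = expected_num_samples(duration_s, 16000)

# Loaded audio is about twice as long as the declared num_samples.
ratio = actual_samples / declared
print(declared, round(ratio, 3))
```

In other words, a manifest that hard-codes 16000 for a 32000 Hz file declares roughly half of the samples that are actually loaded, unless a Resample transform is attached.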
Originally I set it to 16000 by setting Recording.sampling_rate=16000, but now I fixed my code to follow the sampling rate of the audio file and it runs okay. However, I now get a new error saying that tensor sizes don't match:
Traceback (most recent call last):
  File "/dataset/./create_json.py", line 149, in <module>
    for cuts in annotate_with_whisper(rec, "en", "medium", "cuda:3"):
  File "/dataset/whisper_annotate.py", line 47, in annotate_with_whisper
    yield from _annotate_recordings(manifest, language, model_name, device)
  File "/dataset/whisper_annotate.py", line 72, in _annotate_recordings
    result = whisper.transcribe(model=model, audio=audio, language=language)
  File "/env/whisper/lib/python3.9/site-packages/whisper/transcribe.py", line 84, in transcribe
    mel = log_mel_spectrogram(audio)
  File "/env/whisper/lib/python3.9/site-packages/whisper/audio.py", line 119, in log_mel_spectrogram
    mel_spec = filters @ magnitudes
RuntimeError: Expected size for first two dimensions of batch2 tensor to be: [1, 201] but got: [1, 200].
Below is my create_json.py for creating RecordingSet, and whisper_annotate.py is the same python code as whisper that I cited in the main post above.
from whisper_annotate import annotate_with_whisper
import math
from xml.etree.ElementTree import parse
import os
import logging
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
from pprint import pprint
from dataclasses import asdict
import argparse
import torch
import torchaudio
from torch.utils.data import DataLoader
from lhotse import (
    Recording,
    AudioSource,
    RecordingSet,
    kaldi,
)


def prepare_recording(
    archive_root: str, name: str, output_dir: str
) -> RecordingSet:
    """
    Prepare manifests for the ASCEND corpus.
    The manifests are created in a dict with three splits: train, validation and test.
    Each split contains a RecordingSet and SupervisionSet in a dict under keys 'recordings' and 'supervisions'.
    :param archive_root: Path to the unpacked ASCEND data.
    :return: A dict with standard corpus splits containing the manifests.
    """
    archive_root = Path(archive_root+"/"+name)
    output_dir = Path(output_dir)
    # corpus = {}
    r_segments = [name]
    # for name in names:
    #     file_path = archive_root/f'{name}.mp3'
    #     r_segments.append(
    #         Recording(
    #             id=name,
    #             sources=[AudioSource(
    #                 type='file', source=file_path, channels=[0], )],
    #             sampling_rate=16000,
    #             num_samples=get_num_samples(file_path),
    #             duration=get_duration(file_path),
    #             transforms=None
    #         )
    #     )
    # r_segments = []
    # recordings = RecordingSet.from_recording(r_segments)
    # parsing xml
    tree = parse(f'{archive_root}/{name}_files.xml')
    #logging.info(tree)
    root = tree.getroot()
    audiofile = root.findall(".//*[@name='"+f'{name}.mp3'+"']/length")
    # Find sampling rate
    _, sample_rate = torchaudio.load(f'{archive_root}/{name}.mp3')
    duration = kaldi.get_duration(f'{archive_root}/{name}.mp3')
    sources = [AudioSource(type='file', source=f'{archive_root}/{name}.mp3', channels=[0])]
    length = sources[0].load_audio().size
    # Find sampling rate
    _, sample_rate = torchaudio.load(f'{archive_root}/{name}.mp3')
    recordings = RecordingSet.from_recordings(
        Recording(
            id=name,
            sources=sources,
            sampling_rate=sample_rate,
            num_samples=length,
            duration=duration,
            transforms=None
        )
        for name in r_segments
    )
    # recordings.to_file(output_dir / f"{name}_recordings.jsonl.gz")
    return recordings


if __name__ == '__main__':
    logger = logging.getLogger()
    logger.setLevel(logging.INFO)
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "id_file", help="path to the text file containing identifiers", type=str)
    parser.add_argument(
        "save_dir", help="path to the directory to save files", type=str)
    args = parser.parse_args()
    if args.id_file:
        ids_file_path = args.id_file
    if args.save_dir:
        save_directory = args.save_dir
    # download_data(all_ids, save_directory, 'audio', True)
    all_ids = os.listdir(ids_file_path)
    logging.info(all_ids)
    for id in all_ids:
        logging.info(f"Parsing {id}...")
        rec = prepare_recording(ids_file_path, id, save_directory)
        for cuts in annotate_with_whisper(rec, "en", "medium", "cuda:3"):
            cuts.to_file(save_directory / f"{id}.jsonl.gz")
https://github.com/lhotse-speech/lhotse/blob/6a8ce1364bb3e4abe8844a6b4e9875a6dcf45f61/lhotse/workflows/whisper.py#L47 https://github.com/lhotse-speech/lhotse/blob/6a8ce1364bb3e4abe8844a6b4e9875a6dcf45f61/lhotse/workflows/whisper.py#L72
RuntimeError: Expected size for first two dimensions of batch2 tensor to be: [1, 201] but got: [1, 200].
I think this issue can come up in feature extraction when you have a recording that's too short? You might want to filter super short ones and re-try.
BTW you can also create the recording manifests with one line: recordings = RecordingSet.from_dir(archive_root, "*.mp3", num_jobs=4)
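As a rough sketch of the "filter super short ones" suggestion, the helper below keeps only items whose duration attribute reaches a threshold. The name drop_short and the 1.0 second default are assumptions made for illustration, not values from Lhotse or Whisper; the function only relies on each item exposing a duration in seconds, as Lhotse Recording objects do.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

# Hypothetical threshold in seconds; tune it for your data.
MIN_DURATION_S = 1.0


def drop_short(items: Iterable[T], min_duration: float = MIN_DURATION_S) -> Iterator[T]:
    """Yield only the items whose `.duration` (in seconds) is at least `min_duration`."""
    for item in items:
        if item.duration >= min_duration:
            yield item


# With a Lhotse RecordingSet the same predicate could presumably be applied as
#   recordings = recordings.filter(lambda r: r.duration >= MIN_DURATION_S)
# assuming RecordingSet.filter(predicate) is available in your Lhotse version.
```

Filtering before calling annotate_with_whisper keeps very short recordings out of feature extraction, which is a cheap way to test whether they are the cause.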
Got the same problem. This is caused by torchaudio 0.13.0. The result using torchaudio.info is different from torchaudio.load:
>>> torchaudio.__version__
'0.13.0'
>>> path_or_fileobj = "raw.wav"
>>> test1 = torchaudio.info(path_or_fileobj)
>>> test2 = torchaudio.load(path_or_fileobj)
>>> print(test1.num_channels, test1.num_frames, test1.sample_rate, test1.bits_per_sample, test1.encoding)
1 144000 16000 16 PCM_S
>>> print(test2[0].shape, test2[1])
torch.Size([1, 42284]) 16000
Double check your data file: it's also possible that the RIFF header has incorrect metadata (unfortunately). But if the file is OK, then I recommend updating to torchaudio 2.0+ so Lhotse can leverage the ffmpeg backend that's likely free from this issue (you might need to set env var TORCHAUDIO_USE_BACKEND_DISPATCHER=1).
it's also possible that the RIFF header has incorrect metadata (unfortunately)
That's indeed what happened when I was dealing with the People's Speech dataset. It was just a couple of files tho, but since I couldn't avoid the exception being raised, I ended up having to load each file, count the actual number of samples, compare it with the header info, and then remove the utt from the cuts file in case of a huge mismatch. It took a while but it worked.
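The check described above (header frame count versus the number of frames that can actually be read) can be sketched as follows. This uses Python's standard wave module as a stand-in for torchaudio.info and torchaudio.load, so it only handles plain PCM WAV files; the function names and the 1% tolerance are assumptions for illustration.

```python
import wave
from typing import Tuple


def header_vs_actual_frames(path: str) -> Tuple[int, int]:
    """Return (frames declared in the WAV header, frames actually readable)."""
    with wave.open(path, "rb") as wav:
        header_frames = wav.getnframes()
        frame_size = wav.getsampwidth() * wav.getnchannels()
        # readframes() returns fewer bytes than requested if the data is short.
        data = wav.readframes(header_frames)
    return header_frames, len(data) // frame_size


def has_mismatch(path: str, tolerance: float = 0.01) -> bool:
    """True if header and actual frame counts differ by more than `tolerance` (relative)."""
    header, actual = header_vs_actual_frames(path)
    if header == 0:
        return actual != 0
    return abs(header - actual) / header > tolerance


# With torchaudio, the analogous comparison would be between
# torchaudio.info(path).num_frames and torchaudio.load(path)[0].shape[1].
```

Files for which has_mismatch returns True are the candidates to drop from the cuts file, or to re-encode so that the header matches the data.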
IMHO, if possible, it'd be nice to have an option to only warn the user about this possible mismatch in the header (or automatically ignore the utt while reading the cuts file) instead of raising ValueError :)
There is such an option: see collate_audio(..., fault_tolerant=True) (also in class AudioSamples). Or if you're loading individual audio files yourself, search for the suppress_audio_loading_errors context manager.
Hello,
I am trying to use annotate_with_whisper to generate transcriptions of my dataset. However, while running the code, I've encountered an issue: the number of samples of the Recording does not match the output size of audio. (It seems like the size of audio is almost twice the num_samples of the Recording.) https://github.com/lhotse-speech/lhotse/blob/6a8ce1364bb3e4abe8844a6b4e9875a6dcf45f61/lhotse/workflows/whisper.py#L71
1) Setting num_samples in Recording as math.ceil(duration*16000) (16000 = sample rate)
2) Setting num_samples in Recording as AudioSource.load_audio().size
The duration is calculated with lhotse.kaldi.get_duration(path). How can I fix this error?