jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

Memory leak #410

Open · maxlund opened 2 hours ago

maxlund commented 2 hours ago

Hi,

First off, thank you for this great implementation, really good stuff!

When using the newest stable-ts version on Windows to run the large-v3-turbo model, I think there might be a memory leak of some sort when transcribing longer (1h+) audio; the RAM (not VRAM) usage goes way up:

[screenshot: RAM usage of the transcribing process climbing over time]

RAM usage seems to be steadily increasing until we eventually get an OOM error:

Exception occurred: [enforce fail at alloc_cpu.cpp:114] data. DefaultCPUAllocator: not enough memory: you tried to allocate 1921792 bytes.
Traceback (most recent call last):
  File "stable_whisper\whisper_word_level\original_whisper.py", line 1437, in transcribe_stable
  File "stable_whisper\audio\__init__.py", line 373, in next_chunk
  File "stable_whisper\audio\__init__.py", line 341, in _read_append_to_buffer
RuntimeError: [enforce fail at alloc_cpu.cpp:114] data. DefaultCPUAllocator: not enough memory: you tried to allocate 1921792 bytes
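
If it helps with reproducing, RAM usage of the transcribing process can be logged alongside the run, e.g. with psutil (an extra dependency, not part of stable-ts; the 30-second interval is arbitrary):

import threading
import time

import psutil

def log_rss(interval_s=30.0, stop_event=None):
    # Print the resident set size (RAM, not VRAM) of this process periodically.
    proc = psutil.Process()
    while stop_event is None or not stop_event.is_set():
        rss_mb = proc.memory_info().rss / (1024 ** 2)
        print(f"RSS: {rss_mb:.1f} MiB")
        time.sleep(interval_s)

# Start the logger in the background, then run model.transcribe(...) as usual.
stop = threading.Event()
threading.Thread(target=log_rss, args=(30.0, stop), daemon=True).start()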

I uploaded the audio file (runtime 02:27:47) that caused the error above here.

We also have a very long audio file uploaded here (10h+ long, mostly silence), which you could perhaps use if the file above does not reproduce the issue.

We have been using your library for a while and didn't observe any of these issues before switching over to the large-v3-turbo model and the latest version of the library. Any ideas?

Thanks again for all your fantastic work here!

maxlund commented 2 hours ago

Here is a minimal example to reproduce the issue:

import stable_whisper
import torch

model_path = "/path/to/large-v3-turbo.pt"
audio_paths = [
    "/path/to/mozart-of-gen-z-interview.mp3",
    "/path/to/long-audio.mp3",
]

model = stable_whisper.load_model(model_path, device=torch.device('cuda'))

segments_and_start_times = []
for audio_path in audio_paths:
    # RAM usage climbs steadily during this call on the longer files.
    whisper_result = model.transcribe(audio=audio_path, vad=True, language="english", verbose=False)
    for res in whisper_result.segments:
        # Collect (start, text, end) for each segment.
        segments_and_start_times.append([res.start, res.text, res.end])

print(segments_and_start_times)
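
For completeness, a variant of the loop that keeps only plain values and forces garbage collection between files (just a guess on my part; it would not address growth within a single transcribe() call, which is where the traceback above points):

import gc

import stable_whisper
import torch

model = stable_whisper.load_model("/path/to/large-v3-turbo.pt", device=torch.device('cuda'))

segments_and_start_times = []
for audio_path in ["/path/to/mozart-of-gen-z-interview.mp3", "/path/to/long-audio.mp3"]:
    whisper_result = model.transcribe(audio=audio_path, vad=True, language="english", verbose=False)
    # Copy out plain values, then drop the result object before the next file.
    segments_and_start_times.extend([seg.start, seg.text, seg.end] for seg in whisper_result.segments)
    del whisper_result
    gc.collect()
    torch.cuda.empty_cache()  # releases cached CUDA memory; host RAM is the concern here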