shashikg / WhisperS2T

An Optimized Speech-to-Text Pipeline for the Whisper Model Supporting Multiple Inference Engines

In-memory audio input mode #65

Open · amdrozdov opened this issue 4 months ago

amdrozdov commented 4 months ago

Hello Whisper S2T team!

In our project we need to work with pre-loaded audio chunks, so I opened a small PR that adds a file_io flag to the WhisperS2T model. This mode allows calling transcribe() directly with NumPy arrays (no file I/O involved). Usage example:

import numpy as np
import whisper_s2t

model = whisper_s2t.load_model(
    model_identifier="./models/faster-whisper-large-v3",
    backend='CTranslate2',
    n_mels=128,
    file_io=False
)

# Per-chunk decoding options
lang_codes = ['en']
tasks = ['transcribe']
initial_prompts = [None]

# Some pre-loaded audio chunks: raw 16-bit PCM bytes (my_data) converted
# to float32 samples in [-1.0, 1.0]
audio_chunks = [np.frombuffer(my_data, np.int16).flatten().astype(np.float32) / 32768.0]

result = model.transcribe(
    audio_chunks,
    lang_codes=lang_codes,
    tasks=tasks,
    initial_prompts=initial_prompts,
    batch_size=32
)
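
For reference, here is one way to produce such chunks yourself. This is a minimal sketch assuming a 16 kHz mono 16-bit PCM WAV file; the filename and the 30-second chunk length are arbitrary examples, not part of the PR:

import wave

import numpy as np

# Read raw 16-bit PCM samples (assumes a 16 kHz mono WAV file)
with wave.open("speech.wav", "rb") as f:
    pcm_bytes = f.readframes(f.getnframes())

# Convert to float32 in [-1.0, 1.0], matching the format above
audio = np.frombuffer(pcm_bytes, np.int16).flatten().astype(np.float32) / 32768.0

# Split into 30-second windows (Whisper's native segment length)
chunk_len = 30 * 16000
audio_chunks = [audio[i:i + chunk_len] for i in range(0, len(audio), chunk_len)]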

Please let me know whether I need to update tests or benchmarks as well in order to get the PR merged.

P.S. There is a ticket, https://github.com/shashikg/WhisperS2T/issues/25, and this PR could be a first step toward it (if we control the external VAD and the hypothesis buffer outside of WhisperS2T).
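
To sketch that direction, an in-memory streaming loop could look roughly like the following. detect_speech_segments stands in for whatever external VAD is used; it is hypothetical and not part of WhisperS2T or this PR:

import numpy as np

def detect_speech_segments(buffer: np.ndarray) -> list:
    # Hypothetical external VAD: returns speech-only float32 slices of
    # `buffer`. A real implementation could wrap e.g. silero-vad.
    raise NotImplementedError

def stream_transcribe(model, pcm_stream):
    # Consume an iterator of float32 audio buffers and transcribe each
    # detected speech segment as it arrives, entirely in memory.
    for buffer in pcm_stream:
        segments = detect_speech_segments(buffer)
        if not segments:
            continue
        yield model.transcribe(
            segments,
            lang_codes=['en'] * len(segments),
            tasks=['transcribe'] * len(segments),
            initial_prompts=[None] * len(segments),
            batch_size=8,
        )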

Best regards, Andrei