shashikg / WhisperS2T

An Optimized Speech-to-Text Pipeline for the Whisper Model Supporting Multiple Inference Engines
MIT License

other backends such as whisper.cpp? #33

Open BBC-Esq opened 9 months ago

BBC-Esq commented 9 months ago

If you're collecting backends, I'd be very interested in seeing whisper.cpp as a possible backend. Here are some links:

https://github.com/ggerganov/whisper.cpp
https://github.com/abdeladim-s/pywhispercpp
https://github.com/tigros/Whisperer
https://github.com/Const-me/Whisper
https://github.com/aarnphm/whispercpp

shashikg commented 8 months ago

I think whisper.cpp does not support batching. Do you know of any community implementation of batched whisper.cpp?

BBC-Esq commented 8 months ago

The "tigros" link i gave you above, the guy names it "batch" but I'm not sure if it's "batch" in the same sense as you mean the word in a technical sense...

BBC-Esq commented 8 months ago

And apparently he uses the "const-me/whisper" repository's approach and just creates multiple instances, which might be technically different from what you're referring to?
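
Just to illustrate the distinction (a toy sketch, not how Whisperer actually works): the "multiple instances" approach runs several independent single-file transcriptions concurrently, whereas true batching pushes several segments through the model in a single forward pass. Here `transcribe_one` is a hypothetical stand-in for any single-file backend call:

```python
# Toy sketch of the "multiple instances" approach: N independent workers, each
# making its own single-file model call, versus one model seeing a whole batch
# per forward pass. `transcribe_one` is a hypothetical placeholder, not a real API.
from concurrent.futures import ThreadPoolExecutor


def transcribe_one(audio_path: str) -> str:
    # Placeholder: imagine this drives its own whisper.cpp instance.
    return f"transcript of {audio_path}"


def transcribe_many_with_instances(paths: list[str], n_workers: int = 4) -> list[str]:
    # Concurrency across files, but each forward pass still sees only one file.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(transcribe_one, paths))


if __name__ == "__main__":
    print(transcribe_many_with_instances(["a.wav", "b.wav", "c.wav"]))
```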

However, correct me if I'm wrong, but weren't you the one to implement batch processing with ctranslate2? I don't know of anyone else who did it before you, and it's something I'd been looking for for a while. WhisperX kind of did it, I guess... I know there was some discussion on faster-whisper about it, but I didn't think the author actually did it.

That's why I thought you could implement batch processing in whisper.cpp if it didn't already exist.

BBC-Esq commented 8 months ago

Upon further researching the issue... am I correct in understanding that you're referring to batch processing capabilities like this method within the ctranslate2 library:

```
Whisper::generate(const StorageView& features,
                  std::vector<std::vector<std::string>> prompts,
                  WhisperOptions options) {
  const size_t batch_size = features.dim(0);
  return post_batch<WhisperGenerationResult>(
      [features = features.sync_copy(),
       prompts = std::move(prompts),
       options = std::move(options)]
      (WhisperReplica& replica) mutable {
        return replica.generate(std::move(features), prompts, options);
      },
      batch_size);
}
```

I believe your program primarily harnesses ctranslate2's built-in batch processing in this manner, in contrast to the faster-whisper library. Basically, WhisperS2T sends a whole array of features in one call, whereas faster-whisper doesn't?

And you're wondering if whisper.cpp has something similar?
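
For what it's worth, here's a rough, untested sketch of what that batched entry point looks like from ctranslate2's Python API. The model directory name and the random array are just placeholders; the point is that `generate()` receives one features tensor with a batch dimension and one prompt per item:

```python
# Hedged sketch of batched generation through ctranslate2's Python Whisper API.
# Assumptions: "whisper-ct2" is a CTranslate2-converted Whisper model directory,
# and the random array stands in for real log-mel features of shape (batch, 80, 3000).
import numpy as np
import ctranslate2

model = ctranslate2.models.Whisper("whisper-ct2", device="cpu")

mels = np.random.rand(4, 80, 3000).astype(np.float32)  # 4 clips in one batch
features = ctranslate2.StorageView.from_array(mels)

# One prompt per batch item -- mirrors the vector<vector<string>> in the C++ above.
prompt = ["<|startoftranscript|>", "<|en|>", "<|transcribe|>", "<|notimestamps|>"]
results = model.generate(features, [prompt] * mels.shape[0], beam_size=1)

for result in results:
    print(result.sequences[0])  # decoded token strings for each clip
```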

BBC-Esq commented 8 months ago

I did some further research on the whisper.cpp library, and here's what I found within the main source file, whisper.cpp.

https://github.com/ggerganov/whisper.cpp/blob/master/whisper.cpp

I "believe" that it allows for batch processing in this snippet. I will first provide you with the portions from the "cpp" repository that pertain to batch processing...then, if I'm able, I'll locate any python bindings for whisper.cpp that implement the batch processing feature...keep in mind that the python bindings I've found don't stay up to date as often as the cpp repository...Here goes:

Multiple examples of batch references from whisper.cpp:

```
struct whisper_batch {
    int32_t n_tokens;

    whisper_token  *  token;
    whisper_pos    *  pos;
    int32_t        *  n_seq_id;
    whisper_seq_id ** seq_id;   // null terminated
    int8_t         *  logits;
};

static struct whisper_batch whisper_batch_init(int32_t n_tokens, int32_t n_seq_max) {
    whisper_batch batch = { 0, nullptr, nullptr, nullptr, nullptr, nullptr, };

    batch.token    = (whisper_token *  ) malloc(sizeof(whisper_token)    * (n_tokens));
    batch.pos      = (whisper_pos *)     malloc(sizeof(whisper_pos)      * (n_tokens));
    batch.n_seq_id = (int32_t *)         malloc(sizeof(int32_t)          * (n_tokens));
    batch.seq_id   = (whisper_seq_id **) malloc(sizeof(whisper_seq_id *) * (n_tokens + 1));
    for (int i = 0; i < n_tokens; ++i) {
        batch.seq_id[i] = (whisper_seq_id *) malloc(sizeof(whisper_seq_id) * n_seq_max);
    }
    batch.seq_id[n_tokens] = nullptr;
    batch.logits   = (int8_t *)          malloc(sizeof(int8_t)           * n_tokens);

    return batch;
}

static void whisper_batch_free(struct whisper_batch batch) {
    if (batch.token)    free(batch.token);
    if (batch.pos)      free(batch.pos);
    if (batch.n_seq_id) free(batch.n_seq_id);
    if (batch.seq_id) {
        for (int i = 0; batch.seq_id[i]; ++i) {
            free(batch.seq_id[i]);
        }
        free(batch.seq_id);
    }
    if (batch.logits)   free(batch.logits);
}

static void whisper_batch_prep_legacy(whisper_batch & batch, const whisper_token * tokens, int n_tokens, int n_past, int seq_id) {
    batch.n_tokens = n_tokens;
    for (int i = 0; i < n_tokens; ++i) {
        if (tokens) {
            batch.token[i] = tokens[i];
        }
        batch.pos     [i]    = n_past + i;
        batch.n_seq_id[i]    = 1;
        batch.seq_id  [i][0] = seq_id;
        batch.logits  [i]    = 0;
    }
    batch.logits[n_tokens - 1] = 1;
}
```

```
static struct ggml_cgraph * whisper_build_graph_decoder(
         whisper_context & wctx,
           whisper_state & wstate,
     const whisper_batch & batch,
                     bool  worst_case) {
    const auto & model   = wctx.model;
    const auto & hparams = model.hparams;

    auto & kv_self = wstate.kv_self;

    WHISPER_ASSERT(!!kv_self.ctx);

    const int n_ctx   = kv_self.size;
    const int n_state = hparams.n_text_state;
    const int n_head  = hparams.n_text_head;
    const int n_layer = hparams.n_text_layer;

    const int n_tokens    = batch.n_tokens;
    const int n_audio_ctx = wstate.exp_n_audio_ctx > 0 ? wstate.exp_n_audio_ctx : hparams.n_audio_ctx;

    const int32_t n_kv    = worst_case ? n_ctx            : kv_self.n;
    const int32_t kv_head = worst_case ? n_ctx - n_tokens : kv_self.head;

    // ... (rest of the function builds the decoder graph over the batched tokens:
    //      the "embd", "position" and "KQ_mask" inputs, self-attention against the
    //      KV cache, cross-attention against the encoder output, the MLP, and the
    //      final logits; see whisper.cpp for the full body) ...
}

// evaluate the decoder
//
// given text prompt + audio features -> computes the logits for the next token
//
//   - model:     the model
//   - n_threads: number of threads to use
//   - tokens:    text prompt
//   - n_tokens:  number of tokens in the prompt
//   - n_past:    number of past tokens to prefix the prompt with
//
static bool whisper_decode_internal(
        whisper_context & wctx,
          whisper_state & wstate,
    const whisper_batch & batch,
              const int   n_threads,
    ggml_abort_callback   abort_callback,
                   void * abort_callback_data) {
    const int64_t t_start_us = ggml_time_us();

    const auto & model   = wctx.model;
    const auto & hparams = model.hparams;

    const int n_vocab  = hparams.n_vocab;
    const int n_tokens = batch.n_tokens;

    auto & logits_out = wstate.logits;

    struct ggml_tensor * logits;

    // find KV slot for the batch
    {
        auto & kv_self = wstate.kv_self;

        if (!whisper_kv_cache_find_slot(kv_self, batch)) {
            return false;
        }

        kv_self.n = whisper_kv_cache_cell_max(kv_self);
    }

    // decoder
    {
        // ... (builds the graph via whisper_build_graph_decoder, fills the "embd",
        //      "position" and "KQ_mask" inputs from the batch, sets `logits` to the
        //      last graph node, and runs the graph) ...
    }

    logits_out.resize(n_tokens*n_vocab);
    for (int i = 0; i < n_tokens; i++) {
        if (batch.logits[i] == 0) {
            continue;
        }
        ggml_backend_tensor_get(logits, logits_out.data() + (n_vocab*i), sizeof(float)*(n_vocab*i), sizeof(float)*n_vocab);
    }

    if (batch.n_tokens == 1) {
        wstate.t_decode_us += ggml_time_us() - t_start_us;
        wstate.n_decode++;
    } else if (batch.n_tokens < 16) {
        wstate.t_batchd_us += ggml_time_us() - t_start_us;
        wstate.n_batchd += n_tokens;
    } else {
        wstate.t_prompt_us += ggml_time_us() - t_start_us;
        wstate.n_prompt += n_tokens;
    }

    return !(abort_callback && abort_callback(abort_callback_data));
}
```

```
int whisper_decode_with_state(struct whisper_context * ctx, struct whisper_state * state, const whisper_token * tokens, int n_tokens, int n_past, int n_threads) {
    whisper_batch_prep_legacy(state->batch, tokens, n_tokens, n_past, 0);

    whisper_kv_cache_seq_rm(state->kv_self, 0, n_past, -1);

    if (!whisper_decode_internal(*ctx, *state, state->batch, n_threads, nullptr, nullptr)) {
        WHISPER_LOG_ERROR("%s: failed to eval\n", __func__);
        return 1;
    }

    return 0;
}

int whisper_decode(struct whisper_context * ctx, const whisper_token * tokens, int n_tokens, int n_past, int n_threads) {
    if (ctx->state == nullptr) {
        WHISPER_LOG_ERROR("%s: ERROR state was not loaded.\n", __func__);
        return -1;
    }

    return whisper_decode_with_state(ctx, ctx->state, tokens, n_tokens, n_past, n_threads);
}

// ... (the rest of the pasted excerpt -- whisper_tokenize, the whisper_lang_*
//      helpers including whisper_lang_auto_detect_with_state, the whisper_model_*
//      and whisper_n_* getters, whisper_get_logits*, the whisper_token_* getters,
//      whisper_print_timings, and whisper_reset_timings -- is not batch-specific
//      and is omitted here for length) ...
```

There is one more chunk, starting on line 4478 of whisper.cpp and ending on line 5898, that I felt was too long to paste here. I'll also try to paste, if I can find them, any Python bindings that expose the batch functionality of whisper.cpp. Thanks!

BBC-Esq commented 8 months ago

This also seems to confirm that whisper.cpp supports batch processing like Hugging Face and ctranslate2 do...

https://github.com/ggerganov/whisper.cpp/pull/1486

BBC-Esq commented 8 months ago

In the announcement for version 1.5, it states that they support batching...

https://github.com/ggerganov/whisper.cpp/releases?q=batch&expanded=true

And these Python bindings claim to support whisper.cpp all the way to version 1.5.4:

https://github.com/abdeladim-s/pywhispercpp/releases/tag/v1.2.0

Note, the above bindings are NOT listed in the whisper.cpp repo for some reason... the only two listed are the following, which haven't been updated in quite a while...

https://github.com/stlukey/whispercpp.py https://github.com/aarnphm/whispercpp

I don't know if this means the owner, "abdeladim-s," just hasn't requested a community integration shoutout or what... so I wouldn't necessarily assume that his bindings aren't good...
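
For reference, here's roughly what using the abdeladim-s bindings looks like, going from memory of their README (untested on my end, so treat the exact class names and parameters as assumptions):

```python
# Rough sketch of pywhispercpp usage (from memory of the project's README;
# the exact class names and parameters may differ between releases).
from pywhispercpp.model import Model

model = Model("base.en", n_threads=4)       # downloads/loads the ggml model
segments = model.transcribe("sample.wav")   # one audio file per call
for segment in segments:
    print(segment.text)
```

As far as I can tell it's still one audio file per call, so whether the new batched decoder is actually exposed through the bindings is a separate question.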