niedev / RTranslator

Open source real-time translation app for Android that runs locally
Apache License 2.0
4.96k stars 367 forks

just thought #30

Open yenerismail opened 4 days ago

yenerismail commented 4 days ago

Hello, as an end user (my suggestions may be ridiculous because I have no software knowledge):

"The model used is Whisper-Small-244M with KV cache." Can Whisper-Large-V3 be used? Can the user make a choice (such as tiny, base, small, medium, large)? CPUs and GPUs are advancing rapidly in GSM phones; for example, my phone has a Qualcomm Snapdragon 8 Gen 2 and an Adreno(TM) 740.

Can corrections be made during the conversation to prevent people from understanding and translating the wrong word? (Walkie Talkie Mode)

(Walkie Talkie Mode) Can it be adapted for a single language?

Is it possible to input voice for Conversation Mode? (Without keyboard feature)

niedev commented 4 days ago

Hello,

Don't worry, no suggestion is ridiculous.

Can Whisper-Large-V3 be used? Can the user make a choice? (such as tiny, base, small, medium, large) CPU and GPU are advancing rapidly in GSM phones. For example, my phone is Qualcomm Snapdragon 8 Gen 2 and Adreno(TM) 740

The most limiting factor for integrating larger models is the amount of RAM on phones. Let's start from the assumption that the maximum amount of RAM usable by an application is usually half of the phone's RAM (the rest is consumed by the operating system and other apps).

To explain it in simple terms: an AI model must be loaded entirely into RAM to be executed. To calculate its minimum consumption in bytes (actual usage is usually higher), just multiply the number of its parameters by 1 (normally by 4, but my models are quantized, so each parameter weighs 1 byte instead of 4). Whisper Large, for example, which has 1.5B parameters, would consume 1.5GB (so it would also be usable, but let's continue).
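This back-of-envelope estimate can be sketched in a few lines of Python (the parameter counts and the half-of-RAM rule come from this thread; the 0.6GB figure for the translation model is only illustrative, and real inference adds overhead for activations and the KV cache):

```python
# Rough RAM estimate for a quantized model:
# bytes ≈ number of parameters × bytes per parameter.
# Real inference consumes more (activations, KV cache, runtime buffers).

def model_ram_gb(params_billion: float, bytes_per_param: float = 1.0) -> float:
    """Minimum RAM in GB for the model weights alone."""
    return params_billion * bytes_per_param

def fits(phone_ram_gb: float, *models_gb: float) -> bool:
    """Apps can usually use about half of the phone's RAM."""
    return sum(models_gb) <= phone_ram_gb / 2

whisper_large = model_ram_gb(1.5)    # 1.5B params at 1 byte/param -> 1.5 GB
whisper_small = model_ram_gb(0.244)  # ~0.24 GB

# Would Whisper Large plus an illustrative 0.6 GB translator fit on a 12GB phone?
print(fits(12, whisper_large, 0.6))
```

The weights alone fit easily; as the rest of the thread explains, it is the combined footprint of both models plus runtime overhead that gets dangerously close to the limit in practice.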

In the case of my app I have to keep both Whisper and the translation model (in this case NLLB) in RAM. For Whisper, the increase in quality from the small model onwards is gradually smaller; in fact, based on the data and my tests, the quality of Whisper small is already very good. The side that needs the most improvement is the translation: there, unlike with Whisper, translator models with more parameters have significantly higher quality than NLLB.

Precisely for this reason, before the release of the app I tried Madlad, a 3B-parameter translator (4GB of RAM used, because to maintain quality I had to leave some parameters at 4 bytes). Together with Whisper small, the total RAM consumption was about 5GB (even Whisper small consumes more than expected), and even on my phone with 12GB of RAM, being so close to the limit (6GB for a 12GB phone), the app sometimes crashed randomly.

So I would say that, at least for now, only those who have a phone with 16GB of RAM could enjoy a better experience than the current one (even if slower), and they are too few to justify the time needed to add other models. That said, when OnnxRuntime supports 0.5-byte (4-bit) quantization I will probably be able to include Madlad among the options (and before that I could also add Whisper base).
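To put a number on the 4-bit case (a projection under the assumption that every parameter can be quantized to 0.5 bytes, which the thread notes was not possible for Madlad at higher precision without quality loss):

```python
# Projected weight sizes if 4-bit (0.5 byte/param) quantization were available.
# Assumes all parameters can be quantized; in practice some may need to stay
# at higher precision to preserve quality, as happened with Madlad.
madlad_4bit_gb = 3.0 * 0.5       # 3B params × 0.5 bytes -> 1.5 GB (down from 4 GB)
whisper_small_gb = 0.244         # current 8-bit Whisper small, ~0.24 GB
budget_gb = 12 / 2               # usable RAM on a 12GB phone

print(madlad_4bit_gb + whisper_small_gb <= budget_gb)  # True
```

Under that assumption the combined weights would sit well under the 6GB budget, which is why 4-bit support would make Madlad a realistic option.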

I have already gone on too long 🙃, so on execution speed I'll just say that I can only use the CPU, because to use the GPU I would have to use Android APIs (NNAPI) that are supported by only a few chip models 😡 (my Snapdragon 8+ Gen 1 is not supported, for example).

Can corrections be made during the conversation to prevent people from understanding and translating the wrong word? (Walkie Talkie Mode). (Walkie Talkie Mode) Can it be adapted for a single language? Is it possible to input voice for Conversation Mode? (Without keyboard feature)

I didn't understand these questions, what do you mean?

yenerismail commented 4 days ago

Hello, firstly, thank you for your reply. I live in Türkiye and speak Turkish. For translation, I use Google Translate. "Can corrections be made during the conversation to prevent people from understanding and translating the wrong word? (Walkie Talkie Mode)" Example: Spoken: "Mr. Ismail, shall I pour some tea?" Translated: "Brother Esma, shall I pour some tea?"

I think the accuracy rate is due to Whisper. I wanted to ask if there could be corrections for issues like these. I don't think there can be a permanent fix; this could simply be an option. Accuracy rates vary for each language.

(Walkie Talkie Mode) Can it be adapted for a single language? I thought there would be no translation differences between the two people speaking, considering it is the same language. The people speaking may be deaf or hard of hearing, so I suggested this with that in mind.

Enjoy your work,

Kishlay-notabot commented 4 days ago

thanks for explaining in such nice detail @niedev, I'm not into AI but it's fun to know!

niedev commented 4 days ago

Can it be adapted for a single language?

You can already do this by setting the same language for both languages in WalkieTalkie mode; I adapted WalkieTalkie mode to become practically a transcriber in that case.

Can corrections be made during the conversation to prevent people from understanding and translating the wrong word?

Turkish seems to have problems with translation as well. In this particular case the language identification probably failed and the app translated English text into English, thinking it was Turkish. Solving this is complicated, because the method I have found to improve language recognition hurts performance quite a bit, so I can implement this technique only once I have optimized Whisper's speed even more (or maybe I will add an option to manually specify the language spoken in WalkieTalkie mode).

data-man commented 23 hours ago

Great project, thank you!

I hope https://github.com/ggerganov/whisper.cpp can be useful for you.

niedev commented 23 hours ago

Thank you @data-man! I already tried whisper.cpp during the development of RTranslator 2.0 but the inference speed is slower than OnnxRuntime, so in the end I opted for the latter.

data-man commented 23 hours ago

Oh, I forgot about https://github.com/rhasspy/piper. :)

niedev commented 23 hours ago

Oh, I forgot about https://github.com/rhasspy/piper. :)

Oh, I didn't know these models, I'll take a look at them, thanks!