soupslurpr / Transcribro

Private and on-device speech recognition keyboard and service for Android.
ISC License

Bug: 0.2.0 Null output or stops responding with long input #21

Closed Charles7z closed 3 months ago

Charles7z commented 3 months ago

Android 14, GrapheneOS, Pixel 6a. Tested in a text editor and Markor.

The previous versions I tested did not do this.

I've reproduced this consistently.

Tried different settings; it didn't matter whether auto-start recognition was enabled or not. Disabling GrapheneOS exploit protections for the app didn't make a difference.

A lengthy input causes either a flicker of the screen and no output, or the app gets the "not responding" message and no output.

I did the following input:

1 and 2 and 3 and 4 and 5 and 6 and...

Up to 20 worked. 25 caused problems, consistently.

System memory shows max memory of 500 used.

Edit:

I did not tap the mic to stop recognition. However, I tested that, and it gives a toast saying it's still working, no matter how long I wait.

Got one to work up to 30 😂 but it took a very long time for the output = not usable.

soupslurpr commented 3 months ago

Reproduced, investigating right now. Thanks for the report.

soupslurpr commented 3 months ago

Fixed in version 0.2.1. Version 0.2.1 also improves speed and accuracy because of the reworking that was done.

soupslurpr commented 3 months ago

Give it a try :D

Charles7z commented 3 months ago

Didn't get your message to test till this morning, so I'm assuming that I don't need to test the non-release?

I tested 0.2.1. And it's got issues.

Did the same "1 and 2 and ..." up to 50.

  1. It took a very long time to output. Too long to be usable.
  2. The output started changing from numerals to spelled-out numbers at twenty-seven.
  3. Output ended at "thirty-eight, and"
  4. But it didn't crash or lock up.

It was better with the smaller module 🤣 I don't think you're gaining any more accuracy and you're getting hallucinations, consistently 🤷‍♂️

SayBoard is my go-to voice-to-text app because:

  1. It's fast
  2. The keyboard can be used to add punctuation, numbers, etc. as I see fit.
  3. Because it's fast, I can use the keyboard at the same time. It would be perfect if it were integrated to work at the same time as your main keyboard 🤔

From testing these voice inputs since early Android I've concluded there are two approaches to take.

First, with the original Google input you had to say the punctuation and "new paragraph", but it worked very well at a ~30MB download. It was fast and didn't try to do everything for you. It expects the user to know what to do and to enter some input manually. This idea carries on with SayBoard.

The second approach is what's taking place with Google these days, your app, Futo, and all the AI apps. And that is for the app to figure out what the user wants, maybe even make it better: the "app does everything for you" approach. Google's current approach gives you a ~300MB download for offline input and sucks compared to 10+ years ago: it misses all punctuation, you can't even input it verbally, etc.

IMO the second approach is a ways off, especially for "privacy" & offline situations. Also, the tech just isn't there IMO. Futo has functioned the best, but honestly I uninstalled it in favor of SayBoard. Why? Because the only thing Futo did was add punctuation and sometimes convert words to numbers, so at times I still had to spend time fixing the output. All that at a 500MB install. SayBoard is fast enough that I can make the decisions as I go, and it comes out nearly perfect every time. 🤷‍♂️

IMO for an offline voice input to be really useful it needs to work with a keyboard at the same time and be fast enough that you can actually use the keyboard in real time. These AI-based apps are more of a "hope it gets the output right" 🤞 situation = lazy user that doesn't want to think. 🤣

Please don't take this the wrong way. Just food for thought. And my personal opinion.

But you seemed to have added more problems and no benefit by going to a bigger module.

soupslurpr commented 3 months ago

Issue 2 is the Whisper model deciding to switch it for some reason, while 3 may be because you spoke too low and the voice activity detection detected no speech as a result.

SayBoard also uses a machine learning model, VOSK. It's not any less an "AI app" than Transcribro. The difference between VOSK and Whisper is that VOSK is way faster, but with less accuracy and robustness to different conditions.

VOSK seems to operate on a "streaming" basis: it transcribes in real time as the audio of the words is streamed to it. Whisper operates differently and kinda needs a whole chunk of audio at once, instead of accepting the audio as it gets recorded. That's unfortunate, as it means it can't process the audio while it is being recorded and instead has to process it all once the recording ends.
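
To make the distinction concrete, here is a rough Python sketch of the two processing styles. It is not the VOSK or Whisper API: `transcribe_streaming`, `transcribe_batch`, and the toy decoder callbacks are hypothetical names used only for illustration.

```python
from typing import Callable, Iterable, Iterator, List

Frame = List[float]  # one short slice of audio samples (assumed representation)


def transcribe_streaming(frames: Iterable[Frame],
                         decode_frame: Callable[[Frame], str]) -> Iterator[str]:
    """Streaming style: emit text as each frame arrives."""
    for frame in frames:
        text = decode_frame(frame)
        if text:
            yield text


def transcribe_batch(frames: Iterable[Frame],
                     decode_chunk: Callable[[List[float]], str]) -> str:
    """Chunk style: collect the whole recording, then decode it once."""
    samples: List[float] = []
    for frame in frames:
        samples.extend(frame)
    # Nothing can be shown to the user until this single call returns.
    return decode_chunk(samples)


# Toy decoders that only report how much audio they were given.
frames = [[0.1, 0.2], [0.3], [0.4, 0.5]]
partials = list(transcribe_streaming(frames, lambda f: f"{len(f)} samples"))
final = transcribe_batch(frames, lambda s: f"{len(s)} samples")
```

In the streaming case the caller gets three partial results while audio is still arriving; in the chunk case there is one result, and all of the decoding work starts only after the last frame.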

> But you seemed to have added more problems and no benefit by going to a bigger module.

You mean tiny (included with 0.1.0) vs base (0.2.0), right? There may be quants I can apply to tiny (sorta like making it smaller and faster at the cost of some accuracy) that increase its speed a lot. But fundamentally, if you want streaming and closest to real time, go with VOSK.
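
As a rough illustration of the "smaller and faster at the cost of some accuracy" trade-off, here is a minimal sketch of simple 8-bit quantization in Python. It is an assumption for illustration only: the quantization formats actually available for Whisper models may work differently, and `quantize_int8` and `dequantize` are hypothetical helpers.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] plus one shared scale factor."""
    peak = max((abs(w) for w in weights), default=0.0)
    scale = peak / 127 if peak else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate floats; the rounding error is the accuracy cost."""
    return [v * scale for v in q]


weights = [0.82, -0.31, 0.05, -1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each weight now fits in one byte instead of four, and the restored values differ from the originals by at most half a quantization step.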

In 0.3.0, a model picker will be available for choosing the tiny model or other model sizes. The tiny one might be the default, because it may be better with the rework and shouldn't have as many hallucinations, though I need to test it. I initially thought the hallucinations were mainly because of it being tiny, but it seems they were mostly because of adding padding to the end of the speech.

Please definitely use SayBoard if it fits you better because of speed! It totally makes sense if you're fine with no automatic punctuation and less accuracy and robustness (which might not affect you a lot depending on where and how you're using it).

Charles7z commented 3 months ago

0.2.1 tests

There's a bit of a bug: when mic access is disabled and you start the app, it doesn't ask to unblock it. It takes several restarts for it to work.

I'm talking loud, that isn't an issue 😉

The current output takes longer than the input.

Using the input "1 and 2 and..."

Up to 20 took just over that long to output.

Up to 40 took about 60 to output.

I'll attach another log, if it's of any use 🤷‍♂️

I enjoy checking out new apps. And testing them 😁

Yours has promise: a reasonably functional "keyboard" with more robust transcription. I can see it having a solid use. The streaming input of SayBoard has a slightly different use, but the user can adapt to either. The ones that bother me are when the user has to leave the app even for simple input like new paragraphs. That was SayBoard's problem for a while: poor transcription and no keyboard, which is also Google's offline issue and why I haven't used it for years.

You are sitting in a middle ground: the transcription is good enough that you shouldn't need much of a keyboard, but certain input functions are available, which increases user functionality.

I'm simply pointing out the two paths I see these voice inputs going down and how they can be used by the user.

soupslurpr commented 3 months ago

It crashed?

Charles7z commented 3 months ago

I'm wondering if an option to output what you've said so far, instead of waiting for the app to recognize the pause, would be a good option on your keyboard. That way the user could tap it and it would start transcribing the input so far, and the user could then talk more while the output is working. But I don't know if that would be problematic.

This, I think, is what you were saying Futo does, but yours would be more manually dependent on the user.

🤷‍♂️

soupslurpr commented 3 months ago

No, the top part seems to be a log from 20 hours ago.

Okay, so I'm pretty sure it taking that long isn't a bug. It's just taking that long to process. It should take a lot less time with tiny.

soupslurpr commented 3 months ago

> I'm wondering if an option to output what you've said so far, instead of waiting for the app to recognize the pause, would be a good option on your keyboard. That way the user could tap it and it would start transcribing the input so far, and the user could then talk more while the output is working. But I don't know if that would be problematic.
>
> This, I think, is what you were saying Futo does, but yours would be more manually dependent on the user.
>
> 🤷‍♂️

Hm I'm not sure I understand. FUTO shows the words that were processed as it is processing them instead of waiting until it's finished and pasting it all at once.

soupslurpr commented 3 months ago

So are you saying run the model while you're talking to show a preview in addition to running it again at the end to get a final transcription?

Charles7z commented 3 months ago

No. Your app already does what I'm saying. I'm asking, is there a way to manually trigger the "pause"?

Right now I can pause for three seconds and your app will initiate the transcription. While it's doing the transcription I can start talking again. Then I can do another pause for three seconds, and so on.

Manually trigger the pause so the user can break up the long input into more manageable blocks for the app instead of waiting or ending the input. This also helps the user know where the output blocks end: the user knows when they've hit the "output" key, so when that part of the transcription is output they can add a new paragraph if necessary.

I'm making it sound more complicated than it is 🤣

Charles7z commented 3 months ago

No it didn't crash this time.

Charles7z commented 3 months ago

I noticed a minor little problem with your app. If you hit the return key, anytime while you're using your app, it stops the app, which it should not actually do. I should be able to hit the return key after I've paused and I've gotten a certain amount of transcription done to enter a new paragraph, otherwise it minimizes the use of the key.

Charles7z commented 3 months ago

How do I tell the time in those log files?

soupslurpr commented 3 months ago

> No. Your app already does what I'm saying. I'm asking, is there a way to manually trigger the "pause"?
>
> Right now I can pause for three seconds and your app will initiate the transcription. While it's doing the transcription I can start talking again. Then I can do another pause for three seconds, and so on.
>
> Manually trigger the pause so the user can break up the long input into more manageable blocks for the app instead of waiting or ending the input. This also helps the user know where the output blocks end: the user knows when they've hit the "output" key, so when that part of the transcription is output they can add a new paragraph if necessary.
>
> I'm making it sound more complicated than it is 🤣

Ah, I see now. Currently, that depends on the voice activity detection detecting no speech for three seconds. That's when it transcribes what was said as a section and starts a new section. Manually triggering it should be possible, but any suggestions on where the button would go? Maybe on the left, where the settings and previous input method buttons are, styled the same way, just placed lower.
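
A minimal sketch of that sectioning logic with a manual trigger added, written in Python for illustration. This is not Transcribro's code: `Segmenter`, `push`, `flush`, and the 0.5-second frame length are hypothetical, and only the three-second pause comes from the discussion above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SILENCE_LIMIT_S = 3.0  # pause length mentioned in the thread


@dataclass
class Segmenter:
    frame_s: float = 0.5    # assumed duration of one audio frame
    silence_s: float = 0.0  # running length of the current pause
    buffer: List[bytes] = field(default_factory=list)

    def push(self, frame: bytes, is_speech: bool) -> Optional[List[bytes]]:
        """Add one frame; return a finished section after 3 s without speech."""
        if is_speech:
            self.buffer.append(frame)
            self.silence_s = 0.0
            return None
        self.silence_s += self.frame_s
        if self.silence_s >= SILENCE_LIMIT_S:
            return self.flush()
        return None

    def flush(self) -> Optional[List[bytes]]:
        """Manual trigger: hand back the buffered section right away."""
        section, self.buffer = self.buffer, []
        self.silence_s = 0.0
        return section or None
```

In this sketch a hypothetical "output" button would simply call `flush()`, so the pause timeout and the manual trigger share one code path and the returned section can go to the transcriber while new frames keep arriving.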

soupslurpr commented 3 months ago

> I noticed a minor little problem with your app. If you hit the return key, anytime while you're using your app, it stops the app, which it should not actually do. I should be able to hit the return key after I've paused and I've gotten a certain amount of transcription done to enter a new paragraph, otherwise it minimizes the use of the key.

I can't reproduce this issue. The return button works fine for me. What app are you testing it in?

soupslurpr commented 3 months ago

> How do I tell the time in those log files?

The numbers on the left are Unix timestamps (like 1712080376). You can use an online Unix time converter, such as https://www.unixtimestamp.com/, to get the human-readable time.
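
If you'd rather convert offline, a short Python sketch does the same thing. It assumes the log values are whole seconds since the Unix epoch, as in the example number above; `unix_to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def unix_to_utc(ts: int) -> str:
    """Convert Unix seconds to an ISO 8601 timestamp in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()


print(unix_to_utc(1712080376))  # 2024-04-02T17:52:56+00:00
```

The result is in UTC, so adjust for your local time zone when comparing against when you ran a test.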

Charles7z commented 3 months ago

I don't know where I had the app stop when I hit the return button, but it's not doing it now, and I've tested it in a few different apps, so it might have just been a weird glitch. 🤷‍♂️

Where to put the button: just reduce the delete key by half and put it next to that. Or reduce the microphone size; it doesn't really need to be that large, and you've got lots of space then 😉

soupslurpr commented 3 months ago

Yep, I was thinking of just reducing the mic size, putting it on the left, and making the mic width align with the cancel recognition button.