mkiol / dsnote

Speech Note Linux app. Note taking, reading and translating with offline Speech to Text, Text to Speech and Machine translation.
Mozilla Public License 2.0

README suggestions #105

Open danboid opened 7 months ago

danboid commented 7 months ago

I'd like to see the following covered in the README.

I know Speech Note currently only works with some X11 apps, but I think the global keyboard shortcuts feature is a feature that is notable enough that it should be mentioned in the README. It should be mentioned as a useful, if maybe secondary, accessibility aid/feature, because if you are careful (lucky) about which X11 apps you use, SN makes Linux, and no doubt other open source OSs that SN could be ported to, like the BSDs, much more accessible for anyone who has difficulty typing or is otherwise poor at it. It may not be so ideal for terminal work, but it can be used to great effect when composing emails etc.

I realise it probably depends on which model you are using, but what is the minimum amount of RAM required to run SN?

What is the longest speech recording that we can reasonably hope to feed into SN and expect it to cope?

Presuming the host machine has a decent amount of disk space and RAM and they're prepared to wait for the results, could a user potentially let SN run for several hours and expect it to handle a recording of such length or is it only capable of dealing with shorter recordings?

danboid commented 7 months ago

Something like this maybe:

Description

Speech Note lets you take, read and translate notes in multiple languages. It uses Speech to Text, Text to Speech and Machine Translation to do so. Text and voice processing take place entirely offline, locally on your computer, without using a network connection. Your privacy is always respected. No data is sent to the Internet.

Speech Note also features some virtual keyboard input support through global keyboard shortcuts, but this feature is currently only supported by some X11 apps and not under Wayland.

mkiol commented 7 months ago

Thanks for the questions and all suggestions.

I think the global keyboard shortcuts feature is a feature that is notable enough that it should be mentioned in the README.

You are perfectly right, but I need to polish these features more. In their current form they are quite unpredictable. For instance, not all shortcuts work out of the box, some apps don't accept "inserting to active window", and so on. At the very least I have to identify the conditions under which everything should work fine.

I realise it probably depends on which model you are using, but what is the minimum amount of RAM required to run SN?

Everything depends on the model and engine. I can't give exact numbers because I haven't made any measurements or benchmarks. For STT tasks, the lightest is Vosk Small. In TTS, the lightest are eSpeak (obviously) and RHVoice. Piper is pretty efficient on CPU as well.

What is the longest speech recording that we can reasonably hope to feed into SN and expect it to cope?

Are you asking about transcribing a file? SN should not crash even on very long audio. There is a Voice Activity Detector and a non-speech removal pre-processing step that cut the audio into smaller parts containing speech. The parts are processed one by one, so RAM usage should stay stable and should not depend on the duration of the audio.
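To illustrate the idea of cutting audio into speech-only parts and processing them one by one, here is a minimal Python sketch. It is not SN's actual code: it assumes a simple energy threshold in place of a real Voice Activity Detector, and the names `speech_segments`, `transcribe_all`, `threshold` and `min_silence_frames` are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List, Sequence


def speech_segments(
    frames: Iterable[Sequence[float]],
    threshold: float = 0.01,        # assumed energy cut-off, not an SN setting
    min_silence_frames: int = 3,    # assumed pause length that ends a segment
) -> Iterator[List[float]]:
    """Yield speech-only segments from a stream of audio frames.

    Quiet frames are dropped (non-speech removal). A segment ends once
    `min_silence_frames` quiet frames in a row follow some speech.
    """
    segment: List[float] = []
    quiet = 0
    for frame in frames:
        if not frame:
            continue
        # Mean squared amplitude as a crude stand-in for a real VAD.
        energy = sum(x * x for x in frame) / len(frame)
        if energy >= threshold:
            segment.extend(frame)
            quiet = 0
        elif segment:
            quiet += 1
            if quiet >= min_silence_frames:
                yield segment
                segment = []
                quiet = 0
    if segment:
        # Flush whatever speech is left when the audio ends.
        yield segment


def transcribe_all(
    frames: Iterable[Sequence[float]],
    transcribe: Callable[[List[float]], str],
) -> Iterator[str]:
    """Run `transcribe` on one segment at a time.

    Only the current segment is held in memory, so peak usage depends on
    the longest stretch of speech rather than on the total duration.
    """
    for segment in speech_segments(frames):
        yield transcribe(segment)
```

Because `frames` can be any iterable, it could be fed lazily from a file reader, which is what keeps memory use independent of the file length in this sketch.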

Presuming the host machine has a decent amount of disk space and RAM and they're prepared to wait for the results, could a user potentially let SN run for several hours and expect it to handle a recording of such length or is it only capable of dealing with shorter recordings?

In the settings, you can change "Listening mode" to "Always on". In this mode, SN always listens and transcribes. It tries to detect silence and processes audio in chunks. RAM is freed after each chunk is processed. There is no specific time limit. It should be able to run in this mode indefinitely.
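A rough sketch of how such an always-on loop can keep memory bounded is shown below. Again, this is an illustration under assumptions and not SN's implementation: the class `AlwaysOnListener`, its `feed` method and the energy-based silence check are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence


class AlwaysOnListener:
    """Buffers incoming audio and hands off a chunk whenever silence is detected."""

    def __init__(
        self,
        transcribe: Callable[[List[float]], str],
        threshold: float = 0.01,   # assumed energy cut-off for "silence"
        silence_frames: int = 3,   # assumed number of quiet frames that close a chunk
    ) -> None:
        self.transcribe = transcribe
        self.threshold = threshold
        self.silence_frames = silence_frames
        self._chunk: List[float] = []
        self._quiet = 0

    @property
    def buffered_samples(self) -> int:
        return len(self._chunk)

    def feed(self, frame: Sequence[float]) -> Optional[str]:
        """Accept one frame; return a transcription when a chunk is closed."""
        if not frame:
            return None
        energy = sum(x * x for x in frame) / len(frame)
        if energy >= self.threshold:
            self._chunk.extend(frame)
            self._quiet = 0
            return None
        if not self._chunk:
            return None  # silence before any speech: nothing to keep
        self._quiet += 1
        if self._quiet < self.silence_frames:
            return None
        # Swap in an empty buffer so the processed chunk can be released.
        chunk, self._chunk, self._quiet = self._chunk, [], 0
        return self.transcribe(chunk)
```

A caller would invoke `feed` for every captured frame in an endless loop. The buffer never holds more than the current utterance, so the loop has no built-in time limit.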

Something like this maybe:

Thank you. It is perfect! I will definitely use it, but first I need to at least determine under what conditions these features are usable and under which they simply cannot work. I don't want to advertise half-baked functionality.