mkiol / dsnote

Speech Note Linux app. Note taking, reading and translating with offline Speech to Text, Text to Speech and Machine translation.
Mozilla Public License 2.0
404 stars 19 forks

flatpak v4.5.0 won't start showing `std::runtime error pa failed` #138

Closed h9j6k closed 2 months ago

h9j6k commented 2 months ago

Hello,

I upgraded dsnote via Flatpak to v4.5.0 on Debian 12.

GPU: NVIDIA Quadro K2200, CPU: Xeon E3-1275L v3

When I ran `flatpak run net.mkiol.SpeechNote --verbose`, dsnote exited before the GUI even showed up:

QIBusPlatformInputContext: invalid portal bus.
[I] 07:41:09.765095217.765 0x7f86fb5c8d00 init:49 - logging to stderr enabled
[D] 07:41:09.765161937.765 0x7f86fb5c8d00 () - version: 4.5.0
[D] 07:41:09.765471932.765 0x7f86fb5c8d00 parse_cpuinfo:117 - cpu flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
[D] 07:41:09.765603113.765 0x7f86fb5c8d00 parse_cpuinfo:125 - cpuinfo: processor-count=8, flags=[avx, avx2, fma, f16c, ]
[D] 07:41:09.765675993.765 0x7f86fb5c8d00 () - translation: "en_US"
[W] 07:41:09.765691168.765 0x7f86fb5c8d00 () - failed to install translation
[D] 07:41:09.765695517.765 0x7f86fb5c8d00 () - starting standalone app
[D] 07:41:09.766341820.766 0x7f86fb5c8d00 () - app: net.mkiol dsnote
[D] 07:41:09.766351489.766 0x7f86fb5c8d00 () - config location: "/home/user/.var/app/net.mkiol.SpeechNote/config"
[D] 07:41:09.766366946.766 0x7f86fb5c8d00 () - data location: "/home/user/.var/app/net.mkiol.SpeechNote/data/net.mkiol/dsnote"
[D] 07:41:09.766372189.766 0x7f86fb5c8d00 () - cache location: "/home/user/.var/app/net.mkiol.SpeechNote/cache/net.mkiol/dsnote"
[D] 07:41:09.766376388.766 0x7f86fb5c8d00 () - settings file: "/home/user/.var/app/net.mkiol.SpeechNote/config/net.mkiol/dsnote/settings.conf"
[D] 07:41:09.766380424.766 0x7f86fb5c8d00 () - platform: "xcb"
[D] 07:41:09.766396883.766 0x7f86fb5c8d00 () - enforcing num threads: 0
[D] 07:41:09.788725078.788 0x7f86fb5c8d00 () - starting service: app-standalone
[D] 07:41:09.790285853.790 0x7f86fb5c8d00 () - mbrola dir: "/app/bin"
[D] 07:41:09.790329649.790 0x7f86fb5c8d00 () - espeak dir: "/app/bin"
[D] 07:41:09.790606477.790 0x7f86fb5c8d00 () - module checksum missing, need to unpack: "rhvoicedata"
[D] 07:41:09.790614930.790 0x7f86fb5c8d00 () - unpacking module: "rhvoicedata"
[D] 07:41:09.790640225.790 0x7f86e7fff680 loop:88 - py executor loop started
[D] 07:41:09.790655542.790 0x7f86e7fff680 set_env:84 - set env: PYTHONIOENCODING = utf-8
[D] 07:41:09.790660501.790 0x7f86e7fff680 set_env:84 - set env: HF_HUB_DISABLE_TELEMETRY = 1
[D] 07:41:09.790664458.790 0x7f86e7fff680 set_env:84 - set env: HF_HUB_OFFLINE = 1
[D] 07:41:09.790668523.790 0x7f86e7fff680 set_env:84 - set env: HF_HUB_LOCAL_DIR_AUTO_SYMLINK_THRESHOLD = 100000000000
[D] 07:41:09.790686837.790 0x7f86e7fff680 set_env:84 - set env: HF_HUB_CACHE = /home/user/.var/app/net.mkiol.SpeechNote/cache/net.mkiol/dsnote
[D] 07:41:09.801328279.801 0x7f86f4b68680 () - config version: 81 81
[D] 07:41:09.802339850.802 0x7f86e7fff680 libs_availability:62 - checking: torch cuda
[D] 07:41:09.811157155.811 0x7f86fb5c8d00 () - extracting xz archive: "/app/share/dsnote/rhvoicedata.tar.xz"
[D] 07:41:09.851717083.851 0x7f86f4b68680 () - models changed
[D] 07:41:10.022097482.22 0x7f86fb5c8d00 () - xz decoded, stats: size= 5778164 , duration= 210 , threads= 6
[D] 07:41:10.022141781.22 0x7f86fb5c8d00 () - extracting archive: "/home/user/.var/app/net.mkiol.SpeechNote/data/net.mkiol/dsnote/rhvoicedata.tar"
[D] 07:41:10.114052214.114 0x7f86fb5c8d00 () - module successfully unpacked: "rhvoicedata"
[D] 07:41:10.118235086.118 0x7f86fb5c8d00 () - module already unpacked: "rhvoicedata"
[D] 07:41:10.118261635.118 0x7f86fb5c8d00 () - module checksum missing, need to unpack: "rhvoiceconfig"
[D] 07:41:10.118266445.118 0x7f86fb5c8d00 () - unpacking module: "rhvoiceconfig"
[D] 07:41:10.118574741.118 0x7f86fb5c8d00 () - extracting xz archive: "/app/share/dsnote/rhvoiceconfig.tar.xz"
[D] 07:41:10.119089077.119 0x7f86fb5c8d00 () - xz decoded, stats: size= 2348 , duration= 0 , threads= 6
[D] 07:41:10.119127756.119 0x7f86fb5c8d00 () - extracting archive: "/home/user/.var/app/net.mkiol.SpeechNote/data/net.mkiol/dsnote/rhvoiceconfig.tar"
[D] 07:41:10.119425827.119 0x7f86fb5c8d00 () - module successfully unpacked: "rhvoiceconfig"
[D] 07:41:10.119502128.119 0x7f86fb5c8d00 () - module already unpacked: "rhvoiceconfig"
[D] 07:41:10.119509858.119 0x7f86fb5c8d00 () - module checksum missing, need to unpack: "espeakdata"
[D] 07:41:10.119525601.119 0x7f86fb5c8d00 () - unpacking module: "espeakdata"
[D] 07:41:10.130024040.130 0x7f86fb5c8d00 () - extracting xz archive: "/app/share/dsnote/espeakdata.tar.xz"
[D] 07:41:10.586860373.586 0x7f86fb5c8d00 () - xz decoded, stats: size= 6744252 , duration= 456 , threads= 6
[D] 07:41:10.586905326.586 0x7f86fb5c8d00 () - extracting archive: "/home/user/.var/app/net.mkiol.SpeechNote/data/net.mkiol/dsnote/espeakdata.tar"
[D] 07:41:10.618563188.618 0x7f86fb5c8d00 () - module successfully unpacked: "espeakdata"
[D] 07:41:10.623464519.623 0x7f86fb5c8d00 () - module already unpacked: "espeakdata"
[D] 07:41:10.624690341.624 0x7f86fb5c8d00 () - default stt model not found: "en"
[D] 07:41:10.624717629.624 0x7f86fb5c8d00 () - default tts model not found: "en"
[D] 07:41:10.624722423.624 0x7f86fb5c8d00 () - default mnt lang not found: "en"
[D] 07:41:10.624727173.624 0x7f86fb5c8d00 () - new default mnt lang: "en"
[D] 07:41:10.624734147.624 0x7f86fb5c8d00 () - service refresh status, new state: busy
[D] 07:41:10.624738310.624 0x7f86fb5c8d00 () - service state changed: unknown => busy
[D] 07:41:10.624742658.624 0x7f86fb5c8d00 () - delaying features availability
[D] 07:41:10.626888462.626 0x7f86fb5c8d00 () - runtime prefix: "/app"
[D] 07:41:10.627216718.627 0x7f86fb5c8d00 () - available styles: ("Default", "Fusion", "Imagine", "Material", "org.kde.breeze", "org.kde.desktop", "Plasma", "Universal")
[D] 07:41:10.627320089.627 0x7f86fb5c8d00 () - style paths: ("/usr/lib/qml/QtQuick/Controls.2")
[D] 07:41:10.627343398.627 0x7f86fb5c8d00 () - import paths: ("/usr/lib/qml", "/app/bin", "qrc:/qt-project.org/imports")
[D] 07:41:10.627349247.627 0x7f86fb5c8d00 () - library paths: ("/usr/share/runtime/lib/plugins", "/usr/lib/plugins", "/app/bin")
[D] 07:41:10.627354381.627 0x7f86fb5c8d00 () - using auto qt style
[D] 07:41:10.627358429.627 0x7f86fb5c8d00 () - no XDG_CURRENT_DESKTOP
[D] 07:41:10.627362156.627 0x7f86fb5c8d00 () - switching to style: "org.kde.breeze"
[D] 07:41:10.627586706.627 0x7f86fb5c8d00 () - desktop file: "net.mkiol.SpeechNote"
[D] 07:41:11.138533445.138 0x7f86e7fff680 libs_availability:70 - checking: coqui tts
[D] 07:41:11.138854704.138 0x7f86e7fff680 libs_availability:78 - checking: whisperspeech tts
[D] 07:41:11.139067295.139 0x7f86e7fff680 libs_availability:86 - checking: faster-whisper
[D] 07:41:11.540331218.540 0x7f86e7fff680 libs_availability:94 - checking: transformers
[D] 07:41:11.540375578.540 0x7f86e7fff680 libs_availability:96 - checking: accelerate
[D] 07:41:11.612810574.612 0x7f86fb5c8d00 state_pa_callback:42 - pa failed
terminate called after throwing an instance of 'std::runtime_error'
  what():  pa failed
mkiol commented 2 months ago

Thanks for the report.

This is an error from the PulseAudio API integration added in v4.5.0.

Does your system have a PulseAudio or PipeWire server?

h9j6k commented 2 months ago

Thanks for pointing me to PulseAudio; I had no clue what "pa" meant previously.

I did a quick search on Google and came up with a hack that I have to apply manually each time: first kill any running PulseAudio processes, restart PulseAudio in verbose mode, then pass the PULSE_SERVER env var to `flatpak run`, i.e.,

```shell
pulseaudio -k
pulseaudio -v
PULSE_SERVER=/run/user/$(id -u)/pulse/native flatpak run net.mkiol.SpeechNote --verbose
```

This allows me to launch dsnote; however, is there a way that dsnote can take care of this itself?
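
For reference, the per-launch workaround above can also be made persistent with a standard Flatpak override, and `pactl info` is a quick way to confirm a sound server is reachable. These are ordinary `pactl`/`flatpak` commands, but whether the override actually fixes this particular failure on a dwm setup is untested:

```shell
# Check that a PulseAudio/PipeWire server is actually reachable:
pactl info | grep -i "server name"

# Make the PULSE_SERVER workaround permanent for this app only:
flatpak override --user \
  --env=PULSE_SERVER=/run/user/$(id -u)/pulse/native \
  net.mkiol.SpeechNote
```

The override can be reverted later with `flatpak override --user --reset net.mkiol.SpeechNote`.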

h9j6k commented 2 months ago

Now there is a new situation. With Addon.Nvidia installed, if I try to transcribe an audio file with faster-whisper, it exits with `invalid task id`:

[D] 13:04:34.037341885.37 0x7fcb0f586d00 () - opening file: "/home/user/Downloads/audio.m4a" -1
[D] 13:04:34.037376973.37 0x7fcb0f586d00 init_av_in_format:594 - opening file: /home/user/Downloads/audio.m4a
[D] 13:04:34.115270445.115 0x7fcb0f586d00 () - "audio=[[index=0, media-type=audio, title=, lang=und], ], video=], subtitles="
[D] 13:04:34.115322668.115 0x7fcb0f586d00 () - requested stream index for transcribe: 0
[D] 13:04:34.115347012.115 0x7fcb0f586d00 () - stt transcribe file
[D] 13:04:34.116419091.116 0x7fcb0f586d00 () - default tts model not found: "en"
[D] 13:04:34.116429944.116 0x7fcb0f586d00 () - default mnt lang not found: "en"
[D] 13:04:34.116434694.116 0x7fcb0f586d00 () - new default mnt lang: "en"
[D] 13:04:34.116442408.116 0x7fcb0f586d00 () - choosing model for id: "de_fasterwhisper_large3" "en"
[D] 13:04:34.116463339.116 0x7fcb0f586d00 () - gpu device str: ("CUDA", " 0", " Quadro K2200")
[D] 13:04:34.116481995.116 0x7fcb0f586d00 () - restart stt engine config: "lang=de, lang_code=, model-files=[model-file=/home/user/.var/app/net.mkiol.SpeechNote/cache/net.mkiol/dsnote/speech-models/multilang_fasterwhisper_large3, scorer-file=, ttt-model-file=], speech-mode=automatic, vad-mode=aggressiveness-3, speech-started=0, text-format=subrip, options=t, use-gpu=1, gpu-device=[id=0, api=cuda, name=Quadro K2200, platform-name=], sub-config=[min-segment-dur=4, min-line-length=0, max-line-length=0]"
[D] 13:04:34.116489723.116 0x7fcb0f586d00 () - new stt engine required
[D] 13:04:34.121257343.121 0x7fcb0f586d00 start:224 - stt start
[D] 13:04:34.121346781.121 0x7fcb0f586d00 start:234 - stt start completed
[D] 13:04:34.121373437.121 0x7fcb0f586d00 () - requested stream index: 0
[D] 13:04:34.121377712.121 0x7fc95bd9e680 process:283 - stt processing started
[D] 13:04:34.121401210.121 0x7fcb0f586d00 () - creating audio source
[D] 13:04:34.121403130.121 0x7fc95bd9e680 set_state:469 - stt state: idle => initializing
[D] 13:04:34.121408682.121 0x7fc95bd9e680 set_state:476 - speech detection status: no-speech => initializing (no-speech)
[D] 13:04:34.121415379.121 0x7fcb0f586d00 decompress_to_data_raw_async:1381 - task decompress to data raw async
[D] 13:04:34.121421565.121 0x7fc95bd9e680 create_model:91 - creating fasterwhisper model
[D] 13:04:34.121433162.121 0x7fc95bd9e680 execute:55 - task pushed
[D] 13:04:34.121442775.121 0x7fcb0f586d00 init_av_in_format:594 - opening file: /home/user/Downloads/audio.m4a
[D] 13:04:34.121473424.121 0x7fcb08bd9680 loop:130 - py task execution: start
[D] 13:04:34.121862019.121 0x7fcb08bd9680 operator():101 - cpu info: arch=x86_64, cores=8
[D] 13:04:34.121883966.121 0x7fcb08bd9680 operator():103 - using threads: 8/8
[D] 13:04:34.121890705.121 0x7fcb08bd9680 operator():105 - using device: cuda 0
[D] 13:04:34.122995551.122 0x7fcb0f586d00 init_av_in_format:688 - stream index requested => selecting stream: 0
[D] 13:04:34.123004810.123 0x7fcb0f586d00 init_av:744 - input codec: aac (86018)
[D] 13:04:34.123008437.123 0x7fcb0f586d00 init_av:748 - requested out format: unknown
[D] 13:04:34.123240519.123 0x7fcb0f586d00 init_av:872 - encoder name: pcm_s16le
[D] 13:04:34.123250385.123 0x7fcb0f586d00 init_av:1035 - decoder frame-size: 2048
[D] 13:04:34.123253177.123 0x7fcb0f586d00 init_av:1038 - encoder frame-size: 0
[D] 13:04:34.123255777.123 0x7fcb0f586d00 init_av:1040 - time-base change: 1/48000 => 1/16000
[D] 13:04:34.123258380.123 0x7fcb0f586d00 init_av:1049 - sample-format change: fltp => s16
[D] 13:04:34.123260782.123 0x7fcb0f586d00 init_av:1051 - sample-rate change: 48000 => 16000
[D] 13:04:34.124531576.124 0x7fcb0f586d00 init_av_filter:394 - filter src args: sample_rate=48000:sample_fmt=fltp:time_base=1/48000:channel_layout=stereo
[D] 13:04:34.211255296.211 0x7fcb0f586d00 init_av:1110 - output format: 
[D] 13:04:34.211448642.211 0x7fc9527fc680 operator():1361 - process started
[D] 13:04:34.211639247.211 0x7fcb0f586d00 () - service refresh status, new state: transcribing-file
[D] 13:04:34.211652041.211 0x7fcb0f586d00 () - service state changed: idle => transcribing-file
[D] 13:04:34.211662457.211 0x7fcb0f586d00 () - task state changed: 0 => 3
[D] 13:04:34.211672713.211 0x7fcb0f586d00 () - import file result: ok-import-audio
[D] 13:04:34.211929764.211 0x7fcb0f586d00 () - service refresh status, new state: transcribing-file
[D] 13:04:34.211948028.211 0x7fcb0f586d00 () - transcribe progress: 0
[D] 13:04:34.211957016.211 0x7fcb0f586d00 () - app current task: -1 => 0
[D] 13:04:34.211964904.211 0x7fcb0f586d00 () - app task state: idle => initializing
[D] 13:04:34.212564529.212 0x7fcb0f586d00 () - app service state: idle => transcribing-file
[W] 13:04:34.217262705.217 0x7fcb0f586d00 () - no available mnt langs
[W] 13:04:34.217283144.217 0x7fcb0f586d00 () - no available mnt out langs
[W] 13:04:34.217287436.217 0x7fcb0f586d00 () - no available tts models for in mnt
[W] 13:04:34.217290581.217 0x7fcb0f586d00 () - no available tts models for out mnt
[W] 13:04:34.217295050.217 0x7fcb0f586d00 () - invalid task id
mkiol commented 2 months ago

`PULSE_SERVER=/run/user/$(id -u)/pulse/native flatpak run net.mkiol.SpeechNote --verbose`

This allows me to launch dsnote, however is there a way that dsnote can take care of this?

I'm very glad you were able to find a workaround 👍🏿 I want to better understand why this problem occurred. I tried a fresh Debian 12 + GNOME installation and didn't have this issue. Did you make any significant changes to the audio server on your system? How can I reproduce this problem?

mkiol commented 2 months ago

With Addon.Nvidia installed, if I try to transcribe an audio with faster-whisper, it exits

Did it work in the previous version? In the new version 4.5.0, the CUDA runtime has been updated from 12.2.2 to 12.4.0; this is a very minor update, so I don't think it could break anything. Does the problem also occur with "Whisper" models (not Faster Whisper)?

h9j6k commented 2 months ago

I've tried fresh Debian 12 + GNOME installation

You are awesome!!!

Did you make any significant changes in audio server on your system? How can I reproduce this problem?

Ah, in the first posted log, there was one line: `[D] 07:41:09.766380424.766 0x7f86fb5c8d00 () - platform: "xcb"`

I am not sure whether the xcb platform is the same as in your GNOME verbose info, since I am not using any of those heavy desktop environments. The GUI currently running on my workstation is the suckless tools combo dmenu+dwm+ssterm:

https://tools.suckless.org

Maybe that's the reason why the behind-the-scenes PA setting has not been picked up by dwm on my workstation the way it was in your GNOME setup?

Does the problem also occur with "Whisper" models (not Faster Whisper)?

Before the 4.5.0 upgrade, on 4.4.0, faster-whisper models didn't work at all, neither large v3 nor medium.

On 4.5.0, though, I just noticed one interesting thing: if I use the faster-whisper medium model, it works just fine; only when switching to large v3 does it exit without saying anything useful in the verbose log.

As for whisper models, they always work, large or medium, on both 4.4.0 and 4.5.0.

Could this be an issue of not having enough RAM? (I have only 16GB of ECC RAM.)

But that doesn't make much sense; faster-whisper models are supposed to need less RAM than whisper models.

Edit: grammar, on --> one, modes --> models
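
As a back-of-the-envelope check on the not-enough-memory theory, a model's memory footprint is roughly parameter count times bytes per weight, plus runtime overhead. A minimal sketch (the parameter counts are the publicly documented figures for the Whisper family; the overhead factor is a guess, not a measurement):

```python
def model_mem_gb(n_params: float, bytes_per_param: float, overhead: float = 1.3) -> float:
    """Rough lower bound on model memory: weights only, scaled by a fudge
    factor for activations and allocator caches. Order-of-magnitude only."""
    return n_params * bytes_per_param * overhead / 1e9

# Whisper large-v3 has ~1.55e9 parameters, medium ~0.77e9.
for name, n_params in [("large-v3", 1.55e9), ("medium", 0.77e9)]:
    for dtype, nbytes in [("int8", 1), ("f16", 2), ("f32", 4)]:
        print(f"{name} {dtype}: ~{model_mem_gb(n_params, nbytes):.1f} GB")
```

Even ignoring overhead, large-v3 in f32 needs over 6 GB for weights alone, while int8 large-v3 weights (~1.6 GB) would fit on a 4 GB card in principle, so a pure weights-size explanation doesn't obviously account for the crash.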

mkiol commented 2 months ago

Ah, in the first posted log, there was one line `[D] 07:41:09.766380424.766 0x7f86fb5c8d00 () - platform: "xcb"`

This xcb is just an indicator that you are on X11. Nothing to worry about.

The GUI currently run on my workstation is suckless tools combo dmenu+dwm+ssterm https://tools.suckless.org/ maybe that's the reason why PA behind-the-scene setting has not been picked up by dwm on my workstation as yours in Gnome did?

Thanks for the hint. I will investigate it.

But it does not make much sense, faster-whisper models are supposed to need less RAM than whisper models..

Actually, "Whisper" models are implemented in Speech Note with the whisper.cpp library, which is optimized for minimal RAM and CPU use. It is a bit confusing because of the naming, but Speech Note does not use the original "Whisper" implementation and models from OpenAI; it uses only optimized versions. Both "Whisper" and "Faster Whisper" are almost equally efficient.

h9j6k commented 2 months ago

Thanks for the clarification on whisper/faster-whisper/OpenAI whisper models. Then would it be better to show whisper as "whisper.cpp" instead of just "whisper" in the dsnote menu?

Also, on faster-whisper models: I found that on 4.5.0 the medium faster-whisper model actually does not work either. I noticed this line in the log:

[E] 02:46:00.019774119.19 0x7fa21f7fe680 operator():340 - fasterwhisper py error: RuntimeError: cuDNN failed with status CUDNN_STATUS_ALLOC_FAILED

I mistook a working OpenCL transcription for CUDA yesterday; i.e., it was CPU transcription with faster-whisper that worked, while GPU transcription with faster-whisper didn't.

Could it be that the GPU does not have enough VRAM? My Quadro K2200 only has 4GB of VRAM onboard :(

Edit: I tried the faster-whisper small model; it works, and there is one line in the log:

[2024-05-20 12:02:24.533] [ctranslate2] [thread 9] [warning] The compute type inferred from the saved model is int8_float32, but the target device or backend do not support efficient int8_float32 computation. The model weights have been automatically converted to use the float32 compute type instead.

So I think the culprit is the GPU not having enough VRAM; maybe a CUDA GPU with at least 8GB of VRAM, such as a Quadro T1000, is needed for large v3 faster-whisper?

Edit2: there is a mention of CUDA compute capability 6.1 being required for INT8; mine supports only 5.0:

https://github.com/SYSTRAN/faster-whisper/issues/42#issuecomment-1510421230
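
The compute-capability constraint can be expressed as a simple check. This is an illustrative sketch only, assuming the >= 6.1 threshold for efficient INT8 reported in the linked issue; for an authoritative runtime answer, CTranslate2 itself exposes `ctranslate2.get_supported_compute_types("cuda")`:

```python
def efficient_int8(major: int, minor: int) -> bool:
    """True if a CUDA compute capability meets the >= 6.1 threshold for
    efficient INT8 kernels reported in the linked faster-whisper issue."""
    return (major, minor) >= (6, 1)

# Quadro K2200 is compute capability 5.0; Quadro T1000 is 7.5.
print(efficient_int8(5, 0))  # False -> int8 weights get converted to float32
print(efficient_int8(7, 5))  # True
```

A card below the threshold still runs, but CTranslate2 silently converts the int8 weights to float32, which matches the warning quoted above and roughly quadruples the VRAM needed for weights.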

mkiol commented 2 months ago

Then would it be better to show whisper as whisper.cpp instead of whisper only in dsnote menu?

Definitely. I have to rename them. The name is "Whisper" because initially only one "Whisper" engine, based on whisper.cpp, was implemented. Then I added "Faster Whisper" and this confusion was created.

Also on faster-whisper models, I found that on 4.5.0, actually medium faster-whisper model does not work either [...] RuntimeError: cuDNN failed with status CUDNN_STATUS_ALLOC_FAILED [...] Could it be that GPU does not have enough VRAM? My Quadro K2200 only has 4GB VRAM onboard :(

Edit: I tried faster-whisper small model, it works and there is one line from log

[2024-05-20 12:02:24.533] [ctranslate2] [thread 9] [warning] The compute type inferred from the saved model is int8_float32, but the target device or backend do not support efficient int8_float32 computation. The model weights have been automatically converted to use the float32 compute type instead.

So I think the culprit is the GPU not having enough VRAM, maybe CUDA GPU w/ at least 8GB VRAM such as Quadro T1000 etc. is needed for large v3 faster-whisper??

Edit2: there is a mention of CUDA compute capability 6.1 being required for INT8, mine supports only 5.0

This is very interesting. All "Faster Whisper" models in Speech Note are INT8 because they are the most efficient on CPU. From my tests, on GPU, the difference between INT8 and F16 is minimal. Maybe your card is different.

Would you be able to test F16 models from this site: https://huggingface.co/guillaumekln/faster-whisper-medium.en? To add them to Speech Note, you just need to manually edit the ~/.var/app/net.mkiol.SpeechNote/data/net.mkiol/dsnote/models.json file and add the following entry:

```json
{
    "name": "English (FasterWhisper Medium F16)",
    "model_id": "en_fasterwhisper_medium_f16",
    "engine": "stt_fasterwhisper",
    "lang_id": "en",
    "checksum": "a9514015",
    "checksum_quick": "97c57278",
    "size": "1530465940",
    "comp": "dir",
    "urls": [
        "https://huggingface.co/guillaumekln/faster-whisper-medium.en/resolve/83a3b718775154682e5f775bc5d5fc961d2350ce/model.bin",
        "https://huggingface.co/guillaumekln/faster-whisper-medium.en/resolve/83a3b718775154682e5f775bc5d5fc961d2350ce/config.json",
        "https://huggingface.co/guillaumekln/faster-whisper-medium.en/resolve/83a3b718775154682e5f775bc5d5fc961d2350ce/tokenizer.json",
        "https://huggingface.co/guillaumekln/faster-whisper-medium.en/resolve/83a3b718775154682e5f775bc5d5fc961d2350ce/vocabulary.txt"
    ]
}
```

After restarting, you should be able to download and test the "English (FasterWhisper Medium F16)" model.
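
Hand-editing JSON is easy to get wrong (one stray comma invalidates the whole file), so the insertion can be scripted defensively. A sketch only: the real schema of models.json should be checked against the actual file, so this tolerates both a top-level array and an object holding a "models" array, and skips duplicates by `model_id`:

```python
import json
from pathlib import Path

def add_model_entry(data, entry):
    """Insert a model entry into parsed models.json content, tolerating either
    a top-level list or an object with a "models" list (schema assumption)."""
    models = data if isinstance(data, list) else data.setdefault("models", [])
    if not any(m.get("model_id") == entry["model_id"] for m in models):
        models.append(entry)
    return data

# Hypothetical driver; use the full entry (checksums, size, urls) shown above.
entry = {
    "name": "English (FasterWhisper Medium F16)",
    "model_id": "en_fasterwhisper_medium_f16",
    "engine": "stt_fasterwhisper",
    "lang_id": "en",
}
path = Path.home() / ".var/app/net.mkiol.SpeechNote/data/net.mkiol/dsnote/models.json"
if path.exists():
    updated = add_model_entry(json.loads(path.read_text()), entry)
    path.write_text(json.dumps(updated, indent=4))
```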

h9j6k commented 2 months ago

All "Faster Whisper" models in Speech Note are INT8 because they are most efficient on CPU. From my tests, on GPU, difference between INT8 and F16 is minimal

My guess would be that on GPU it is the opposite of on CPU. From what I read, only high-end and more recent CUDA GPUs can speed up f16/f8/f4 computation (because of having tensor cores?), while most CUDA GPUs can handle normal float32 (because CUDA cores can do that). Maybe the difference in your testing was minimal because f16 was not actually being taken advantage of in the computation; e.g., if you used an RTX 4000 SFF Ada (too expensive!!!) or similar, there could be a noticeable difference?

Would you be able to test F16 models from this site: https://huggingface.co/guillaumekln/faster-whisper-medium.en.

Unfortunately, transcription crashes with the medium F16 model too. I think the same reason applies: my GPU's 4GB of VRAM is not big enough (maybe it's time to look for another CUDA GPU with 8GB of VRAM). The K2200 can only handle the small faster-whisper model, and when I watched nvidia-smi, I saw that even with the small model, dsnote cached 2.7GB of VRAM. Also, even when transcription is complete, it won't release the VRAM it cached. I think this is due to how the ctranslate2 cache works, which is quite different from whisper.cpp:

https://opennmt.net/CTranslate2/environment_variables.html

CT2_CUDA_ALLOCATOR — Allocating memory on the GPU with cudaMalloc is costly and is best avoided in high-performance code. For this reason CTranslate2 integrates caching allocators which enable a fast reuse of previously allocated buffers. The following allocators are integrated:

- cuda_malloc_async (default for CUDA >= 11.2): uses the [asynchronous allocator with memory pools] introduced in CUDA 11.2.
- cub_caching (default for CUDA < 11.2): uses the caching allocator from the [CUB project].

When I use whisper.cpp, not only can it handle the large v3 model (which uses about 2GB of VRAM), it also releases the VRAM when transcription is done.

At the moment whisper.cpp large v3 and faster-whisper small are my go-to models (for accuracy and speed respectively).
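
If the retained VRAM really is CTranslate2's caching allocator, the `CT2_CUDA_ALLOCATOR` variable quoted above can in principle be switched per run. A hedged configuration sketch: the variable itself is documented by CTranslate2, but whether Speech Note's flatpak passes it through to the engine, and whether it helps on a 4GB card, is untested:

```shell
# Try the CUB caching allocator instead of cuda_malloc_async for one run:
flatpak run --env=CT2_CUDA_ALLOCATOR=cub_caching net.mkiol.SpeechNote --verbose
```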

mkiol commented 2 months ago

Unfortunately, it crashes transcription with medium f16 model, I think the same reason applies, my GPU 4GB VRAM is not big enough [...] I think it is due to how ctranslate2 cache works, which is quite different from whisper.cpp.

Thanks for your analysis and tests. This is an additional reason why whisper.cpp may be a better choice than faster-whisper in general.

At the moment whisper.cpp large v3 and faster-whisper small are my go-to models (for accuracy and speed respectively).

Cool :)

mkiol commented 2 months ago

I think this issue can be closed now.