fatal runtime error: stack overflow

docwisdom commented 10 months ago

Czkawka version: 12.11.2 (Docker GUI)
OS version: Unraid 6.12.4
Terminal output [optional]:

[xvnc ] Tue Nov 21 14:22:35 2023
[xvnc ] Connections: accepted: /tmp/vnc.sock
[xvnc ] SConnection: Client needs protocol version 3.8
[xvnc ] SConnection: Client requests security type None(1)
[xvnc ] VNCSConnST: Server default pixel format depth 24 (32bpp) little-endian rgb888
[xvnc ] VNCSConnST: Client pixel format depth 24 (32bpp) little-endian bgr888
[xvnc ] Tue Nov 21 14:27:46 2023
[xvnc ] VNCSConnST: closing /tmp/vnc.sock: Clean disconnection
[xvnc ] EncodeManager: Framebuffer updates: 1523
[xvnc ] EncodeManager: Tight:
[xvnc ] EncodeManager: Solid: 34 rects, 1.23945 Mpixels
[xvnc ] EncodeManager: 544 B (1:9114.32 ratio)
[xvnc ] EncodeManager: Bitmap RLE: 18 rects, 13.809 kpixels
[xvnc ] EncodeManager: 582 B (1:95.2784 ratio)
[xvnc ] EncodeManager: Indexed RLE: 2.615 krects, 429.034 kpixels
[xvnc ] EncodeManager: 409.938 KiB (1:4.16297 ratio)
[xvnc ] EncodeManager: Tight (JPEG):
[xvnc ] EncodeManager: Full Colour: 1.622 krects, 2.05636 Mpixels
[xvnc ] EncodeManager: 3.23682 MiB (1:2.42922 ratio)
[xvnc ] EncodeManager: Total: 4.289 krects, 3.73865 Mpixels
[xvnc ] EncodeManager: 3.63822 MiB (1:3.93348 ratio)
[xvnc ] Connections: closed: /tmp/vnc.sock
[xvnc ] ComparingUpdateTracker: 24.3215 Mpixels in / 1.10949 Mpixels out
[xvnc ] ComparingUpdateTracker: (1:21.9214 ratio)
[xvnc ] Tue Nov 21 14:38:02 2023
[xvnc ] Connections: accepted: /tmp/vnc.sock
[xvnc ] Tue Nov 21 14:38:03 2023
[xvnc ] SConnection: Client needs protocol version 3.8
[xvnc ] SConnection: Client requests security type None(1)
[xvnc ] VNCSConnST: Server default pixel format depth 24 (32bpp) little-endian rgb888
[xvnc ] VNCSConnST: Client pixel format depth 24 (32bpp) little-endian bgr888
[app ] thread '<unknown>' has overflowed its stack
[app ] fatal runtime error: stack overflow
[supervisor ] service 'app' exited (got signal SIGABRT).
[supervisor ] service 'app' exited, shutting down...
[supervisor ] stopping service 'openbox'...
[supervisor ] service 'openbox' exited (with status 0).
[supervisor ] stopping service 'nginx'...

Bug Description: When doing image comparison, the app crashes just after hashing 180,000 images, before showing results in the GUI.

qarmin commented 10 months ago

I tested hashing 300,000 files on my local machine and did not have any problems (it was just slow).

How many CPU cores do you have, and which OS do you use?

docwisdom commented 10 months ago

I tried again multiple times with smaller file counts (1,600 or so) and had the same issue. I think it may have something to do with the gradient hashing variants. If I switch to Blockhash, it seems to do better in testing.

I'm running on Unraid, which is based on Slackware. It's in a Docker container from this repo: https://hub.docker.com/r/jlesage/czkawka/

docwisdom commented 10 months ago

I just tested blockhash on 8k photos and it crashed again.

[xvnc ] Connections: accepted: /tmp/vnc.sock
[xvnc ] SConnection: Client needs protocol version 3.8
[xvnc ] SConnection: Client requests security type None(1)
[xvnc ] VNCSConnST: Server default pixel format depth 24 (32bpp) little-endian rgb888
[xvnc ] VNCSConnST: Client pixel format depth 24 (32bpp) little-endian bgr888
[xvnc ] ComparingUpdateTracker: 0 pixels in / 0 pixels out
[xvnc ] ComparingUpdateTracker: (1:-nan ratio)
[app ] 19:07:46.673 [INFO] czkawka_core::similar_images: find_similar_images
[app ] 19:13:32.743 [INFO] czkawka_core::similar_images: find_similar_images: Done in 346.07s
[app ] 19:15:14.553 [INFO] czkawka_core::similar_images: find_similar_images
[app ] 19:17:06.562 [INFO] czkawka_core::similar_images: find_similar_images: Done in 112.01s
[app ] 19:18:10.395 [INFO] czkawka_core::similar_images: find_similar_images
[app ] thread '<unknown>' has overflowed its stack
[app ] fatal runtime error: stack overflow

qarmin commented 10 months ago

Can you somehow add RUST_LOG=debug to the environment variables of this app? The log shows that the stack overflow happens in the similar images tool, but it does not show the exact function that causes the problem.

By default, most Linux distros have an 8 MB stack, which should be enough for this app, but Slackware is quite an old distribution and may have different limits (it looks like it can have a 1 MB stack size: https://slackwiki.com/Resource_Limits). Can you check what ulimit -s returns? On my OS it returns 8192 [KB].
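The same limit can also be read from inside a running process. Below is a minimal sketch, assuming a Linux system where /proc/self/limits exists; the helper name soft_stack_limit is made up for illustration and is not part of Czkawka. Note that /proc reports the value in bytes, whereas ulimit -s reports KB, so 8192 from ulimit -s corresponds to 8388608 here.

```rust
use std::fs;

/// Extracts the soft "Max stack size" value from the text of /proc/<pid>/limits.
/// Returns the raw token, e.g. "8388608" (bytes) or "unlimited".
/// (Hypothetical helper, for illustration only.)
fn soft_stack_limit(limits: &str) -> Option<&str> {
    limits
        .lines()
        // Keep only the part of the line that follows the limit's name.
        .find_map(|line| line.strip_prefix("Max stack size"))
        // The first remaining column is the soft limit.
        .and_then(|rest| rest.split_whitespace().next())
}

fn main() {
    // Linux-only: on other systems the file is missing and we just report that.
    match fs::read_to_string("/proc/self/limits") {
        Ok(text) => match soft_stack_limit(&text) {
            Some(value) => println!("soft stack limit: {value}"),
            None => println!("no 'Max stack size' line found"),
        },
        Err(err) => println!("could not read /proc/self/limits: {err}"),
    }
}
```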

How many CPU cores/threads does the server have?

docwisdom commented 10 months ago

Unraid is a custom build. It uses an up-to-date kernel, but I don't know much about its inner workings.

root@NAS:~# ulimit -s
unlimited

14 cores, 28 threads

[supervisor ] loading service 'openbox'...
[supervisor ] loading service 'logmonitor'...
[supervisor ] service 'logmonitor' is disabled.
[supervisor ] loading service 'logrotate'...
[supervisor ] all services loaded.
[supervisor ] starting services...
[supervisor ] starting service 'xvnc'...
[xvnc ] Xvnc TigerVNC 1.13.1 - built Nov 10 2023 13:43:39
[xvnc ] Copyright (C) 1999-2022 TigerVNC Team and many others (see README.rst)
[xvnc ] See https://www.tigervnc.org for information on TigerVNC.
[xvnc ] Underlying X server release 12014000
[xvnc ] Thu Nov 23 12:57:23 2023
[xvnc ] vncext: VNC extension running!
[xvnc ] vncext: Listening for VNC connections on /tmp/vnc.sock (mode 0660)
[xvnc ] vncext: Listening for VNC connections on all interface(s), port 5900
[xvnc ] vncext: created VNC server for screen 0
[supervisor ] starting service 'nginx'...
[nginx ] Listening for HTTP connections on port 5800.
[supervisor ] starting service 'openbox'...
[supervisor ] starting service 'app'...
[supervisor ] all services started.
[app ] 20:57:25.680 [INFO] czkawka_core::common: Czkawka version: 6.1.0, was compiled with release mode
[app ] 20:57:26.264 [INFO] czkawka_gui: Set thread number to 28
[xvnc ] Thu Nov 23 12:58:01 2023
[xvnc ] Connections: accepted: /tmp/vnc.sock
[xvnc ] SConnection: Client needs protocol version 3.8
[xvnc ] SConnection: Client requests security type None(1)
[xvnc ] VNCSConnST: Server default pixel format depth 24 (32bpp) little-endian rgb888
[xvnc ] VNCSConnST: Client pixel format depth 24 (32bpp) little-endian bgr888
[xvnc ] ComparingUpdateTracker: 0 pixels in / 0 pixels out
[xvnc ] ComparingUpdateTracker: (1:-nan ratio)
[app ] 20:58:11.345 [DEBUG] czkawka_gui::connect_things::connect_button_search: clean_tree_view
[app ] 20:58:11.345 [DEBUG] czkawka_gui::connect_things::connect_button_search: clean_tree_view: Done in 1.04µs
[app ] 20:58:11.347 [INFO] czkawka_core::similar_images: find_similar_images
[app ] 20:58:11.348 [DEBUG] czkawka_core::similar_images: check_for_similar_images
[app ] 20:58:11.584 [DEBUG] czkawka_core::common: send_info_and_wait_for_ending_all_threads
[app ] 20:58:11.589 [DEBUG] czkawka_core::common: send_info_and_wait_for_ending_all_threads: Done in 4.86ms
[app ] 20:58:11.589 [DEBUG] czkawka_core::similar_images: check_for_similar_images: Done in 241.04ms
[app ] 20:58:11.589 [DEBUG] czkawka_core::similar_images: hash_images
[app ] 20:58:11.589 [DEBUG] czkawka_core::similar_images: hash_images_load_cache
[app ] 20:58:11.589 [DEBUG] czkawka_core::common_cache: load_cache_from_file_generalized_by_path
[app ] 20:58:11.589 [DEBUG] czkawka_core::common_cache: load_cache_from_file_generalized
[app ] 20:58:11.616 [DEBUG] czkawka_core::common_cache: Starting removing outdated cache entries (removing non existent files from cache - true)
[app ] 20:58:11.710 [DEBUG] czkawka_core::common_cache: Completed removing outdated cache entries, removed 0 out of all 3845 entries
[app ] 20:58:11.710 [DEBUG] czkawka_core::common_cache: Loaded cache from file cache_similar_images_32_Blockhash_Lanczos3_61.bin (or json alternative) - 3845 results
[app ] 20:58:11.710 [DEBUG] czkawka_core::common_cache: load_cache_from_file_generalized: Done in 120.77ms
[app ] 20:58:11.710 [DEBUG] czkawka_core::common_cache: Converting cache Vec into BTreeMap<String, T>
[app ] 20:58:11.712 [DEBUG] czkawka_core::common_cache: Converted cache Vec into BTreeMap<String, T>
[app ] 20:58:11.712 [DEBUG] czkawka_core::common_cache: load_cache_from_file_generalized_by_path: Done in 123.11ms
[app ] 20:58:11.712 [DEBUG] czkawka_core::similar_images: hash_images-load_cache - starting calculating diff
[app ] 20:58:11.738 [DEBUG] czkawka_core::similar_images: hash_images_load_cache - completed diff between loaded and prechecked files, 5383(15.41 GiB) - non cached, 3845(8.66 GiB) - already cached
[app ] 20:58:11.738 [DEBUG] czkawka_core::similar_images: hash_images_load_cache: Done in 149.25ms
[app ] 20:58:11.738 [DEBUG] czkawka_core::similar_images: hash_images - start hashing images
[app ] thread '<unknown>' has overflowed its stack
[app ] fatal runtime error: stack overflow
[supervisor ] service 'app' exited (got signal SIGABRT).
[supervisor ] service 'app' exited, shutting down...
[supervisor ] stopping service 'openbox'...
[supervisor ] service 'openbox' exited (with status 0).
[supervisor ] stopping service 'nginx'...
[xvnc ] Thu Nov 23 13:09:21 2023
[xvnc ] VNCSConnST: closing /tmp/vnc.sock: Clean disconnection
[xvnc ] EncodeManager: Framebuffer updates: 4062
[xvnc ] EncodeManager: CopyRect:
[xvnc ] EncodeManager: Copies: 1 rects, 182.7 kpixels
[xvnc ] EncodeManager: 16 B (1:45675.8 ratio)
[xvnc ] EncodeManager: Tight:
[xvnc ] EncodeManager: Solid: 174 rects, 7.3631 Mpixels
[xvnc ] EncodeManager: 2.71875 KiB (1:10579.9 ratio)
[xvnc ] EncodeManager: Bitmap RLE: 99 rects, 74.165 kpixels
[xvnc ] EncodeManager: 2.84766 KiB (1:102.143 ratio)
[xvnc ] EncodeManager: Indexed RLE: 5.702 krects, 1.23611 Mpixels
[xvnc ] EncodeManager: 866.635 KiB (1:5.64873 ratio)
[xvnc ] EncodeManager: Tight (JPEG):
[xvnc ] EncodeManager: Full Colour: 4.698 krects, 5.61785 Mpixels
[xvnc ] EncodeManager: 8.92706 MiB (1:2.40663 ratio)
[xvnc ] EncodeManager: Total: 10.674 krects, 14.4739 Mpixels
[xvnc ] EncodeManager: 9.77884 MiB (1:5.65873 ratio)
[xvnc ] Connections: closed: /tmp/vnc.sock
[xvnc ] ComparingUpdateTracker: 135.499 Mpixels in / 7.93127 Mpixels out
[xvnc ] ComparingUpdateTracker: (1:17.0842 ratio)
[supervisor ] service 'nginx' exited (with status 0).
[supervisor ] stopping service 'xvnc'...
[xvnc ] Thu Nov 23 13:09:22 2023
[xvnc ] ComparingUpdateTracker: 0 pixels in / 0 pixels out
[xvnc ] ComparingUpdateTracker: (1:-nan ratio)
[supervisor ] service 'xvnc' exited (with status 0).
[supervisor ] sending SIGTERM to all processes...
[finish ] executing container finish scripts...
[finish ] all container finish scripts executed.

qarmin commented 10 months ago

In the hash_images function I cannot find any place that could use more than a few kilobytes of stack, so I don't know why the stack overflows.

Limiting the number of cores used is probably the easiest workaround (I have 8 threads and never had similar problems, and I think 15 to 20 should also work fine, but this needs to be tested).

docwisdom commented 10 months ago

I'll try that now

jlesage commented 10 months ago

Note that this version of Czkawka is compiled against musl instead of glibc. The thread stack size allocated by musl is 128K by default, which is small compared to the few MB allocated by glibc (https://wiki.musl-libc.org/functional-differences-from-glibc.html).
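To put those two sizes in perspective, here is a rough back-of-the-envelope sketch. The 512-byte frame size is a made-up figure for illustration, not a measurement of Czkawka, and max_depth is a hypothetical helper; real frames vary widely, and a single large on-stack buffer can use far more.

```rust
/// Rough upper bound on call nesting: stack bytes divided by bytes per call frame.
/// (Hypothetical helper, for illustration only.)
fn max_depth(stack_bytes: usize, frame_bytes: usize) -> usize {
    stack_bytes / frame_bytes
}

fn main() {
    const KIB: usize = 1024;
    const MIB: usize = 1024 * KIB;
    // Assumed frame size, chosen only to make the comparison concrete.
    let frame = 512;
    println!("128 KiB stack: about {} nested calls", max_depth(128 * KIB, frame));
    println!("8 MiB stack:   about {} nested calls", max_depth(8 * MIB, frame));
}
```

Under that assumption the larger stack allows 64 times the nesting depth, which is simply the ratio of the two stack sizes.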

docwisdom commented 10 months ago

Sorry this is beyond my comprehension. Is there a fix?

jlesage commented 10 months ago

The comment was for @qarmin, so he can see if currently Czkawka could approach the thread stack size limit of musl.

jlesage commented 10 months ago

@docwisdom, to see if it's a stack size issue, could you try to run the following commands inside the container? This will increase the default stack size to 1MB.

export GOPATH=/go
add-pkg go git musl-dev
go install github.com/yaegashi/muslstack@latest
cp /usr/bin/czkawka_gui /usr/bin/czkawka_gui2
/go/bin/muslstack -s 0x100000 /usr/bin/czkawka_gui2
mv /usr/bin/czkawka_gui2 /usr/bin/czkawka_gui

Then restart the container and see if it's crashing again. If it does, you can try to increase to 8MB:

cp /usr/bin/czkawka_gui /usr/bin/czkawka_gui2
/go/bin/muslstack -s 0x800000 /usr/bin/czkawka_gui2
mv /usr/bin/czkawka_gui2 /usr/bin/czkawka_gui

docwisdom commented 10 months ago

Thanks for this. I tried both the 1 MB and 8 MB settings, and it still crashed at the end of the hashing.

qarmin commented 10 months ago

From - https://stackoverflow.com/questions/44003589/how-to-increase-the-stack-size-available-to-a-rust-library#comment75039223_44003965:

I'll note that [std::thread::Builder](https://doc.rust-lang.org/1.8.0/std/thread/struct.Builder.html) let you specify the stack size of the created thread from within the program. Only the stack size of the main thread is set by the OS.

So it is possible that the stack size of the worker threads is set from within the program, which would explain why the change above did not help (the main thread in the GUI is not responsible for heavy calculation).
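A minimal sketch of what the quoted comment describes, using only the standard library: the program itself picks the stack size of a thread it spawns. The function names depth_sum and sum_on_thread are made up for illustration and do not come from Czkawka.

```rust
use std::thread;

/// Deliberately non-tail recursion, so each level may keep a stack frame alive.
fn depth_sum(n: u64) -> u64 {
    if n == 0 { 0 } else { n + depth_sum(n - 1) }
}

/// Runs `depth_sum` on a new thread whose stack size is chosen by the program,
/// independently of the limit the OS applies to the main thread.
fn sum_on_thread(stack_bytes: usize, n: u64) -> u64 {
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(move || depth_sum(n))
        .expect("failed to spawn thread")
        .join()
        .expect("thread panicked")
}

fn main() {
    // 8 MiB requested explicitly for the spawned thread.
    let result = sum_on_thread(8 * 1024 * 1024, 10_000);
    println!("{result}");
}
```

The requested size is a minimum; the platform may round it up, so asking for a very small stack does not necessarily produce one.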

I already tried setting the stack size in rayon to 1 byte with https://docs.rs/rayon/latest/rayon/struct.ThreadPoolBuilder.html#method.stack_size to provoke a crash, but everything worked fine, so I'm not sure where the problem could be.

I tried to debug stack usage with https://crates.io/crates/cargo-call-stack, but it looks like that is not possible because of several crashes, and I don't know which other tool I could use to debug this problem.

qarmin commented 10 months ago

In https://github.com/qarmin/czkawka/pull/1102 I changed some stack size values, which may fix the problem, but the current values already work for me, so I cannot test whether this fixes it:

main thread stack size - OS default (not much to calculate/store from the GUI's perspective)
main scan thread stack size - 8MB
worker threads stack size - 4MB
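
The layout in the list above can be sketched roughly as follows. This is not the code from the PR: it uses plain std threads as a stand-in for the rayon worker pool, and the constant and function names are hypothetical.

```rust
use std::thread;

// Sizes taken from the list above; the constant names are hypothetical.
const SCAN_THREAD_STACK: usize = 8 * 1024 * 1024; // 8 MiB
const WORKER_THREAD_STACK: usize = 4 * 1024 * 1024; // 4 MiB

/// Spawns one "scan" thread, which fans the work out to one worker per input.
/// Returns the sum of the squares of `inputs` (a placeholder for real work).
fn scan(inputs: Vec<u64>) -> u64 {
    let scan_thread = thread::Builder::new()
        .name("scan".into())
        .stack_size(SCAN_THREAD_STACK)
        .spawn(move || {
            // Start every worker first, then collect the results.
            let workers: Vec<_> = inputs
                .into_iter()
                .map(|x| {
                    thread::Builder::new()
                        .stack_size(WORKER_THREAD_STACK)
                        .spawn(move || x * x)
                        .expect("failed to spawn worker")
                })
                .collect();
            workers
                .into_iter()
                .map(|handle| handle.join().expect("worker panicked"))
                .sum::<u64>()
        })
        .expect("failed to spawn scan thread");
    scan_thread.join().expect("scan thread panicked")
}

fn main() {
    // The main thread keeps whatever stack the OS gives it.
    println!("{}", scan(vec![1, 2, 3]));
}
```

With rayon, the worker size would instead be passed once to ThreadPoolBuilder::stack_size when the pool is built.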

docwisdom commented 10 months ago

Will this be in an upcoming release?

qarmin commented 10 months ago

Yes, binaries to test are already available here: https://github.com/qarmin/czkawka/actions/runs/6992056327, but since they are built with glibc rather than musl, running them cannot tell us whether the problem has been fixed.

jlesage commented 10 months ago

I've reproduced the stack overflow error. I'm currently testing a version that sets the stack size in rayon. I will let you know about the result.

docwisdom commented 10 months ago

Thank you

jlesage commented 10 months ago

In the end, I can't seem to reproduce it consistently. @docwisdom, could you try the jlesage/czkawka:issue-1140 Docker image and see if you can reproduce it?

docwisdom commented 10 months ago

I've done 3 test batches so far (3-6k each) with no crashes. I think this may have resolved the issue. I am going to do a larger batch this morning.

docwisdom commented 10 months ago

Ran 160,000 photo comparisons using 32 gradient and it completed successfully. I would consider the issue resolved. Thank you.

jlesage commented 10 months ago

@qarmin, this is the patch that @docwisdom tested:

--- a/czkawka_core/src/common.rs        2023-11-24 14:45:40.462095198 -0500
+++ b/czkawka_core/src/common.rs        2023-11-24 14:47:29.678337169 -0500
@@ -76,7 +76,7 @@
 pub fn set_number_of_threads(thread_number: usize) {
     NUMBER_OF_THREADS.set(thread_number);

-    rayon::ThreadPoolBuilder::new().num_threads(get_number_of_threads()).build_global().unwrap();
+    rayon::ThreadPoolBuilder::new().num_threads(get_number_of_threads()).stack_size(8*1024*1024).build_global().unwrap();
 }

 pub const RAW_IMAGE_EXTENSIONS: &[&str] = &[

Do you want to integrate the change yourself, or do you want me to create a PR?

qarmin commented 10 months ago

I already added slightly different limits: https://github.com/qarmin/czkawka/issues/1140#issuecomment-1826019763

jlesage commented 10 months ago

Ok yes, in this PR, 4MB (DEFAULT_WORKER_THREAD_SIZE) is used instead of 8MB. @docwisdom, I pushed jlesage/czkawka:issue-1140-2, if you want to confirm that it's still working with a 4MB stack.

docwisdom commented 10 months ago

tested on 3900 photos, no issues

nicoKoehler commented 7 months ago

@jlesage @qarmin hey, is this rolled up into the available docker image too? I am experiencing the same issue when I run anything other than the standard selected algorithm for similar images. Tested on 3 machines of varying CPU strength, all resulting in the same issue. (pulled jlesage/czkawka image via docker compose)

EDIT: I went to Docker Hub and saw that the "latest" tag is not in fact the latest. There is an image with the tag 1140-2, which was committed around the same time as this issue was closed. I tried this image instead, and so far (on smaller tests with 3-5k images, which also caused the stack overflow in the "latest" image) it's been working.

EDIT2: Still causes a stack overflow with 11k pictures.

jlesage commented 7 months ago

> is this rolled up into the available docker image too?

The latest version of Czkawka doesn't have the fix. The next version should include it.

> I went to Docker Hub and saw that the "latest" tag is not in fact the latest. There is an image with the tag 1140-2, which was committed around the same time as this issue was closed.

This was a non-official image to test a potential fix.

> EDIT2: Still causes a stack overflow with 11k pictures.

Can you try jlesage/czkawka:issue-1140 instead?
