Closed docwisdom closed 10 months ago
I tested hashing 300,000 files on my local machine and had no problems (it was just slow).
How many CPU cores do you have, and which OS do you use?
I tried again multiple times with smaller file counts, like 1600 or so and had the same issue. I think it may have something to do with the gradient hashing versions. If I switch to blockhash it seems to do better in testing.
I'm running on Unraid, which is based on Slackware. It's in a Docker container from this repo: https://hub.docker.com/r/jlesage/czkawka/
I just tested blockhash on 8k photos and it crashed again.
[xvnc ] Connections: accepted: /tmp/vnc.sock
[xvnc ] SConnection: Client needs protocol version 3.8
[xvnc ] SConnection: Client requests security type None(1)
[xvnc ] VNCSConnST: Server default pixel format depth 24 (32bpp) little-endian rgb888
[xvnc ] VNCSConnST: Client pixel format depth 24 (32bpp) little-endian bgr888
[xvnc ] ComparingUpdateTracker: 0 pixels in / 0 pixels out
[xvnc ] ComparingUpdateTracker: (1:-nan ratio)
[app ] 19:07:46.673 [INFO] czkawka_core::similar_images: find_similar_images
[app ] 19:13:32.743 [INFO] czkawka_core::similar_images: find_similar_images: Done in 346.07s
[app ] 19:15:14.553 [INFO] czkawka_core::similar_images: find_similar_images
[app ] 19:17:06.562 [INFO] czkawka_core::similar_images: find_similar_images: Done in 112.01s
[app ] 19:18:10.395 [INFO] czkawka_core::similar_images: find_similar_images
[app ] thread '<unknown>' has overflowed its stack
[app ] fatal runtime error: stack overflow
Can you somehow add RUST_LOG=debug to the environment variables of this app?
This shows that the stack overflow happens in the similar images tool, but it does not show the exact function that causes the problem.
By default, most Linux distros have an 8 MB stack, which should be enough for this app, but Slackware is quite an old distribution and may have different limits (it looks like it can have a 1 MB stack size: https://slackwiki.com/Resource_Limits).
Can you check what ulimit -s returns? On my OS it returns 8192 [KB].
How many CPU cores/threads does the server have?
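For readers who want to check the same limit from inside a process rather than from the shell, here is a minimal sketch in Rust. It assumes a Linux system where /proc/self/limits is readable; the helper name soft_stack_limit is made up for this example and is not part of Czkawka. Note that ulimit -s reports KiB, while /proc reports bytes (or the word "unlimited").

```rust
use std::fs;

/// Extracts the soft "Max stack size" value from text in the
/// /proc/<pid>/limits format. Hypothetical helper, for illustration only.
fn soft_stack_limit(limits: &str) -> Option<&str> {
    limits
        .lines()
        .find(|line| line.starts_with("Max stack size"))
        // Whitespace-separated fields: "Max", "stack", "size",
        // soft limit, hard limit, units.
        .and_then(|line| line.split_whitespace().nth(3))
}

fn main() {
    // Assumption: Linux-only path; this fails gracefully elsewhere.
    match fs::read_to_string("/proc/self/limits") {
        Ok(text) => match soft_stack_limit(&text) {
            Some(limit) => println!("soft stack limit: {limit}"),
            None => println!("no \"Max stack size\" line found"),
        },
        Err(err) => eprintln!("could not read /proc/self/limits: {err}"),
    }
}
```

Running this inside the container, rather than on the host, shows the limit that the application process actually inherits.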
Unraid is a custom build. It uses an up-to-date kernel, but I don't know much about its inner workings.
root@NAS:~# ulimit -s
unlimited
14 cores, 28 threads
[supervisor ] loading service 'openbox'...
[supervisor ] loading service 'logmonitor'...
[supervisor ] service 'logmonitor' is disabled.
[supervisor ] loading service 'logrotate'...
[supervisor ] all services loaded.
[supervisor ] starting services...
[supervisor ] starting service 'xvnc'...
[xvnc ] Xvnc TigerVNC 1.13.1 - built Nov 10 2023 13:43:39
[xvnc ] Copyright (C) 1999-2022 TigerVNC Team and many others (see README.rst)
[xvnc ] See https://www.tigervnc.org for information on TigerVNC.
[xvnc ] Underlying X server release 12014000
[xvnc ] Thu Nov 23 12:57:23 2023
[xvnc ] vncext: VNC extension running!
[xvnc ] vncext: Listening for VNC connections on /tmp/vnc.sock (mode 0660)
[xvnc ] vncext: Listening for VNC connections on all interface(s), port 5900
[xvnc ] vncext: created VNC server for screen 0
[supervisor ] starting service 'nginx'...
[nginx ] Listening for HTTP connections on port 5800.
[supervisor ] starting service 'openbox'...
[supervisor ] starting service 'app'...
[supervisor ] all services started.
[app ] 20:57:25.680 [INFO] czkawka_core::common: Czkawka version: 6.1.0, was compiled with release mode
[app ] 20:57:26.264 [INFO] czkawka_gui: Set thread number to 28
[xvnc ] Thu Nov 23 12:58:01 2023
[xvnc ] Connections: accepted: /tmp/vnc.sock
[xvnc ] SConnection: Client needs protocol version 3.8
[xvnc ] SConnection: Client requests security type None(1)
[xvnc ] VNCSConnST: Server default pixel format depth 24 (32bpp) little-endian rgb888
[xvnc ] VNCSConnST: Client pixel format depth 24 (32bpp) little-endian bgr888
[xvnc ] ComparingUpdateTracker: 0 pixels in / 0 pixels out
[xvnc ] ComparingUpdateTracker: (1:-nan ratio)
[app ] 20:58:11.345 [DEBUG] czkawka_gui::connect_things::connect_button_search: clean_tree_view
[app ] 20:58:11.345 [DEBUG] czkawka_gui::connect_things::connect_button_search: clean_tree_view: Done in 1.04µs
[app ] 20:58:11.347 [INFO] czkawka_core::similar_images: find_similar_images
[app ] 20:58:11.348 [DEBUG] czkawka_core::similar_images: check_for_similar_images
[app ] 20:58:11.584 [DEBUG] czkawka_core::common: send_info_and_wait_for_ending_all_threads
[app ] 20:58:11.589 [DEBUG] czkawka_core::common: send_info_and_wait_for_ending_all_threads: Done in 4.86ms
[app ] 20:58:11.589 [DEBUG] czkawka_core::similar_images: check_for_similar_images: Done in 241.04ms
[app ] 20:58:11.589 [DEBUG] czkawka_core::similar_images: hash_images
[app ] 20:58:11.589 [DEBUG] czkawka_core::similar_images: hash_images_load_cache
[app ] 20:58:11.589 [DEBUG] czkawka_core::common_cache: load_cache_from_file_generalized_by_path
[app ] 20:58:11.589 [DEBUG] czkawka_core::common_cache: load_cache_from_file_generalized
[app ] 20:58:11.616 [DEBUG] czkawka_core::common_cache: Starting removing outdated cache entries (removing non existent files from cache - true)
[app ] 20:58:11.710 [DEBUG] czkawka_core::common_cache: Completed removing outdated cache entries, removed 0 out of all 3845 entries
[app ] 20:58:11.710 [DEBUG] czkawka_core::common_cache: Loaded cache from file cache_similar_images_32_Blockhash_Lanczos3_61.bin (or json alternative) - 3845 results
[app ] 20:58:11.710 [DEBUG] czkawka_core::common_cache: load_cache_from_file_generalized: Done in 120.77ms
[app ] 20:58:11.710 [DEBUG] czkawka_core::common_cache: Converting cache Vec
In the hash_images function I cannot find any place that could use more than a few kilobytes of stack, so I don't know why the stack overflows.
Limiting the number of cores used is probably the easiest workaround (I have 8 threads and have never had similar problems; I think that 15 or 20 should also work fine, but this needs to be tested).
I'll try that now
Note that this version of Czkawka is compiled against musl instead of glibc. The thread stack size allocated by musl is 128K by default, which is small compared to the few MB allocated by glibc (https://wiki.musl-libc.org/functional-differences-from-glibc.html).
Sorry this is beyond my comprehension. Is there a fix?
The comment was for @qarmin, so he can see if currently Czkawka could approach the thread stack size limit of musl.
@docwisdom, to see if it's a stack size issue, could you try to run the following commands inside the container? This will increase the default stack size to 1MB.
export GOPATH=/go
add-pkg go git musl-dev
go install github.com/yaegashi/muslstack@latest
cp /usr/bin/czkawka_gui /usr/bin/czkawka_gui2
/go/bin/muslstack -s 0x100000 /usr/bin/czkawka_gui2
mv /usr/bin/czkawka_gui2 /usr/bin/czkawka_gui
Then restart the container and see if it's crashing again. If it does, you can try to increase to 8MB:
cp /usr/bin/czkawka_gui /usr/bin/czkawka_gui2
/go/bin/muslstack -s 0x800000 /usr/bin/czkawka_gui2
mv /usr/bin/czkawka_gui2 /usr/bin/czkawka_gui
Thanks for this. I tried both the 1 MB and 8 MB settings, and it still crashed at the end of the hashing.
I'll note that [std::thread::Builder](https://doc.rust-lang.org/1.8.0/std/thread/struct.Builder.html) lets you specify the stack size of the created thread from within the program. Only the stack size of the main thread is set by the OS.
So it is possible that the thread stack size was set here, and that is why this did not work (the main thread in the GUI is not responsible for the heavy calculation).
I already tried setting the stack size in rayon to 1 byte with https://docs.rs/rayon/latest/rayon/struct.ThreadPoolBuilder.html#method.stack_size to provoke a crash, but everything worked fine, so I'm not sure where the problem could be.
I tried to debug the stack size with https://crates.io/crates/cargo-call-stack, but it looks like that is not possible due to several crashes, and I don't know which other tool I could use to debug this problem.
In https://github.com/qarmin/czkawka/pull/1102 I changed some stack size values, which may fix the problem, but for me these values just work, so I cannot test whether this will fix the problem:
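To illustrate the point about setting a thread's stack size from within the program, here is a minimal sketch using only the standard library. The function names recurse and run_with_stack, the 1 KiB buffer, and the recursion depth are made up for this example; it does not reproduce Czkawka's or rayon's actual code.

```rust
use std::hint::black_box;
use std::thread;

// Deliberately stack-hungry: each call keeps a 1 KiB buffer alive
// in its own frame until the nested call returns.
fn recurse(depth: usize) -> usize {
    let buf = [0u8; 1024];
    // black_box stops the optimiser from discarding the buffer.
    black_box(&buf);
    if depth == 0 {
        0
    } else {
        1 + recurse(depth - 1)
    }
}

// Runs `recurse` on a new thread whose stack size is requested explicitly,
// independent of the limit that applies to the main thread.
fn run_with_stack(stack_bytes: usize, depth: usize) -> usize {
    thread::Builder::new()
        .name("worker".into())
        .stack_size(stack_bytes)
        .spawn(move || recurse(depth))
        .expect("failed to spawn thread")
        .join()
        .expect("worker thread panicked")
}

fn main() {
    // 4000 frames of roughly 1 KiB each need several MiB of stack.
    let reached = run_with_stack(8 * 1024 * 1024, 4000);
    println!("reached recursion depth {reached}");
}
```

The same idea applies to thread pools: whichever code creates the worker threads decides their stack size, so changing the limit for the main thread alone may have no effect on the workers.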
Will this be in an upcoming release?
Yes, binaries to test are already available here: https://github.com/qarmin/czkawka/actions/runs/6992056327, but since they are built with glibc, not musl, running them cannot tell us whether the problem has been fixed.
I've reproduced the stack overflow error. I'm currently testing a version that sets the stack size in rayon. I will let you know about the result.
Thank you
In the end, I don't seem to be able to reproduce it consistently.
@docwisdom, could you try the jlesage/czkawka:issue-1140 Docker image and see if you can reproduce it?
I've done 3 test batches so far (3-6k each) with no crashes. I think this may have resolved the issue. I am going to do a larger batch this morning.
Ran 160,000 photo comparisons using 32 gradient and it completed successfully. I would consider the issue resolved. Thank you.
@qarmin, this is the patch that @docwisdom tested:
--- a/czkawka_core/src/common.rs 2023-11-24 14:45:40.462095198 -0500
+++ b/czkawka_core/src/common.rs 2023-11-24 14:47:29.678337169 -0500
@@ -76,7 +76,7 @@
pub fn set_number_of_threads(thread_number: usize) {
NUMBER_OF_THREADS.set(thread_number);
- rayon::ThreadPoolBuilder::new().num_threads(get_number_of_threads()).build_global().unwrap();
+ rayon::ThreadPoolBuilder::new().num_threads(get_number_of_threads()).stack_size(8*1024*1204).build_global().unwrap();
}
pub const RAW_IMAGE_EXTENSIONS: &[&str] = &[
Do you want to integrate the change yourself, or do you want me to create a PR?
I already added slightly different limits: https://github.com/qarmin/czkawka/issues/1140#issuecomment-1826019763
OK, yes: in this PR, 4MB (DEFAULT_WORKER_THREAD_SIZE) is used instead of 8MB. @docwisdom, I pushed jlesage/czkawka:issue-1140-2, if you want to confirm that it's still working with a 4MB stack.
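A side note on units: the stack-size expression in the patch above is written as 8*1024*1204 rather than 8*1024*1024. The sketch below just works through the arithmetic; the constant and function names are illustrative and are not taken from Czkawka.

```rust
const KIB: usize = 1024;
const MIB: usize = 1024 * KIB;

// The expression exactly as it appears in the quoted patch.
const PATCH_STACK_BYTES: usize = 8 * 1024 * 1204;

// Converts a byte count to MiB for display.
fn to_mib(bytes: usize) -> f64 {
    bytes as f64 / MIB as f64
}

fn main() {
    println!(
        "patch literal: {} bytes (~{:.2} MiB)",
        PATCH_STACK_BYTES,
        to_mib(PATCH_STACK_BYTES)
    );
    println!("exact 8 MiB:   {} bytes", 8 * MIB);
    println!("exact 4 MiB:   {} bytes", 4 * MIB);
}
```

If the tested image was built from the patch exactly as quoted, each worker thread requested roughly 9.4 MiB, slightly more than the 8 MB mentioned in the discussion.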
tested on 3900 photos, no issues
@jlesage @qarmin hey, is this rolled up into the available docker image too? I am experiencing the same issue when I run anything other than the standard selected algorithm for similar images. Tested on 3 machines of varying CPU strength, all resulting in the same issue. (pulled jlesage/czkawka image via docker compose)
EDIT: I went to Docker Hub and saw that the "latest" tag is not in fact the latest. There is an image with the tag 1140-2, which was committed around the same time as this issue was closed. I tried to use this image instead, and thus far (on smaller tests with 3-5k images, which also caused the stack overflow in the "latest" image) it's been working.
EDIT2: Still causes a stack overflow with 11k pictures.
is this rolled up into the available docker image too?
The latest version of Czkawka doesn't have the fix. The next version should include it.
I went to dockerhub and saw that the "latest" tag is not in fact the latest. There is an image with the tag 1140-2, which was committed around the same time as this issue was closed.
This was a non-official image to test a potential fix.
EDIT2: Still causes a stack overflow with 11k pictures.
Can you try jlesage/czkawka:issue-1140 instead?
[xvnc ] Tue Nov 21 14:22:35 2023
[xvnc ] Connections: accepted: /tmp/vnc.sock
[xvnc ] SConnection: Client needs protocol version 3.8
[xvnc ] SConnection: Client requests security type None(1)
[xvnc ] VNCSConnST: Server default pixel format depth 24 (32bpp) little-endian rgb888
[xvnc ] VNCSConnST: Client pixel format depth 24 (32bpp) little-endian bgr888
[xvnc ] Tue Nov 21 14:27:46 2023
[xvnc ] VNCSConnST: closing /tmp/vnc.sock: Clean disconnection
[xvnc ] EncodeManager: Framebuffer updates: 1523
[xvnc ] EncodeManager: Tight:
[xvnc ] EncodeManager: Solid: 34 rects, 1.23945 Mpixels
[xvnc ] EncodeManager: 544 B (1:9114.32 ratio)
[xvnc ] EncodeManager: Bitmap RLE: 18 rects, 13.809 kpixels
[xvnc ] EncodeManager: 582 B (1:95.2784 ratio)
[xvnc ] EncodeManager: Indexed RLE: 2.615 krects, 429.034 kpixels
[xvnc ] EncodeManager: 409.938 KiB (1:4.16297 ratio)
[xvnc ] EncodeManager: Tight (JPEG):
[xvnc ] EncodeManager: Full Colour: 1.622 krects, 2.05636 Mpixels
[xvnc ] EncodeManager: 3.23682 MiB (1:2.42922 ratio)
[xvnc ] EncodeManager: Total: 4.289 krects, 3.73865 Mpixels
[xvnc ] EncodeManager: 3.63822 MiB (1:3.93348 ratio)
[xvnc ] Connections: closed: /tmp/vnc.sock
[xvnc ] ComparingUpdateTracker: 24.3215 Mpixels in / 1.10949 Mpixels out
[xvnc ] ComparingUpdateTracker: (1:21.9214 ratio)
[xvnc ] Tue Nov 21 14:38:02 2023
[xvnc ] Connections: accepted: /tmp/vnc.sock
[xvnc ] Tue Nov 21 14:38:03 2023
[xvnc ] SConnection: Client needs protocol version 3.8
[xvnc ] SConnection: Client requests security type None(1)
[xvnc ] VNCSConnST: Server default pixel format depth 24 (32bpp) little-endian rgb888
[xvnc ] VNCSConnST: Client pixel format depth 24 (32bpp) little-endian bgr888
[app ] thread '' has overflowed its stack
[app ] fatal runtime error: stack overflow
[supervisor ] service 'app' exited (got signal SIGABRT).
[supervisor ] service 'app' exited, shutting down...
[supervisor ] stopping service 'openbox'...
[supervisor ] service 'openbox' exited (with status 0).
[supervisor ] stopping service 'nginx'...
Bug Description: When doing image comparison, it fails just after hashing 180,000 images, before showing results in the GUI.