simon987 / sist2

Lightning-fast file system indexer and search tool
GNU General Public License v3.0

Crashes on OCR image #325

Open lyc8503 opened 1 year ago

lyc8503 commented 1 year ago

Device Information (please complete the following information):

Command with arguments: docker run --rm -v /mnt/user/:/tmp/host:ro -v /mnt/user/appdata/sist2_idx/:/idx -v /mnt/user/appdata/sist2_idx/chi_sim.traineddata:/usr/share/tessdata/chi_sim.traineddata:ro simon987/sist2:2.12.1-x64-linux -t 2 --ocr-lang eng+chi_sim --ocr-images --checksums scan /tmp/host/storage -o /idx/storage

Describe the bug: sist2 exits abnormally with the following log.

void ScratchEvidence::UpdateSumOfProtoEvidences(INT_CLASS, BIT_VECTOR): Assertion `ClassTemplate->ProtoLengths[ActualProtoNum] < MAX_PROTO_INDEX' failed.

The container was removed after it finished, so I am sorry that I could not get the full log. I will attach additional information when I reproduce the crash.

Steps To Reproduce: Not sure. I came across the problem while scanning my files. It seems to have been scanning an image inside a zip file.

simon987 commented 1 year ago

Thanks! Please add a comment to this issue when you add the additional information so that I get the notification.

lyc8503 commented 1 year ago

It crashed again. Detailed logs attached below.

sist2: /vcpkg/buildtrees/tesseract/src/4.1.1-81819c4317.clean/src/classify/intmatcher.cpp:1155: void ScratchEvidence::UpdateSumOfProtoEvidences(INT_CLASS, BIT_VECTOR): Assertion `ClassTemplate->ProtoLengths[ActualProtoNum] < MAX_PROTO_INDEX' failed.

[14B0A49FF640] [2023-01-12 04:35:11] [ERROR *SIGNAL HANDLER*] =============================================

[14B0A47FE640] [2023-01-12 04:35:11] [ERROR /tmp/host/storage/<REMOVED>/Users/<REMOVED>/Documents/Tencent Files/<REMOVED>/Image/Group2/Z$/J{/Z$J{C]Z%8L)L4_17VIJL$E1.jpg] (media.c) avformat_open_input() returned [-2] No such file or directory
[14B0A49FF640] [2023-01-12 04:35:11] [ERROR *SIGNAL HANDLER*] Uh oh! Caught fatal signal: Aborted
[14B0A49FF640] [2023-01-12 04:35:11] [DEBUG *SIGNAL HANDLER*] THREAD [14B0A47FE640] was working on job /tmp/host/storage/<REMOVED>/Users/<REMOVED>/Documents/Tencent Files/<REMOVED>/Image/Group2/Z$/J{/Z$J{C]Z%8L)L4_17VIJL$E1.jpg
[14B0A49FF640] [2023-01-12 04:35:11] [DEBUG *SIGNAL HANDLER*] THREAD [14B0A49FF640] was working on job /tmp/host/storage/<REMOVED>/Documents/Tencent Files/<REMOVED>/FileRecv/一些中文.zip#/中文/中文/2018/中文/中文/中文#/中文.pdf
[14B0A49FF640] [2023-01-12 04:35:11] [DEBUG tpool.c] pool->thread_cnt = 2
[14B0A49FF640] [2023-01-12 04:35:11] [DEBUG tpool.c] pool->work_cnt = 1881282
[14B0A49FF640] [2023-01-12 04:35:11] [DEBUG tpool.c] pool->done_cnt = 1635072
[14B0A49FF640] [2023-01-12 04:35:11] [DEBUG tpool.c] pool->busy_cnt = 2
[14B0A49FF640] [2023-01-12 04:35:11] [DEBUG tpool.c] pool->stop = 0
[14B0A49FF640] [2023-01-12 04:35:11] [INFO *SIGNAL HANDLER*] Please consider creating a bug report at https://github.com/simon987/sist2/issues !
[14B0A49FF640] [2023-01-12 04:35:11] [INFO *SIGNAL HANDLER*] sist2 is an open source project and relies on the collaboration of its users to diagnose and fix bugs
[14B0A49FF640] [2023-01-12 04:35:11] [WARNING *SIGNAL HANDLER*] You are running sist2 in release mode! Please consider downloading the debug binary from the Github releases page to provide additionnal information when submitting a bug report.
ObjectCache(0x558e0d1247e0)::~ObjectCache(): WARNING! LEAK! object 0x558e0e264300 still has count 1 (id /usr/share/tessdata/eng.traineddatapunc-dawg)
ObjectCache(0x558e0d1247e0)::~ObjectCache(): WARNING! LEAK! object 0x558e0f094430 still has count 1 (id /usr/share/tessdata/eng.traineddataword-dawg)
ObjectCache(0x558e0d1247e0)::~ObjectCache(): WARNING! LEAK! object 0x558e0f094490 still has count 1 (id /usr/share/tessdata/eng.traineddatanumber-dawg)
ObjectCache(0x558e0d1247e0)::~ObjectCache(): WARNING! LEAK! object 0x558e0f094530 still has count 1 (id /usr/share/tessdata/eng.traineddatabigram-dawg)
ObjectCache(0x558e0d1247e0)::~ObjectCache(): WARNING! LEAK! object 0x558e0f094b60 still has count 1 (id /usr/share/tessdata/eng.traineddatafreq-dawg)
ObjectCache(0x558e0d1247e0)::~ObjectCache(): WARNING! LEAK! object 0x558e0fad7eb0 still has count 1 (id /usr/share/tessdata/chi_sim.traineddatapunc-dawg)
ObjectCache(0x558e0d1247e0)::~ObjectCache(): WARNING! LEAK! object 0x558e0fd1d200 still has count 1 (id /usr/share/tessdata/chi_sim.traineddataword-dawg)
ObjectCache(0x558e0d1247e0)::~ObjectCache(): WARNING! LEAK! object 0x558e16c518c0 still has count 1 (id /usr/share/tessdata/chi_sim.traineddatanumber-dawg)
ObjectCache(0x558e0d1247e0)::~ObjectCache(): WARNING! LEAK! object 0x558e16c51860 still has count 1 (id /usr/share/tessdata/chi_sim.traineddatafreq-dawg)
[14B0A47FE640] [2023-01-12 04:35:11] [DEBUG /tmp/host/storage/<REMOVED>/Users/<REMOVED>/Documents/Tencent Files/<REMOVED>/Image/Group2/Z$/J{/Z$J{C]Z%8L)L4_17VIJL$E1.jpg] Starting parse job {b5573cbdbc1e225b9ec27c2944c024fa}

I removed parts of the paths that appear in the log since they are sensitive. The first path is a JPG; the second is a PDF inside a zip file that is itself inside another zip file.

simon987 commented 1 year ago

Hi, could you attach the JPG and PDF files, or send them to me privately at me@simon987.net?

lyc8503 commented 1 year ago

I have sent the files via email, please check.

simon987 commented 1 year ago

I received the files, thanks

simon987 commented 1 year ago

Hi, can you try again with this image: https://hub.docker.com/layers/simon987/sist2/x64-linux/images/sha256-10ae63d49163d64afaf98b9c1bac68218ad33fa0acbec0987dd8c4453aaeaf30 ? (https://ci.simon987.net/simon987/sist2/240/1/4)

I added the chi_sim OCR trained data to the Docker image, so you don't need to mount it anymore.

You will need to add --ocr-ebooks if you want to enable OCR for the .pdf file.

lyc8503 commented 1 year ago

Thanks, I am trying it. It takes some time on my machine and I will let you know when it finishes.

lyc8503 commented 1 year ago

Command with arguments: docker run --rm -v /mnt/user/:/tmp/host:ro -v /mnt/user/appdata/sist2_idx/:/idx simon987/sist2:x64-linux -t 2 --ocr-lang eng+chi_sim --ocr-images --ocr-ebooks --checksums scan /tmp/host/storage -o /idx/storage

Sadly, it crashes again with the log below.

OSD: Weak margin (5.67) for 630 blob text block, but using orientation anyway: 1
Detected 318 diacritics
Estimating resolution as 223
OSD: Weak margin (0.04) for 55 blob text block, but using orientation anyway: 3
Estimating resolution as 416
OSD: Weak margin (1.55) for 54 blob text block, but using orientation anyway: 0
Estimating resolution as 454
Estimating resolution as 237
Image too small to scale!! (2x36 vs min width of 3)
Line cannot be recognized!!
Estimating resolution as 161
Estimating resolution as 161
Detected 349 diacritics
Detected 270 diacritics
Estimating resolution as 160
OSD: Weak margin (5.13) for 1503 blob text block, but using orientation anyway: 0
Detected 48 diacritics
sist2: /vcpkg/buildtrees/tesseract/src/4.1.1-81819c4317.clean/src/classify/intmatcher.cpp:1155: void ScratchEvidence::UpdateSumOfProtoEvidences(INT_CLASS, BIT_VECTOR): Assertion `ClassTemplate->ProtoLengths[ActualProtoNum] < MAX_PROTO_INDEX' failed.

[14AC8ABFF640] [2023-01-14 16:17:34] [ERROR *SIGNAL HANDLER*] =============================================

[14AC8ABFF640] [2023-01-14 16:17:34] [ERROR *SIGNAL HANDLER*] Uh oh! Caught fatal signal: Aborted
[14AC8ABFF640] [2023-01-14 16:17:34] [DEBUG *SIGNAL HANDLER*] THREAD [14AC8A9FE640] was working on job /tmp/host/storage/.Recycle.Bin/#recycle/一些中文/某某/某某/某某_1.zip#/�߶��Ϻ���/����ʡ�ձ����У����ݡ���������Ƹۡ��Ǩ��2020������һ�ε��п��ԣ���ĩ���ԣ���ѧ���⣨PDF�棩1.pdf
[14AC8ABFF640] [2023-01-14 16:17:34] [DEBUG *SIGNAL HANDLER*] THREAD [14AC8ABFF640] was working on job /tmp/host/storage/.Recycle.Bin/#recycle/temp/某某.zip#/�߿�/ǿ����/���߿�-רҵ̽����.png
[14AC8ABFF640] [2023-01-14 16:17:34] [DEBUG tpool.c] pool->thread_cnt = 2
[14AC8ABFF640] [2023-01-14 16:17:34] [DEBUG tpool.c] pool->work_cnt = 1015796
[14AC8ABFF640] [2023-01-14 16:17:34] [DEBUG tpool.c] pool->done_cnt = 15796
[14AC8ABFF640] [2023-01-14 16:17:34] [DEBUG tpool.c] pool->busy_cnt = 2
[14AC8ABFF640] [2023-01-14 16:17:34] [DEBUG tpool.c] pool->stop = 0
[14AC8ABFF640] [2023-01-14 16:17:34] [INFO *SIGNAL HANDLER*] Please consider creating a bug report at https://github.com/simon987/sist2/issues !
[14AC8ABFF640] [2023-01-14 16:17:34] [INFO *SIGNAL HANDLER*] sist2 is an open source project and relies on the collaboration of its users to diagnose and fix bugs
[14AC8ABFF640] [2023-01-14 16:17:34] [WARNING *SIGNAL HANDLER*] You are running sist2 in release mode! Please consider downloading the debug binary from the Github releases page to provide additionnal information when submitting a bug report.
ObjectCache(0x55c102373020)::~ObjectCache(): WARNING! LEAK! object 0x55c103065300 still has count 2 (id /usr/share/tessdata/eng.traineddatapunc-dawg)
ObjectCache(0x55c102373020)::~ObjectCache(): WARNING! LEAK! object 0x55c103e95430 still has count 2 (id /usr/share/tessdata/eng.traineddataword-dawg)
ObjectCache(0x55c102373020)::~ObjectCache(): WARNING! LEAK! object 0x55c103e95490 still has count 2 (id /usr/share/tessdata/eng.traineddatanumber-dawg)
ObjectCache(0x55c102373020)::~ObjectCache(): WARNING! LEAK! object 0x55c103e95530 still has count 2 (id /usr/share/tessdata/eng.traineddatabigram-dawg)
ObjectCache(0x55c102373020)::~ObjectCache(): WARNING! LEAK! object 0x55c103e95b60 still has count 2 (id /usr/share/tessdata/eng.traineddatafreq-dawg)
ObjectCache(0x55c102373020)::~ObjectCache(): WARNING! LEAK! object 0x55c1048d8eb0 still has count 2 (id /usr/share/tessdata/chi_sim.traineddatapunc-dawg)
ObjectCache(0x55c102373020)::~ObjectCache(): WARNING! LEAK! object 0x55c104b1e200 still has count 2 (id /usr/share/tessdata/chi_sim.traineddataword-dawg)
ObjectCache(0x55c102373020)::~ObjectCache(): WARNING! LEAK! object 0x55c10ba528c0 still has count 2 (id /usr/share/tessdata/chi_sim.traineddatanumber-dawg)
ObjectCache(0x55c102373020)::~ObjectCache(): WARNING! LEAK! object 0x55c10ba52860 still has count 2 (id /usr/share/tessdata/chi_sim.traineddatafreq-dawg)

I notice there are some decoding errors in the log. Zip files created on Windows use GB2312 by default, rather than UTF-8, to encode Chinese characters. Could that be the cause?

I am also willing to offer the file samples if you need them.
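To illustrate the encoding question, here is a minimal Python sketch (not sist2 code, which is written in C and may decode archive names differently). It assumes the archive stores its names in a legacy Chinese code page and leaves the zip UTF-8 flag (general purpose bit 11) unset. Python's zipfile module decodes such names as cp437, so the original bytes can be recovered by re-encoding as cp437 and decoding with the assumed code page. The function names and the choice of "gbk" (a superset of GB2312) are my own assumptions.

```python
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: entry name is stored as UTF-8


def recover_zip_name(name: str, flag_bits: int, legacy_encoding: str = "gbk") -> str:
    """Return a best-effort readable name for a zip entry.

    `legacy_encoding` is a guess; the archive itself does not record it.
    """
    if flag_bits & UTF8_FLAG:
        return name  # already decoded as UTF-8 by zipfile
    try:
        # zipfile decoded the raw bytes as cp437; undo that, then
        # decode the original bytes with the assumed legacy code page.
        return name.encode("cp437").decode(legacy_encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name  # the guess did not fit; keep the name unchanged


def list_names(path: str, legacy_encoding: str = "gbk") -> list[str]:
    """List entry names of the archive at `path` with legacy names repaired."""
    with zipfile.ZipFile(path) as zf:
        return [
            recover_zip_name(info.filename, info.flag_bits, legacy_encoding)
            for info in zf.infolist()
        ]
```

Names that are pure ASCII pass through unchanged, and names that cannot be decoded with the guessed code page are returned as they were rather than raising an error.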

simon987 commented 1 year ago

Hi, that could be the cause. Could you send an example .zip file that causes the crash?

lyc8503 commented 1 year ago

I have sent it via email.

lyc8503 commented 1 year ago

It's usually hard to process strings containing Chinese, especially those generated on Windows, because different encodings constantly cause trouble. The zip file I sent you can be unzipped successfully on Windows with both the built-in unzip tool and 7-Zip, but it causes errors on Linux and Google Drive.

Could we just suppress the error, skip the affected files, and emit a warning instead of crashing the whole program?
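One general way to get this "skip and warn" behaviour is sketched below in Python. It is only an illustration of the idea, not how sist2 works: a failed assertion ends in abort(), which kills the process that hit it, so the risky step has to run in a separate child process for the parent to survive. The function name `run_isolated`, the logger name, and the timeout value are hypothetical.

```python
import logging
import subprocess
import sys

log = logging.getLogger("indexer")  # hypothetical logger name


def run_isolated(cmd, timeout=60.0):
    """Run `cmd` in a child process and return its stdout.

    Returns None, after logging a warning, if the child times out,
    is killed by a signal, or exits with a non-zero status.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        log.warning("skipped (timeout): %s", cmd)
        return None
    if proc.returncode < 0:
        # On POSIX, a negative return code is the number of the signal
        # that terminated the child (SIGABRT for a failed assertion).
        log.warning("skipped (killed by signal %d): %s", -proc.returncode, cmd)
        return None
    if proc.returncode != 0:
        log.warning("skipped (exit status %d): %s", proc.returncode, cmd)
        return None
    return proc.stdout


if __name__ == "__main__":
    # Stand-in for a worker that aborts, as a failed assertion would.
    crashing = [sys.executable, "-c", "import os; os.abort()"]
    healthy = [sys.executable, "-c", "print('ok')"]
    for cmd in (crashing, healthy):
        print(run_isolated(cmd))
```

The trade-off is the cost of starting a process per file or per batch, which matters for a scan of over a million jobs like the one in the logs above.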