nextcloud / fulltextsearch

🔍 Core of the full-text search framework for Nextcloud
GNU Affero General Public License v3.0

Fulltextsearch hangs with complex PDF due to Ghostscript bug in version 10.0.0 #798

Open cue108 opened 11 months ago

cue108 commented 11 months ago

I know this is not strictly a Fulltextsearch issue, but I am posting it here so that people can find a simple solution a little quicker: Ubuntu 23.04 ships with Ghostscript 10.0.0, and so does the Nextcloud Docker image.

During an indexing process, I noticed that it got stuck on a particular PDF file, and I found out that a simple text extraction via Ghostscript was hanging.
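For anyone who wants to check whether Ghostscript itself is what hangs on a given file, a rough sketch is to run a plain text extraction under a time limit. The txtwrite device and the flags below are standard Ghostscript options and timeout comes from GNU coreutils. The function name check_pdf_extract, the GS_BIN override and the 60-second default are assumptions made for illustration, not something Fulltextsearch provides.

```shell
#!/bin/sh
# Sketch: run a plain-text extraction with a time limit to see whether gs hangs.
# GS_BIN and check_pdf_extract are illustrative names, not part of any tool.
GS_BIN="${GS_BIN:-gs}"

check_pdf_extract() {
    pdf="$1"
    limit="${2:-60}"   # seconds before we call it a hang (arbitrary default)

    # txtwrite extracts text; output is discarded, only the exit status matters.
    timeout "$limit" "$GS_BIN" -q -dNOPAUSE -dBATCH \
        -sDEVICE=txtwrite -sOutputFile=/dev/null "$pdf" >/dev/null 2>&1
    status=$?

    if [ "$status" -eq 124 ]; then
        # GNU timeout exits with 124 when it had to kill the command.
        echo "TIMEOUT after ${limit}s: $pdf"
    elif [ "$status" -ne 0 ]; then
        echo "FAILED (exit $status): $pdf"
    else
        echo "OK: $pdf"
    fi
    return "$status"
}
```

Usage would be something like check_pdf_extract /path/to/suspect.pdf 300. Very large PDFs can legitimately take a long time, so pick the limit generously. A file that times out with the old gs but finishes with a newer build points at Ghostscript rather than at Fulltextsearch.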

I downloaded the Ghostscript 10.02.0 source from the official GS site (https://ghostscript.com/releases/gsdnld.html) and built it.

If you get "cannot find -lXext" during the linker stage, simply install the libXext development package:

Ubuntu: sudo apt-get install libxext-dev
Fedora: sudo dnf install libXext-devel
Arch Linux: sudo pacman -S libxext

Then run the build again.

Find out where your gs is located with "which gs" and replace the binary. I built it on Ubuntu 23.04 and copied the binary into the official Nextcloud Docker image.
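The replacement step could look something like the sketch below. It is only an illustration: replace_gs_binary is a made-up helper name and the .bak suffix is an arbitrary choice. It also assumes you pass it the path of the freshly built binary, which the Ghostscript build normally leaves at bin/gs inside the source tree. Run it with enough privileges to write to the target directory.

```shell
#!/bin/sh
# Sketch: swap in a newly built gs binary, keeping a backup of the old one.
# replace_gs_binary is an illustrative helper, not an existing tool.
replace_gs_binary() {
    new_bin="$1"                       # e.g. ./bin/gs from the build tree
    target="${2:-$(command -v gs)}"    # default: the gs found in PATH

    [ -f "$new_bin" ] || { echo "no such file: $new_bin" >&2; return 1; }
    [ -n "$target" ]  || { echo "gs not found in PATH" >&2; return 1; }

    # Keep the distribution binary around so the change can be undone.
    if [ -f "$target" ]; then
        cp -p "$target" "$target.bak" || return 1
    fi
    cp "$new_bin" "$target" && chmod 755 "$target"
}
```

On the host this would be called as replace_gs_binary ./bin/gs. For a container, docker cp ./bin/gs <container>:<path reported by "which gs" inside the container> does the same job, with the container name and path depending on your setup.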

gs -version
GPL Ghostscript 10.02.0 (2023-09-13)
Copyright (C) 2023 Artifex Software, Inc.  All rights reserved.
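To check quickly whether a machine still has the old version, something like the following may help. It relies on gs --version printing just the version number and on sort -V from GNU coreutils. Treating everything older than 10.02.0 as suspect is an assumption based on this thread, not a statement from the Ghostscript changelog, and gs_needs_upgrade is a made-up name.

```shell
#!/bin/sh
# Sketch: flag Ghostscript versions older than the one that worked in this thread.
# gs_needs_upgrade is an illustrative name; the 10.02.0 default comes from this thread.
gs_needs_upgrade() {
    current="$1"
    wanted="${2:-10.02.0}"

    # Same version: nothing to do.
    [ "$current" = "$wanted" ] && return 1

    # sort -V orders version strings; if $current sorts first, it is older.
    oldest="$(printf '%s\n%s\n' "$current" "$wanted" | sort -V | head -n 1)"
    [ "$oldest" = "$current" ]
}

# Only query gs if it is actually installed.
if command -v gs >/dev/null 2>&1; then
    if gs_needs_upgrade "$(gs --version)"; then
        echo "gs $(gs --version) is older than 10.02.0"
    else
        echo "gs $(gs --version) is 10.02.0 or newer"
    fi
fi
```

A second argument overrides the threshold, for example gs_needs_upgrade "$(gs --version)" 10.03.0.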

That solved my issue with a hanging index run.

May it help!

codejp3 commented 4 months ago

> During an indexing process, I noticed that it got stuck on a particular PDF file, and I found out that a simple text extraction via Ghostscript was hanging.

I have a single PDF file that hangs too. It's complex and 23,404 pages long. No errors in FTS, and it just "hangs". How were you able to verify that it was gs causing the issue?

I'm tempted to go ahead and force an upgrade of GS like you did above. My server is Debian 12, which is also currently set to GS 10.00.0 in the repo. I'd kinda like to verify that's the issue first though.

cue108 commented 4 months ago

I was looking through the gs bug tracker back then, and I simply built the latest version and replaced the binary. Then it worked.

codejp3 commented 4 months ago

Thank you! I have several complex files that just "hang" with no errors or indication of the problem. Trying your fix now.

EDIT: Reporting back. Several files that would "hang" have been indexed successfully. I probably have several more days of indexing until the process is complete due to the number and size of files I have, but there is a NOTICEABLE DIFFERENCE with GS 10.03.0!

THANK YOU!!!!!!