tpwd / ke_search

Search Extension for TYPO3 Content Management System, including faceting search functions.
https://extensions.typo3.org/extension/ke_search/
GNU General Public License v3.0

Error while indexing PDF files #140

Closed markustobler closed 1 year ago

markustobler commented 1 year ago

I got the following errors in my cronjob:

Syntax Error: Marked object is wrong type (boolean)
Syntax Error: Marked object is wrong type (boolean)
Syntax Error: Marked object is wrong type (boolean)
Syntax Error: Invalid object stream
Syntax Error: Invalid object stream
Syntax Error: Invalid object stream
Syntax Error: Marked object is wrong type (boolean)
Syntax Error: Marked object is wrong type (boolean)
Syntax Error: Marked object is wrong type (boolean)
Syntax Error: Marked object is wrong type (boolean)
Syntax Error: Marked object is wrong type (boolean)

These errors come from pdftotext, which apparently has a problem reading the PDF files. At the very least, the error handling could be improved so that the ke_search log shows which files are problematic.

An example PDF file that causes an error is attached to the issue.

nvs-seminarprogramm-2023-1.pdf
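
Until the indexer reports file names itself, one way to find the problematic files is to run pdftotext on each PDF separately and keep its stderr output per file. The following is a minimal sketch of that idea, not part of ke_search: the function name `check_pdfs` and the injectable `run` parameter are assumptions made for illustration, and it presumes that `pdftotext` is on the `PATH`.

```python
import subprocess


def check_pdfs(paths, tool="pdftotext", run=subprocess.run):
    """Return {path: stderr text} for every PDF on which the tool reported a problem.

    `check_pdfs` and the `run` parameter are hypothetical helpers for this
    sketch; `run` exists only so the command runner can be swapped out.
    """
    problems = {}
    for path in paths:
        # "-" as the output file makes pdftotext write the extracted text to stdout,
        # so nothing is written to disk.
        result = run(
            [tool, str(path), "-"],
            capture_output=True,
            text=True,
        )
        # Messages such as "Syntax Error: ..." arrive on stderr; the exit code
        # may still be 0, so both are checked.
        if result.returncode != 0 or result.stderr.strip():
            problems[str(path)] = result.stderr.strip()
    return problems
```

Calling `check_pdfs(["nvs-seminarprogramm-2023-1.pdf"])` would return an empty dict if pdftotext reads the file cleanly, and otherwise a mapping from the file name to the messages it produced.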

christianbltr commented 1 year ago

This file does not cause problems in my test environment. I tested it with pdftotext versions 20.09.0 and 0.86.1.

Which version of pdftotext are you using?

markustobler commented 1 year ago

I'm using "tpwd/ke_search": "^4.5". The error occurs only when the cron job runs automatically; if I start the scheduler task manually, there is no error. I have no idea how to debug this. Somebody in the TYPO3 Slack channel told me that this could be a problem with pdftotext, but it could also be something else.
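
When a command behaves differently under cron than in a manual run, a common cause is a different environment: another user, another `PATH`, and therefore possibly another pdftotext binary. Whether that applies here is not established in this thread. The sketch below only collects the relevant details so that the two runs can be compared; `pdftotext_report` is a hypothetical helper name.

```python
import os
import shutil
import subprocess


def pdftotext_report(tool="pdftotext"):
    """Collect details that may differ between a cron run and a manual run.

    `pdftotext_report` is a hypothetical helper for this sketch.
    """
    report = {
        "user": os.environ.get("USER", ""),
        "PATH": os.environ.get("PATH", ""),
        "binary": shutil.which(tool),  # None if the tool is not found on PATH
        "version": None,
    }
    if report["binary"]:
        # pdftotext -v prints its version banner to stderr; stdout is read
        # as a fallback in case a build behaves differently.
        out = subprocess.run([report["binary"], "-v"], capture_output=True, text=True)
        report["version"] = (out.stderr or out.stdout).strip()
    return report
```

Printing the result of `pdftotext_report()` once from a shell and once from a cron entry, then comparing the two outputs, would show whether both runs use the same binary and version.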

christianbltr commented 1 year ago

Errors from pdftotext and pdfinfo will now be logged to the ke_search error log. That should at least make it easier to find the problematic files. The patch is in the current master and will be in version 4.6.0.