tomasnorre / crawler

Libraries and scripts for crawling the TYPO3 page tree. Used for re-caching, re-indexing, publishing applications etc.

Crawler does not include external files (pdf) #1057

Open zillion42 opened 3 months ago

zillion42 commented 3 months ago

Bug Report

Current Behavior The crawler builds its queue with: c:\php\php.exe C:\httpd\Apache24\htdocs\ourSite\typo3\sysext\core\bin\typo3 crawler:buildQueue 1 reindex --depth=20 --mode=queue but omits any external files. All other HTML content is queued, processed and indexed just fine. If we enable frontend indexing via 'disableFrontendIndexing' => '0' and browse a page containing PDFs (via local file collections), the PDFs are added to the queue. We're using xpdf-tools-win-4.05 (32-bit binaries); pdftotext.exe is tested and works. PDFs can be processed in the queue after they have been added through frontend indexing, and their content is then successfully found via the TYPO3 search once the queue has been processed.

Expected behavior/output Building the queue should add external files, since 'useCrawlerForExternalFiles' => '1' is enabled.

Steps to reproduce Build the queue with the settings described below, then check the queue in the table tx_crawler_queue. No external files are present.
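One way to check this programmatically is sketched below. This is only a minimal sketch, assuming a TYPO3 v11+ Doctrine DBAL API and that queued external files would show up in the serialized parameters column of tx_crawler_queue; the '%.pdf%' filter is purely illustrative.

<?php
// Sketch: count tx_crawler_queue entries that reference PDF files.
// Assumes TYPO3 v11+ (executeQuery()/fetchOne()); older versions would
// use execute()/fetchColumn() instead.
use TYPO3\CMS\Core\Database\ConnectionPool;
use TYPO3\CMS\Core\Utility\GeneralUtility;

$queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)
    ->getQueryBuilderForTable('tx_crawler_queue');

$pdfEntries = $queryBuilder
    ->count('qid')
    ->from('tx_crawler_queue')
    ->where(
        $queryBuilder->expr()->like(
            'parameters',
            $queryBuilder->createNamedParameter('%.pdf%')
        )
    )
    ->executeQuery()
    ->fetchOne();

// Expectation based on the report: 0 after crawler:buildQueue alone,
// greater than 0 once frontend indexing has added the PDFs.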

Environment

'crawler' => [
            'cleanUpOldQueueEntries' => '1',
            'cleanUpProcessedAge' => '2',
            'cleanUpScheduledAge' => '7',
            'countInARun' => '1000',
            'crawlHiddenPages' => '0',
            'enableTimeslot' => '1',
            'frontendBasePath' => '/',
            'makeDirectRequests' => '1',
            'maxCompileUrls' => '10000',
            'phpBinary' => 'php',
            'phpPath' => 'C:/php/php.exe',
            'processDebug' => '0',
            'processLimit' => '20',
            'processMaxRunTime' => '1000',
            'processVerbose' => '0',
            'purgeQueueDays' => '14',
            'sleepAfterFinish' => '0',
            'sleepTime' => '0',
        ],
'indexed_search' => [
            'catdoc' => 'C:\\httpd\\Apache24\\bin\\catdoc',
            'debugMode' => '0',
            'disableFrontendIndexing' => '1',
            'enableMetaphoneSearch' => '1',
            'flagBitMask' => '192',
            'fullTextDataLength' => '0',
            'ignoreExtensions' => '',
            'indexExternalURLs' => '0',
            'maxAge' => '0',
            'maxExternalFiles' => '250',
            'minAge' => '0',
            'pdf_mode' => '20',
            'pdftools' => 'C:\\httpd\\Apache24\\bin\\pdf2txt',
            'ppthtml' => 'C:\\httpd\\Apache24\\bin\\catdoc',
            'trackIpInStatistic' => '2',
            'unrtf' => '',
            'unzip' => '',
            'useCrawlerForExternalFiles' => '1',
            'useMysqlFulltext' => '0',
            'xlhtml' => 'C:\\httpd\\Apache24\\bin\\catdoc',
        ],
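To rule out that the configuration above is simply not applied at runtime, the effective values can be read back via TYPO3's ExtensionConfiguration API. This is a small diagnostic sketch (assuming TYPO3 9+); the extension keys and option names are the ones listed above.

<?php
// Diagnostic sketch: dump the extension configuration TYPO3 actually uses,
// e.g. from a small CLI command or an existing controller (TYPO3 9+ API).
use TYPO3\CMS\Core\Configuration\ExtensionConfiguration;
use TYPO3\CMS\Core\Utility\GeneralUtility;

$extConf = GeneralUtility::makeInstance(ExtensionConfiguration::class);

var_dump($extConf->get('indexed_search', 'useCrawlerForExternalFiles')); // expected '1'
var_dump($extConf->get('indexed_search', 'disableFrontendIndexing'));   // expected '1'
var_dump($extConf->get('indexed_search', 'pdftools'));                  // directory with the xpdf tools
var_dump($extConf->get('crawler', 'makeDirectRequests'));               // expected '1'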

Possible Solution Unfortunately, the only workaround is to enable frontend indexing with 'disableFrontendIndexing' => '0' and to add all external files to the queue manually. Not really a working solution.

Additional context

Edit: Enabling frontend indexing and browsing a page which contains external files does not immediately add those files to the queue. We found that before external files can be added to the queue, we first have to go to the indexing module in the backend and delete all previously queued content by clicking the trash icon at the top (we have to click it several times and make sure the whole page is no longer indexed). After that, reloading the page in the frontend adds the external files.

Edit2: Building the queue multiple consecutive times from the console, the info module, or the scheduler, as other people have reported, does not help.

Edit3: It might help to start with clean tables in the database. We have quite large indexing tables; unfortunately, until we mirror our current environment to a testing environment, truncating those tables is not an option. [Screenshot: indexing tables viewed over Remote Desktop]

Edit4: We recently installed an HTTPS certificate for our site and had trouble building or processing anything. This was resolved by using "Protocol for crawling - Force HTTPS for all pages" and setting the correct Base URL in the crawling configuration on PID 1. [Screenshot: crawling configuration]

Edit5: This is how the file path backslashes are escaped on Windows, screenshot taken directly from the tx_crawler_queue table in HeidiSQL. Maybe there is some problem with all the backslash escaping that only occurs on Windows? [Screenshot: tx_crawler_queue in HeidiSQL]

Edit6: While processing PDFs (manually added) we often get the following output:

C:\Windows\system32>c:\php\php.exe C:\httpd\Apache24\htdocs\ourSite\typo3\sysext\core\bin\typo3 crawler:processQueue --amount=1000
Cannot load charset cp1251 - file not found
Cannot load charset cp1251 - file not found
Cannot load charset cp1251 - file not found
Cannot load charset cp1251 - file not found
Cannot load charset cp1251 - file not found
Unprocessed Items remaining:1570 (67c43d15a5)
3

Edit7: Also this can occur:

C:\Windows\system32>c:\php\php.exe C:\httpd\Apache24\htdocs\ourSite\typo3\sysext\core\bin\typo3 crawler:processQueue --amount=1000
<warning>Doctrine\DBAL\Exception\UniqueConstraintViolationException: An exception occurred while executing 'INSERT INTO `index_words` (`wid`, `baseword`, `metaphone`) VALUES (?, ?, ?)' with params [207539845, "ziel\/e", "122391892"]:

Duplicate entry '207539845' for key 'PRIMARY'</warning>
Unprocessed Items remaining:1002 (5ed3e4899a)
5

Edit8: Knowing what I know now, I have just indexed 1277 PDFs (2554 queue entries, because they were queued twice, I don't know why), all searchable by content, going back all the way to the year 2016. It is very unfortunate that the crawler cannot do what can be done manually, which makes it unreliable and impractical to use.

We have quite a big site with almost daily edits, so it would be really great if we could figure out this problem.

tomasnorre commented 3 months ago

Thanks for reporting this. PDF issues are often a matter of configuration, so they are hard to debug.

Could you check if this is reproducible with the Crawler Devbox? https://github.com/tomasnorre/crawler/blob/main/CONTRIBUTING.md#devbox

zillion42 commented 3 months ago

Hi again,

I understand time is valuable. As you have probably already noticed, we are doing this for our work. I very much doubt that trying to reproduce the issue on a Linux container with ddev is going to help us. I also guess sponsoring you with a one-time payment of 25€ is not going to cut it.

We are exploring a few other options, like ke_search.

If, and that's a big IF, we could support you by paying you for your support, I see a few options here:

Would be really great if there would be another way to contact you, other than via Github.

tomasnorre commented 3 months ago

Hi @zillion42

I cannot take on work at the moment due to personal reasons, but you could ask in the #TYPO3 #Crawler chat on Slack. Perhaps someone there could help you better. I can click the release button if a fix is provided, but I cannot do much more currently.

https://typo3.org/community/meet/chat-slack

You can also contact me via Slack, but expect a slow response time, for the same reasons as above.

tomasnorre commented 1 month ago

Hi @zillion42

I know it's been a while, but I have now tested this in the Crawler devbox (ddev).

If I don't add other content to the page with the PDFs, they don't get indexed, so there needs to be additional text on the page, not just a header and links.

Try to see if that changes anything for you. The pages and PDFs are indexed correctly in my setup.