tomasnorre / crawler

Libraries and scripts for crawling the TYPO3 page tree. Used for re-caching, re-indexing, publishing applications etc.

Crawler does not include external files (pdf) #1057

Open zillion42 opened 3 months ago

zillion42 commented 3 months ago

Bug Report

Current Behavior The crawler builds its queue with: c:\php\php.exe C:\httpd\Apache24\htdocs\ourSite\typo3\sysext\core\bin\typo3 crawler:buildQueue 1 reindex --depth=20 --mode=queue but omits any external files. All other HTML content is queued, processed and indexed just fine. If we enable frontend indexing via 'disableFrontendIndexing' => '0' and browse a page containing PDFs (via local file collections), the PDFs are added to the queue. We're using xpdf-tools-win-4.05 (32-bit binaries); pdftotext.exe is tested and works. PDFs can be processed in the queue after they have been added through frontend indexing, and their content is then successfully found via the TYPO3 search once the queue has been processed.

Expected behavior/output Building the queue should add external files, since 'useCrawlerForExternalFiles' => '1' is enabled.

Steps to reproduce Build the queue with the settings described below, then check the queue in the table tx_crawler_queue. No external files are present.
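One way to check this programmatically is sketched below. This is only a minimal sketch, assuming a TYPO3 v11+ Doctrine DBAL API and that queued external files would show up in the serialized parameters column of tx_crawler_queue; the '%.pdf%' filter is purely illustrative.

<?php
// Sketch: count tx_crawler_queue entries that reference PDF files.
// Assumes TYPO3 v11+ (executeQuery()/fetchOne()); older versions would
// use execute()/fetchColumn() instead.
use TYPO3\CMS\Core\Database\ConnectionPool;
use TYPO3\CMS\Core\Utility\GeneralUtility;

$queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)
    ->getQueryBuilderForTable('tx_crawler_queue');

$pdfEntries = $queryBuilder
    ->count('qid')
    ->from('tx_crawler_queue')
    ->where(
        $queryBuilder->expr()->like(
            'parameters',
            $queryBuilder->createNamedParameter('%.pdf%')
        )
    )
    ->executeQuery()
    ->fetchOne();

// Expectation based on the report: 0 after crawler:buildQueue alone,
// greater than 0 once frontend indexing has added the PDFs.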

Environment

'crawler' => [
            'cleanUpOldQueueEntries' => '1',
            'cleanUpProcessedAge' => '2',
            'cleanUpScheduledAge' => '7',
            'countInARun' => '1000',
            'crawlHiddenPages' => '0',
            'enableTimeslot' => '1',
            'frontendBasePath' => '/',
            'makeDirectRequests' => '1',
            'maxCompileUrls' => '10000',
            'phpBinary' => 'php',
            'phpPath' => 'C:/php/php.exe',
            'processDebug' => '0',
            'processLimit' => '20',
            'processMaxRunTime' => '1000',
            'processVerbose' => '0',
            'purgeQueueDays' => '14',
            'sleepAfterFinish' => '0',
            'sleepTime' => '0',
        ],
'indexed_search' => [
            'catdoc' => 'C:\\httpd\\Apache24\\bin\\catdoc',
            'debugMode' => '0',
            'disableFrontendIndexing' => '1',
            'enableMetaphoneSearch' => '1',
            'flagBitMask' => '192',
            'fullTextDataLength' => '0',
            'ignoreExtensions' => '',
            'indexExternalURLs' => '0',
            'maxAge' => '0',
            'maxExternalFiles' => '250',
            'minAge' => '0',
            'pdf_mode' => '20',
            'pdftools' => 'C:\\httpd\\Apache24\\bin\\pdf2txt',
            'ppthtml' => 'C:\\httpd\\Apache24\\bin\\catdoc',
            'trackIpInStatistic' => '2',
            'unrtf' => '',
            'unzip' => '',
            'useCrawlerForExternalFiles' => '1',
            'useMysqlFulltext' => '0',
            'xlhtml' => 'C:\\httpd\\Apache24\\bin\\catdoc',
        ],
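To rule out that the configuration above is simply not applied at runtime, the effective values can be read back via TYPO3's ExtensionConfiguration API. This is a small diagnostic sketch (assuming TYPO3 9+); the extension keys and option names are the ones listed above.

<?php
// Diagnostic sketch: dump the extension configuration TYPO3 actually uses,
// e.g. from a small CLI command or an existing controller (TYPO3 9+ API).
use TYPO3\CMS\Core\Configuration\ExtensionConfiguration;
use TYPO3\CMS\Core\Utility\GeneralUtility;

$extConf = GeneralUtility::makeInstance(ExtensionConfiguration::class);

var_dump($extConf->get('indexed_search', 'useCrawlerForExternalFiles')); // expected '1'
var_dump($extConf->get('indexed_search', 'disableFrontendIndexing'));   // expected '1'
var_dump($extConf->get('indexed_search', 'pdftools'));                  // directory with the xpdf tools
var_dump($extConf->get('crawler', 'makeDirectRequests'));               // expected '1'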

Possible Solution Unfortunately, the only workaround is to enable frontend indexing with 'disableFrontendIndexing' => '0' and to add all external files to the queue manually. Not really a working solution.

Additional context

Edit: Enabling frontend indexing and browsing a page which contains external files does not immediately add those files to the queue. We found that before external files can be added to the queue, we first have to go to the indexing module in the backend and delete all previously queued content by clicking the trash icon at the top (we have to click it several times and make sure the whole page is no longer indexed). After that, reloading the page in the frontend adds the external files.

Edit2: Building the queue multiple consecutive times from the console, the info module, or the scheduler, as other people have reported, does not help.

Edit3: It might help to start with clean tables in the database. We have quite large indexing tables; unfortunately, until we mirror our current environment to a testing environment, truncating those tables is not an option. [Screenshot: indexing tables viewed over Remote Desktop]

Edit4: We recently installed an HTTPS certificate for our site and had trouble building or processing anything. This was resolved by using "Protocol for crawling - Force HTTPS for all pages" and setting the correct Base URL in the crawling configuration on PID 1. [Screenshot: crawling configuration]

Edit5: This is how the file path backslashes are escaped on Windows, screenshot taken directly from the tx_crawler_queue table in HeidiSQL. Maybe there is some problem with all the backslash escaping that only occurs on Windows? [Screenshot: tx_crawler_queue in HeidiSQL]

Edit6: While processing PDFs (manually added) we often get the following output:

C:\Windows\system32>c:\php\php.exe C:\httpd\Apache24\htdocs\ourSite\typo3\sysext\core\bin\typo3 crawler:processQueue --amount=1000
Cannot load charset cp1251 - file not found
Cannot load charset cp1251 - file not found
Cannot load charset cp1251 - file not found
Cannot load charset cp1251 - file not found
Cannot load charset cp1251 - file not found
Unprocessed Items remaining:1570 (67c43d15a5)
3

Edit7: Also this can occur:

C:\Windows\system32>c:\php\php.exe C:\httpd\Apache24\htdocs\ourSite\typo3\sysext\core\bin\typo3 crawler:processQueue --amount=1000
<warning>Doctrine\DBAL\Exception\UniqueConstraintViolationException: An exception occurred while executing 'INSERT INTO `index_words` (`wid`, `baseword`, `metaphone`) VALUES (?, ?, ?)' with params [207539845, "ziel\/e", "122391892"]:

Duplicate entry '207539845' for key 'PRIMARY'</warning>
Unprocessed Items remaining:1002 (5ed3e4899a)
5

Edit8: Knowing what I know now, I have just indexed 1277 PDFs (2554 queue entries, because they were queued twice, I don't know why), all searchable by content, going back all the way to the year 2016. It is very unfortunate that the crawler cannot do what can be done manually, which makes it unreliable and impractical to use.

We have quite a big site with almost daily edits, so it would be really great if we could figure out this problem.

tomasnorre commented 3 months ago

Thanks for reporting this. PDF issues are often a matter of configuration, so they are hard to debug.

Could you check if this is reproducible with the Crawler Devbox? https://github.com/tomasnorre/crawler/blob/main/CONTRIBUTING.md#devbox

zillion42 commented 3 months ago

Hi again,

I understand time is valuable. As you have probably already noticed, we are doing this for our work. I very much doubt that trying to reproduce the issue on a Linux container with ddev is going to help us. I also guess sponsoring you with a one-time payment of 25€ is not going to cut it.

We are exploring a few other options, like ke_search.

If, and that's a big IF, we could support you by paying you for your support, I see a few options here:

Would be really great if there would be another way to contact you, other than via Github.

tomasnorre commented 3 months ago

Hi @zillion42

I cannot take on work at the moment due to personal reasons, but you could ask in the #TYPO3 #Crawler chat on Slack. Perhaps someone there could help you better. I can click the release button if a fix is provided, but I cannot do much more currently.

https://typo3.org/community/meet/chat-slack

You can also contact me via Slack, but expect a slow response time, for the same reasons as above.

tomasnorre commented 1 month ago

Hi @zillion42

I know it's been a while, but I have now tested this in the Crawler devbox (ddev).

If I don't add other content to the page with the PDFs, they don't get indexed, so there needs to be additional text on the page, not just a header and links.

Try to see if that changes anything for you. The pages and PDFs are indexed correctly in my setup.