nextcloud / fulltextsearch

🔍 Core of the full-text search framework for Nextcloud
https://apps.nextcloud.com/apps/fulltextsearch
GNU Affero General Public License v3.0

Raw text files are not indexed #571

Open Adspectus opened 4 years ago

Adspectus commented 4 years ago

I have several raw/plain text files, containing either plain text or LaTeX source code. None of these files show up when I search for a word that they contain. Only PDF and office files are found. Did I miss something in the configuration? I have no external data storage and no encryption.

ArtificialOwl commented 4 years ago

Please paste the output of:

./occ fulltextsearch:check
./occ fulltextsearch:test

Adspectus commented 4 years ago

When I went to do so, I noticed that Elasticsearch was no longer running on my server. A restart failed, possibly because of this error message: "Plugin [ingest-attachment] was built for Elasticsearch version 7.6.1 but version 7.8.0 is running". I removed the plugin, and after that Elasticsearch could be started. Now the output of the first command:

Full text search 1.4.1

- Search Platform:
Elasticsearch 1.5.1
{
    "elastic_host": [
        "http://localhost:9200"
    ],
    "elastic_index": "nextcloud",
    "fields_limit": "10000",
    "es_ver_below66": "0",
    "analyzer_tokenizer": "standard"
} 

- Content Providers:
Files 1.4.2
{
    "files_local": "1",
    "files_external": "2",
    "files_group_folders": "0",
    "files_encrypted": "0",
    "files_federated": "0",
    "files_size": "20",
    "files_pdf": "1",
    "files_office": "1",
    "files_image": "0",
    "files_audio": "0",
    "files_fulltextsearch_tesseract": {
        "version": "1.4.1",
        "enabled": "0",
        "psm": "4",
        "lang": "eng",
        "pdf": "0",
        "pdf_limit": "0"
    }
}

And the output of the second command:

.Testing your current setup:  
Creating mocked content provider. ok  
Testing mocked provider: get indexable documents. (2 items) ok  
Loading search platform. (Elasticsearch) ok  
Testing search platform. ok  
Locking process ok  
Removing test. ok  
Pausing 3 seconds 1 2 3 ok  
Initializing index mapping. ok  
Indexing generated documents. ok  
Pausing 3 seconds 1 2 3 ok  
Retreiving content from a big index (license). (size: 32386) ok  
Comparing document with source. ok  
Searching basic keywords:  
 - 'test' An unhandled exception has been thrown:
TypeError: Return value of OCA\FullTextSearch\Model\SearchRequest::getProviders() must be of the type array, null returned in /srv/www/vhosts/spicycloud.de/apps/fulltextsearch/lib/Model/SearchRequest.php:114
Stack trace:
#0 /srv/www/vhosts/spicycloud.de/apps/fulltextsearch/lib/Model/SearchRequest.php(700): OCA\FullTextSearch\Model\SearchRequest->getProviders()
#1 [internal function]: OCA\FullTextSearch\Model\SearchRequest->jsonSerialize()
#2 /srv/www/vhosts/spicycloud.de/apps/fulltextsearch_elasticsearch/lib/Service/SearchService.php(103): json_encode(Object(OCA\FullTextSearch\Model\SearchRequest))
#3 /srv/www/vhosts/spicycloud.de/apps/fulltextsearch_elasticsearch/lib/Platform/ElasticSearchPlatform.php(336): OCA\FullTextSearch_ElasticSearch\Service\SearchService->searchRequest(Object(Elasticsearch\Client), Object(OCA\FullTextSearch\Model\SearchResult), Object(OC\FullTextSearch\Model\DocumentAccess))
#4 /srv/www/vhosts/spicycloud.de/apps/fulltextsearch/lib/Command/Test.php(584): OCA\FullTextSearch_ElasticSearch\Platform\ElasticSearchPlatform->searchRequest(Object(OCA\FullTextSearch\Model\SearchResult), Object(OC\FullTextSearch\Model\DocumentAccess))
#5 /srv/www/vhosts/spicycloud.de/apps/fulltextsearch/lib/Command/Test.php(436): OCA\FullTextSearch\Command\Test->search(Object(Symfony\Component\Console\Output\ConsoleOutput), Object(OCA\FullTextSearch_ElasticSearch\Platform\ElasticSearchPlatform), Object(OCA\FullTextSearch\Provider\TestProvider), Object(OC\FullTextSearch\Model\DocumentAccess), 'test', Array)
#6 /srv/www/vhosts/spicycloud.de/apps/fulltextsearch/lib/Command/Test.php(171): OCA\FullTextSearch\Command\Test->testSearchSimple(Object(Symfony\Component\Console\Output\ConsoleOutput), Object(OCA\FullTextSearch_ElasticSearch\Platform\ElasticSearchPlatform), Object(OCA\FullTextSearch\Provider\TestProvider))
#7 /srv/www/vhosts/spicycloud.de/3rdparty/symfony/console/Command/Command.php(255): OCA\FullTextSearch\Command\Test->execute(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#8 /srv/www/vhosts/spicycloud.de/core/Command/Base.php(168): Symfony\Component\Console\Command\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#9 /srv/www/vhosts/spicycloud.de/3rdparty/symfony/console/Application.php(915): OC\Core\Command\Base->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#10 /srv/www/vhosts/spicycloud.de/3rdparty/symfony/console/Application.php(272): Symfony\Component\Console\Application->doRunCommand(Object(OCA\FullTextSearch\Command\Test), Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#11 /srv/www/vhosts/spicycloud.de/3rdparty/symfony/console/Application.php(148): Symfony\Component\Console\Application->doRun(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#12 /srv/www/vhosts/spicycloud.de/lib/private/Console/Application.php(214): Symfony\Component\Console\Application->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#13 /srv/www/vhosts/spicycloud.de/console.php(99): OC\Console\Application->run()
#14 /srv/www/vhosts/spicycloud.de/occ(11): require_once('/srv/www/vhosts...')
#15 {main}

theroch commented 4 years ago

You have to remove and reinstall the ingest plugin after every update of Elasticsearch. On Ubuntu, run as root:

/usr/share/elasticsearch/bin/elasticsearch-plugin remove ingest-attachment && /usr/share/elasticsearch/bin/elasticsearch-plugin install ingest-attachment
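To decide whether a reinstall is actually due, the startup error quoted earlier in the thread can be checked mechanically. The following is only a rough sketch: the function names `plugin_versions` and `needs_reinstall` are made up for illustration, and it assumes the error message has exactly the wording shown above.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Elasticsearch or Nextcloud).

# Prints "<built-for> <running>" when $1 looks like the plugin version
# mismatch message quoted in this thread; prints nothing otherwise.
plugin_versions() {
    printf '%s\n' "$1" |
        sed -n 's/.*was built for Elasticsearch version \([0-9.]*\) but version \([0-9.]*\) is running.*/\1 \2/p'
}

# Succeeds (exit status 0) only when the message names two different versions.
needs_reinstall() {
    set -- $(plugin_versions "$1")
    [ "$#" -eq 2 ] && [ "$1" != "$2" ]
}
```

If `needs_reinstall "$message"` succeeds for the line found in the Elasticsearch log, the remove and install commands above are the next step.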

Adspectus commented 4 years ago

Thanks, I have reinstalled it now. The output of the first command is the same, but the second now shows this:

.Testing your current setup:  
Creating mocked content provider. ok  
Testing mocked provider: get indexable documents. (2 items) ok  
Loading search platform. (Elasticsearch) ok  
Testing search platform. ok  
Locking process fail 
In RunningService.php line 86:

  Index is already running  

fulltextsearch:test [--output [OUTPUT]] [-j|--json] [-d|--platform_delay PLATFORM_DELAY]

theroch commented 4 years ago

ps aux | grep php

Then kill all running PHP processes that are related to the indexing process.
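One way to pick out the relevant processes is to look for command lines that mention both `occ` and a `fulltextsearch:` command. This is a sketch under the assumption that the index was started from the command line (for example with `occ fulltextsearch:index`); the function name `index_pids` is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter: reads `ps aux` output on stdin and prints the PID
# (second column) of every line that mentions occ and a fulltextsearch:
# command. Lines containing "awk" are skipped so the filter does not
# report itself.
index_pids() {
    awk '/occ/ && /fulltextsearch:/ && !/awk/ { print $2 }'
}
```

It would be used as `ps aux | index_pids`. Plain `php-fpm` master and pool processes do not match, so they are left alone.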

Adspectus commented 4 years ago

Hm, how can I tell?

root      9031  0.0  0.1 518224 19428 ?        Ss   Jul11   0:24 php-fpm: master process (/etc/php/7.3/fpm/php-fpm.conf)
www-data  9032  0.0  0.0 518224  6868 ?        S    Jul11   0:00 php-fpm: pool www
www-data  9033  0.0  0.0 518224  6868 ?        S    Jul11   0:00 php-fpm: pool www
root     31407  0.0  0.0  15480   956 pts/0    S+   17:14   0:00 grep php

Adspectus commented 4 years ago

OK, I just waited for a while. Now I could run the test command again, but the output seems to be the same as before:

.Testing your current setup:  
Creating mocked content provider. ok  
Testing mocked provider: get indexable documents. (2 items) ok  
Loading search platform. (Elasticsearch) ok  
Testing search platform. ok  
Locking process ok  
Removing test. ok  
Pausing 3 seconds 1 2 3 ok  
Initializing index mapping. ok  
Indexing generated documents. ok  
Pausing 3 seconds 1 2 3 ok  
Retreiving content from a big index (license). (size: 32386) ok  
Comparing document with source. ok  
Searching basic keywords:  
 - 'test' An unhandled exception has been thrown:
TypeError: Return value of OCA\FullTextSearch\Model\SearchRequest::getProviders() must be of the type array, null returned in /srv/www/vhosts/spicycloud.de/apps/fulltextsearch/lib/Model/SearchRequest.php:114
Stack trace:
#0 /srv/www/vhosts/spicycloud.de/apps/fulltextsearch/lib/Model/SearchRequest.php(700): OCA\FullTextSearch\Model\SearchRequest->getProviders()
#1 [internal function]: OCA\FullTextSearch\Model\SearchRequest->jsonSerialize()
#2 /srv/www/vhosts/spicycloud.de/apps/fulltextsearch_elasticsearch/lib/Service/SearchService.php(103): json_encode(Object(OCA\FullTextSearch\Model\SearchRequest))
#3 /srv/www/vhosts/spicycloud.de/apps/fulltextsearch_elasticsearch/lib/Platform/ElasticSearchPlatform.php(336): OCA\FullTextSearch_ElasticSearch\Service\SearchService->searchRequest(Object(Elasticsearch\Client), Object(OCA\FullTextSearch\Model\SearchResult), Object(OC\FullTextSearch\Model\DocumentAccess))
#4 /srv/www/vhosts/spicycloud.de/apps/fulltextsearch/lib/Command/Test.php(584): OCA\FullTextSearch_ElasticSearch\Platform\ElasticSearchPlatform->searchRequest(Object(OCA\FullTextSearch\Model\SearchResult), Object(OC\FullTextSearch\Model\DocumentAccess))
#5 /srv/www/vhosts/spicycloud.de/apps/fulltextsearch/lib/Command/Test.php(436): OCA\FullTextSearch\Command\Test->search(Object(Symfony\Component\Console\Output\ConsoleOutput), Object(OCA\FullTextSearch_ElasticSearch\Platform\ElasticSearchPlatform), Object(OCA\FullTextSearch\Provider\TestProvider), Object(OC\FullTextSearch\Model\DocumentAccess), 'test', Array)
#6 /srv/www/vhosts/spicycloud.de/apps/fulltextsearch/lib/Command/Test.php(171): OCA\FullTextSearch\Command\Test->testSearchSimple(Object(Symfony\Component\Console\Output\ConsoleOutput), Object(OCA\FullTextSearch_ElasticSearch\Platform\ElasticSearchPlatform), Object(OCA\FullTextSearch\Provider\TestProvider))
#7 /srv/www/vhosts/spicycloud.de/3rdparty/symfony/console/Command/Command.php(255): OCA\FullTextSearch\Command\Test->execute(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#8 /srv/www/vhosts/spicycloud.de/core/Command/Base.php(168): Symfony\Component\Console\Command\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#9 /srv/www/vhosts/spicycloud.de/3rdparty/symfony/console/Application.php(915): OC\Core\Command\Base->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#10 /srv/www/vhosts/spicycloud.de/3rdparty/symfony/console/Application.php(272): Symfony\Component\Console\Application->doRunCommand(Object(OCA\FullTextSearch\Command\Test), Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#11 /srv/www/vhosts/spicycloud.de/3rdparty/symfony/console/Application.php(148): Symfony\Component\Console\Application->doRun(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#12 /srv/www/vhosts/spicycloud.de/lib/private/Console/Application.php(214): Symfony\Component\Console\Application->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#13 /srv/www/vhosts/spicycloud.de/console.php(99): OC\Console\Application->run()
#14 /srv/www/vhosts/spicycloud.de/occ(11): require_once('/srv/www/vhosts...')
#15 {main}

Adspectus commented 4 years ago

I waited even longer, and now the test command shows:

.Testing your current setup:  
Creating mocked content provider. ok  
Testing mocked provider: get indexable documents. (2 items) ok  
Loading search platform. (Elasticsearch) ok  
Testing search platform. ok  
Locking process ok  
Removing test. ok  
Pausing 3 seconds 1 2 3 ok  
Initializing index mapping. ok  
Indexing generated documents. ok  
Pausing 3 seconds 1 2 3 ok  
Retreiving content from a big index (license). (size: 32386) ok  
Comparing document with source. ok  
Searching basic keywords:  
 - 'test' (result: 1, expected: ["simple"]) ok  
 - 'document is a simple test' (result: 2, expected: ["simple","license"]) ok  
 - '"document is a test"' (result: 0, expected: []) ok  
 - '"document is a simple test"' (result: 1, expected: ["simple"]) ok  
 - 'document is a simple -test' (result: 1, expected: ["license"]) ok  
 - 'document is a simple +test' (result: 1, expected: ["simple"]) ok  
 - '-document is a simple test' (result: 0, expected: []) ok  
 - 'document is a simple +test +testing' (result: 1, expected: ["simple"]) ok  
 - 'document is a simple +test -testing' (result: 0, expected: []) ok  
 - 'document is a +simple -test -testing' (result: 0, expected: []) ok  
 - '+document is a simple -test -testing' (result: 1, expected: ["license"]) ok  
 - 'document is a +simple -license +testing' (result: 1, expected: ["simple"]) ok  
Updating documents access. ok  
Pausing 3 seconds 1 2 3 ok  
Searching with group access rights:  
 - 'license' - [] -  (result: 0, expected: []) ok  
 - 'license' - ["group_1"] -  (result: 1, expected: ["license"]) ok  
 - 'license' - ["group_1","group_2"] -  (result: 1, expected: ["license"]) ok  
 - 'license' - ["group_3","group_2"] -  (result: 1, expected: ["license"]) ok  
 - 'license' - ["group_3"] -  (result: 0, expected: []) ok  
Searching with share rights:  
 - 'license' - notuser -  (result: 0, expected: []) ok  
 - 'license' - user2 -  (result: 1, expected: ["license"]) ok  
 - 'license' - user3 -  (result: 1, expected: ["license"]) ok  
Removing test. ok  
Unlocking process ok  

Now words are found in .txt files, but not in .tex files, which are far more important, since I write a lot in LaTeX.

ArtificialOwl commented 4 years ago

Can you send me an example of one of the files that are not indexed? maxence@nextcloud.com

Adspectus commented 4 years ago

Sure, I will send you a .tex file and the generated .pdf. When I search for a word in it, the results show only the PDF.
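To narrow down whether the .tex content never reaches the index or is only missing from the Nextcloud results, it may help to ask Elasticsearch directly. The sketch below only builds the URL for a URI search; the host and index name are taken from the `fulltextsearch:check` output above, while the helper name `search_url` and the example word are assumptions.

```shell
#!/bin/sh
# Values copied from the "elastic_host" and "elastic_index" settings above.
ES_HOST="http://localhost:9200"
ES_INDEX="nextcloud"

# Hypothetical helper: prints the URI-search URL for a single word.
# $1 must be one plain word (no spaces or characters that need URL encoding).
search_url() {
    printf '%s/%s/_search?q=%s&pretty\n' "$ES_HOST" "$ES_INDEX" "$1"
}
```

A call such as `curl -s "$(search_url someword)"` then returns the raw hits as JSON. These are not filtered by Nextcloud access rights, so they need not match what the web interface shows.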