nextcloud / fulltextsearch_elasticsearch

🔍 Use Elasticsearch to index the content of your Nextcloud
https://apps.nextcloud.com/apps/fulltextsearch_elasticsearch
GNU Affero General Public License v3.0

Again about reindexing #269

Open ostasevych opened 1 year ago

ostasevych commented 1 year ago

Hi! I am asking for help. For almost a week, since moving from a server-installed Elasticsearch to a dockerised one, I have been struggling to get full text search working on my instance.

I reindexed with the following commands:

$ sudo -u www-data php /var/www/nextcloud/occ fulltextsearch:stop
$ curl -X DELETE localhost:9200/my-index
$ sudo -u www-data php /var/www/nextcloud/occ fulltextsearch:reset
$ sudo -u www-data php /var/www/nextcloud/occ fulltextsearch:index

Searching with curl in the index "my-index" for the word "Ivanova" (one of our members) gives the following (user data have been stripped and anonymised):

$ curl -X GET "localhost:9200/my-index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "title": "Ivanova"
    }
  }
}
'
{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 6,
      "relation" : "eq"
    },
    "max_score" : 9.4341955,
    "hits" : [
      {
        "_index" : "my-index",
        "_id" : "files:6811",
        "_score" : 9.4341955,
        "_ignored" : [
          "content.keyword"
        ],
        "_source" : {
          "owner" : "user1",
          "groups" : [
            "GroupA",
            "GroupB",
            "GroupC",
            "GroupD"
          ],
          "circles" : [ ],
          "metatags" : [
            "files_group_folders"
          ],
          "source" : "files_group_folders",
          "title" : "GroupA/Scanned/Ivanova.pdf",
          "users" : [ ],
          "content" : "Ivanova document",
          "tags" : [ ],
          "attachment" : {
            "date" : "2022-04-06T06:29:11Z",
            "content_type" : "application/pdf",
            "format" : "application/pdf; version=\"A-2b\"",
            "modified" : "2022-04-06T06:29:11Z",
            "language" : "uk",
            "creator_tool" : "ABBYY FineReader 14",
            "content_length" : 1612
          },
          "provider" : "files",
          "subtags" : [ ],
          "parts" : {
            "comments" : ""
          },
          "links" : [ ],
          "share_names" : {
            "user0" : "GroupA/Scanned/Ivanov.pdf",
            "user1" : "Scanned/Ivanov.pdf",
            "user2" : "GroupA/Scanned/Ivanov.pdf",
            "user3" : ""
          },
          "hash" : "aeca335860b2f59954q5e7fd34b174a1"
        }
      },
      {
        "_index" : "my-index",
        "_id" : "files:6812",
        "_score" : 9.4341955,
        "_ignored" : [
          "content.keyword"
        ],
        "_source" : {
          "owner" : "user0",
          "groups" : [
            "GroupA",
            "GroupB",
            "GroupC",
            "GroupD"
          ],
          "circles" : [ ],
          "metatags" : [
            "files_group_folders"
          ],
          "source" : "files_group_folders",
          "title" : "GroupA/Scanned/Ivanova2.pdf",
          "users" : [ ],
          "content" : "Ivanova........................................................... ...................",
          "tags" : [ ],
          "attachment" : {
            "date" : "2022-04-01T08:00:50Z",
            "content_type" : "application/pdf",
            "format" : "application/pdf; version=\"A-2b\"",
            "modified" : "2022-04-01T08:00:50Z",
            "language" : "uk",
            "creator_tool" : "ABBYY FineReader 14",
            "content_length" : 1260
          },
          "provider" : "files",
          "subtags" : [ ],
          "parts" : {
            "comments" : ""
          },
          "links" : [ ],
          "share_names" : {
            "user0" : "GroupA/Scanned/Ivanova2.pdf",
            "user1" : "Scanned/Ivanova2.pdf",
            "user2" : "GroupA/Scanned/Ivanova2.pdf",
            "user3" : ""
          },
          "hash" : "e09e889376ebe62b907b8023f37d21a9"
        }
      },
      {
        "_index" : "my-index",
        "_id" : "files:1576",
        "_score" : 9.30954,
        "_source" : {
          "owner" : "user0",
          "groups" : [
            "GroupA",
            "GroupB",
            "GroupC",
            "GroupD"
          ],
          "circles" : [ ],
          "metatags" : [
            "files_group_folders"
          ],
          "source" : "files_group_folders",
          "title" : "GroupA/Scanned/Ivaniva CV.pdf",
          "users" : [ ],
          "content" : "",
          "tags" : [ ],
          "attachment" : {
            "date" : "2022-06-02T07:57:28Z",
            "keywords" : "Scanned image",
            "content_type" : "application/pdf",
            "author" : "NAPS2",
            "format" : "application/pdf; version=1.4",
            "modified" : "2022-06-02T07:57:28Z",
            "language" : "lt",
            "title" : "Scanned image",
            "creator_tool" : "NAPS2",
            "content_length" : 4
          },
          "provider" : "files",
          "subtags" : [ ],
          "parts" : {
            "comments" : ""
          },
          "links" : [ ],
          "share_names" : {
            "user0" : "GroupA/Scanned/Ivanova CV.pdf",
            "user5" : "GroupA/Scanned/Ivanova CV.pdf",
            "user3" : ""
          },
          "hash" : "53f4a218a7ac0f31648efc0834c35199"
        }
      },
      {
        "_index" : "my-index",
        "_id" : "files:518673",
        "_score" : 8.218001,
        "_ignored" : [
          "content.keyword"
        ],
        "_source" : {
          "owner" : "user0",
          "groups" : [
            "GroupD"
          ],
          "circles" : [ ],
          "metatags" : [
            "files_group_folders"
          ],
          "source" : "files_group_folders",
          "title" : "GroupA/Scanned/Ivanova Bulletin.docx",
          "users" : [
            "user8"
          ],
          "content" : "Ivanova Bulletin text text",
          "tags" : [ ],
          "attachment" : {
            "date" : "2023-06-13T09:19:00Z",
            "content_type" : "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            "author" : "Administrator",
            "modifier" : "User6",
            "modified" : "2023-06-14T13:32:00Z",
            "language" : "uk",
            "content_length" : 1701,
            "print_date" : "2023-06-14T13:30:00Z"
          },
          "provider" : "files",
          "subtags" : [ ],
          "parts" : {
            "comments" : ""
          },
          "links" : [ ],
          "share_names" : {
            "user0" : "GroupA/Bulls/Ivanova Bull.docx",
            "user9" : "GroupA/Bulls/Ivanova Bull.docx"
          },
          "hash" : "99a498dbb1db6f68a6be3793e30a9476"
        }
      },
      {
        "_index" : "my-index",
        "_id" : "files:6806",
        "_score" : 7.622202,
        "_source" : {
          "share_names" : {
            "user0" : "GroupA/Scanned/Photos/Ivanova",
            "user2" : "Scanned/Photos/Ivanova",
            "user3" : ""
          },
          "owner" : "user0",
          "users" : [ ],
          "groups" : [
            "GroupA",
            "GroupB",
            "GroupC",
            "GroupD"
          ],
          "circles" : [ ],
          "links" : [ ],
          "metatags" : [
            "files_group_folders"
          ],
          "subtags" : [ ],
          "tags" : [ ],
          "hash" : "",
          "provider" : "files",
          "source" : "files_group_folders",
          "title" : "GroupA/Scanned/Ivanova",
          "parts" : [ ],
          "content" : ""
        }
      },
      {
        "_index" : "my-index",
        "_id" : "files:466569",
        "_score" : 5.747198,
        "_ignored" : [
          "content.keyword"
        ],
        "_source" : {
          "owner" : "user0",
          "groups" : [
            "GroupD"
          ],
          "circles" : [ ],
          "metatags" : [
            "files_group_folders"
          ],
          "source" : "files_group_folders",
          "title" : "GroupA/Applications/Ivanova.docx",
          "users" : [
            "user4",
            "user5",
            "user6",
            "user7"
          ],
          "content" : "Application of Ivanova text text text",
          "tags" : [ ],
          "attachment" : {
            "date" : "2023-03-10T12:25:00Z",
            "content_type" : "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            "author" : "user5",
            "modifier" : "user10",
            "modified" : "2023-05-17T17:55:09Z",
            "language" : "uk",
            "content_length" : 1550
          },
          "provider" : "files",
          "subtags" : [ ],
          "parts" : {
            "comments" : ""
          },
          "links" : [ ],
          "share_names" : {
            "user0" : "GroupA/Applications/Ivanova.docx",
            "user6" : "GroupA/Applications/Ivanova.docx",
            "user3" : ""
          },
          "hash" : "f8c12747f9b240ed696385d2aff7f0fe"
        }
      }
    ]
  }
}

Next, I search for Ivanova on behalf of user0:

$ sudo -u www-data php /var/www/nextcloud/occ fulltextsearch:search user0 Ivanova
search
> Files
 - 518673 score:0
 - 480179 score:0
 - 514585 score:0
 - 527182 score:0
 - 531276 score:0
 - 363692 score:0

The documents are stored in a group folder that is accessible to user0, user1, user2, user3, and user10, but this is what I get when searching as other users through the NC fulltextsearch app:

$ sudo -u www-data php /var/www/nextcloud/occ fulltextsearch:search user10 Ivanova
search
> Files
$ sudo -u www-data php /var/www/nextcloud/occ fulltextsearch:search user3 Ivanova
search
> Files
 - 97091 score:0

Obviously, users in the web UI cannot see the proper results either.
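
One way to make the mismatch concrete is to compare the document IDs returned by Elasticsearch with those returned by occ for the same term. The sketch below is only an illustration in Python: the function names are made up here, and it assumes the occ output keeps the " - <id> score:<n>" shape shown above and that ES document IDs carry a "files:" prefix.

```python
import re

def es_file_ids(response: dict) -> set:
    """Collect numeric file IDs from an ES _search response (IDs look like 'files:6811')."""
    ids = set()
    for hit in response.get("hits", {}).get("hits", []):
        provider, _, doc_id = hit["_id"].partition(":")
        if provider == "files" and doc_id.isdigit():
            ids.add(int(doc_id))
    return ids

def occ_file_ids(output: str) -> set:
    """Collect file IDs from occ search output lines such as ' - 518673 score:0'."""
    pattern = re.compile(r"^\s*-\s+(\d+)\s+score:", re.MULTILINE)
    return {int(m.group(1)) for m in pattern.finditer(output)}

# IDs taken from the curl response and the user0 occ output in this report.
es_response = {
    "hits": {"hits": [{"_id": f"files:{i}"}
                      for i in (6811, 6812, 1576, 518673, 6806, 466569)]}
}
occ_output = """search
> Files
 - 518673 score:0
 - 480179 score:0
 - 514585 score:0
 - 527182 score:0
 - 531276 score:0
 - 363692 score:0
"""

# Documents that ES finds by title but occ does not return for user0.
missing = sorted(es_file_ids(es_response) - occ_file_ids(occ_output))
print(missing)
```

Note that the curl query above matches on title only, while the occ search presumably also looks at content, so extra IDs on the occ side are not surprising; the IDs that ES returns but occ omits are the interesting ones.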

Would you be so kind as to explain why it is not working as expected? What should I do to make it work? Thanks!

R0Wi commented 1 year ago

To debug the ES queries generated by this app, I'd recommend setting loglevel to 0 in your config.php. The app should then log the query before sending it to the ES server (you should see a log message like Searching ES ... after running your sudo -u www-data php /var/www/nextcloud/occ fulltextsearch:search ... command).

Unfortunately, the debug logging is currently broken, so you might need to adjust the following code a bit to actually see the query body:

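try {
    // Serialize the full query body so it is readable in the Nextcloud log.
    $serializedParams = var_export($query['params'], true);
    $this->logger->debug('Searching ES: ' . $serializedParams);
    $result = $client->search($query['params']);
}
// (the existing catch/finally part of this try statement stays unchanged)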

(I will try to bring in a PR to fix the debug logging within the next few days.)

After that, analyze the JSON query that is sent to ES and now written to your NC log. Most likely it will contain a filter section which tries to ensure that users do not see results for files they have no permission to access. Try to figure out which of these filters is removing your expected results. I'd guess that the share_names array is the culprit:

"share_names" : {
            "user0" : "GroupA/Applications/Ivanova.docx",
            "user6" : "GroupA/Applications/Ivanova.docx",
            "user3" : "",
          },

An ES document is only visible to its owner or, in your case, also to anyone listed in the share_names array. I think your documents' share_names entries are simply missing most of your users (I'd expect users 1-10 to be listed here ...).
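
The visibility rule described above can be checked against a document's _source with a few lines of code. The following Python sketch is a simplification based only on this comment: is_visible_to is a hypothetical helper, and the app's real filter may also take users, groups, circles, or links into account.

```python
def is_visible_to(source: dict, user: str) -> bool:
    """Return True if `user` is the owner or has an entry in share_names."""
    if source.get("owner") == user:
        return True
    # share_names maps user IDs to paths; it may be empty or missing.
    share_names = source.get("share_names") or {}
    return user in share_names

# Trimmed _source of the Ivanova.docx document quoted above.
doc = {
    "owner": "user0",
    "share_names": {
        "user0": "GroupA/Applications/Ivanova.docx",
        "user6": "GroupA/Applications/Ivanova.docx",
        "user3": "",
    },
}

for user in ("user0", "user3", "user10"):
    print(user, is_visible_to(doc, user))
```

Under this simplified rule, user10 has no entry in share_names for that document, which would be consistent with the empty search result reported for that user.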

That would be a hint that your index process ended with errors. In that case, please reindex your documents and inspect your Nextcloud log file while leaving the loglevel set to 0.

it25fg commented 1 year ago

Out of curiosity: https://github.com/nextcloud/fulltextsearch_elasticsearch#compatibility says the app is ONLY compatible with ES 7? OP has used a container with ES 8.6.1 -- please somebody clarify if this CAN even work?

R0Wi commented 1 year ago

Docs are outdated. If you have a look at the composer dependencies, you'll see that since app version 26 the ES client 8.6.1 is used. So app version >=26 is ONLY compatible with ES server 8.6.x.
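
A quick way to confirm this pairing before indexing is to read version.number from the Elasticsearch root endpoint (GET /) and compare it with the app's major version. The Python sketch below encodes only the rule stated in this comment (app version 26 or later needs ES 8.6.x); the function name and the sample payload are illustrative, and older app versions are deliberately left undecided.

```python
import json
from typing import Optional

def es_matches_app(app_major: int, es_root_response: str) -> Optional[bool]:
    """Return True/False for app >= 26 per the rule above, None when the rule does not apply."""
    if app_major < 26:
        return None  # not covered by the statement in this thread
    number = json.loads(es_root_response)["version"]["number"]
    major, minor = number.split(".")[:2]
    return (int(major), int(minor)) == (8, 6)

# Trimmed example of the JSON body returned by GET / on an ES server.
sample = '{"version": {"number": "8.6.1"}}'
print(es_matches_app(26, sample))
```

The response body could be fetched with, for example, curl localhost:9200 and passed to the function as a string.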

it25fg commented 1 year ago

> Docs are outdated. If you have a look at the composer dependencies, you'll see that since app version 26 the ES client 8.6.1 is used. So app version >=26 is ONLY compatible with ES server 8.6.x.

Thanks for the clarification. As always, the real information is buried in the sources. Wouldn't it be a nice gesture to the admin who wants to install this app for his users: let him know upfront that this app is compatible with a distinct ElasticSearch version? (in Nextcloud admin panel -> apps -> fulltextsearch_elasticsearch -> details)?

it25fg commented 1 year ago

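Now I'm fully on track: NC on 26.0.4, ES on 8.6.1. And it seems the same result as described here: everything is fully indexed (index rebuilt with zero errors), but the query for files does not yield results. In particular:

  • I can query manually /indexname/_search?q=a_search_term and I get the expected entries. I can verify that all the infos around the document (owner, shares, groups etc.) are there.
  • A query done by occ fulltextsearch:search does not show this result, even if the querying user is the owner of the document, or verifiable in the 'users' array as well as in the 'share_names' dictionary. Other content providers don't seem to be affected (I have deck, this returns expected results).

Shall I open a new issue for this (the difference is: there are no groupfolders involved) or which info do you need to track this down?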

ostasevych commented 1 year ago

> Now I'm fully on track: NC on 26.0.4, ES on 8.6.1. And it seems the same result as described here: everything is fully indexed (index rebuilt with zero errors), but the query for files does not yield results. In particular:
>
>   • I can query manually /indexname/_search?q=a_search_term and I get the expected entries. I can verify that all the infos around the document (owner, shares, groups etc.) are there.
>   • A query done by occ fulltextsearch:search does not show this result, even if the querying user is the owner of the document, or verifiable in the 'users' array as well as in the 'share_names' dictionary. Other content providers don't seem to be affected (I have deck, this returns expected results).
>
> Shall I open a new issue for this (the difference is: there are no groupfolders involved) or which info do you need to track this down?

Reading your observations, I understand that the issue is hiding somewhere deeper. After the initial indexing, which produced ambiguous results, the Elasticsearch + Nextcloud setup starts working fine: when I modify, delete, or add a document, cron does its work and the documents are indexed properly, regardless of whether they are in local or group folders.

it25fg commented 1 year ago

> Now I'm fully on track: NC on 26.0.4, ES on 8.6.1. And it seems the same result as described here: everything is fully indexed (index rebuilt with zero errors), but the query for files does not yield results. In particular:
>
>   • I can query manually /indexname/_search?q=a_search_term and I get the expected entries. I can verify that all the infos around the document (owner, shares, groups etc.) are there.
>   • A query done by occ fulltextsearch:search does not show this result, even if the querying user is the owner of the document, or verifiable in the 'users' array as well as in the 'share_names' dictionary. Other content providers don't seem to be affected (I have deck, this returns expected results).
>
> Shall I open a new issue for this (the difference is: there are no groupfolders involved) or which info do you need to track this down?

I have decided to open #277 for this problem. It seems too different from this issue which is about indexing.