nextcloud / files_fulltextsearch

🔍 Index the content of your files

Very slow Indexing of Shared Files and a possible solution #281

Open rolandinus opened 2 months ago

rolandinus commented 2 months ago

When a file is updated in the index (e.g., renamed), the share names for all users with access to this file are updated, even if the user has not changed the share name. This process is extremely slow when there are many users with access to the file. It's likely related to Issue #256, which would be resolved if this process were faster. There are also reports in the Nextcloud forums that seem to be related.

Current Behavior

Updating a single file triggers share name updates for all users with access. Renaming a folder updates all files in all subfolders. On large systems with many files and users, this can lead to an indexing queue that takes an excessive amount of time to complete (e.g., a week).

Details

I identified the performance bottleneck in the following function in the FilesService:

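private function getPathFromViewerId(int $fileId, string $viewerId): string {
    // Look up the file by its ID inside the viewer's own folder.
    $viewerFiles = $this->rootFolder->getUserFolder($viewerId)
        ->getById($fileId);
    // ... (rest of the function omitted from this excerpt)
}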

Specifically, the ->getById($fileId) call is causing the slowdown.

Proposed Solution

I tried using the owner's file path as a first guess for the other users with access, since this is the default in most cases. Checking that guess with nodeExists($path) in each user's folder is approximately 50-100 times faster than calling getPathFromViewerId. (If the file is already in the index, the current share names might be a better first guess.) If the guessed path is not valid for a user, fall back to the current method.

This approach should work well since the fulltext index only stores one access path per user anyway.
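
A minimal sketch of the guess-then-fall-back idea, not the actual test implementation. The function name resolveViewerPath and its two callable parameters are hypothetical: $nodeExists stands in for getUserFolder($viewerId)->nodeExists($path), and $slowLookup stands in for the existing getPathFromViewerId. The in-memory data at the bottom is made up for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: return the path under which $viewerId sees the file,
 * trying a cheap existence check on a guessed path before the slow lookup.
 *
 * @param callable $nodeExists fn(string $viewerId, string $path): bool
 * @param callable $slowLookup fn(int $fileId, string $viewerId): string
 */
function resolveViewerPath(
    int $fileId,
    string $viewerId,
    string $guessedPath,
    callable $nodeExists,
    callable $slowLookup
): string {
    // Fast path: the guess (e.g. the owner's path) is valid for this viewer.
    if ($guessedPath !== '' && $nodeExists($viewerId, $guessedPath)) {
        return $guessedPath;
    }
    // Slow path: fall back to the current ID-based lookup.
    return $slowLookup($fileId, $viewerId);
}

// Made-up data: the path each viewer actually has for the shared file.
$paths = ['alice' => 'Projects/report.odt', 'bob' => 'Shared/renamed.odt'];

// In Nextcloud this could wrap $rootFolder->getUserFolder($viewer)->nodeExists($path).
$nodeExists = fn(string $viewer, string $path): bool => ($paths[$viewer] ?? null) === $path;

// Counts how often the expensive lookup is needed.
$slowCalls = 0;
$slowLookup = function (int $fileId, string $viewer) use (&$slowCalls, $paths): string {
    $slowCalls++;
    return $paths[$viewer] ?? '';
};

$guess = 'Projects/report.odt'; // the owner's path, used as the guess
$alicePath = resolveViewerPath(42, 'alice', $guess, $nodeExists, $slowLookup);
$bobPath = resolveViewerPath(42, 'bob', $guess, $nodeExists, $slowLookup);
```

Here alice resolves through the fast path, while bob, who renamed the share, triggers the fallback once. One caveat to weigh: a path that exists in a viewer's folder is not necessarily the same file, so comparing the ID of the node found at the guessed path against $fileId may be worth the extra call.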

I have created a test implementation of the proposed solution. From an initial test, it seems to work fine and it is a lot faster.

Questions for Maintainers

Are there any potential side effects or edge cases to consider? Do you have an idea for a better approach? I am happy to come up with a different solution if someone can give me a hint.

I am happy to create a pull request, but first I would like to have some feedback.