nextcloud / server

☁️ Nextcloud server, a safe home for all your data
https://nextcloud.com
GNU Affero General Public License v3.0
26.45k stars 3.98k forks source link

[Bug]: File and tag search slow if a lot of shares present #35776

Open XueSheng-GIT opened 1 year ago

XueSheng-GIT commented 1 year ago

⚠️ This issue respects the following points: ⚠️

Bug description

Global/Unified search for files and tags is slow if a lot of shares are present. Search result for fulltextsearch, collectives, talk, deck and mails popup nearly instantly, but result for files and tags takes ages until they appear (approx. 30 seconds). Tested accounts have approx. 600 shares (according to oc_shares share_with column. Talk is used a lot for sharing photos, which is probably the main reason for the amount of shares.

Postgres log shows slow query... see below.

Doing the same on a cloned server instance without these shares (shares removed), the search results for files and tags appear within 2 seconds.

Just want to reference https://github.com/nextcloud/server/issues/23835 which really improved things in regards to search speed. Unfortunately it seems this issue is not fixed if a lot of shares are present.

Steps to reproduce

  1. Login to an account with a lot of received shares (in my case approx. 600).
  2. Open search (it makes no difference if you use the webif or the android client).
  3. Enter a search string
  4. See first search results... spinner keeps active for approx 30 seconds until results for files and tags are shown.

Expected behavior

Search result for files and tags should appear as fast as possible, even if a lot of shares are received.

Installation method

Community Manual installation with Archive

Operating system

Debian/Ubuntu

PHP engine version

PHP 8.1

Web server

Apache (supported)

Database engine version

PostgreSQL

Is this bug present after an update or on a fresh install?

Updated to a major version (ex. 22.2.3 to 23.0.1)

Are you using the Nextcloud Server Encryption module?

None

What user-backends are you using?

Configuration report

{
    "system": {
        "instanceid": "***REMOVED SENSITIVE VALUE***",
        "passwordsalt": "***REMOVED SENSITIVE VALUE***",
        "secret": "***REMOVED SENSITIVE VALUE***",
        "trusted_domains": [
            "cloud.domain.de"
        ],
        "datadirectory": "***REMOVED SENSITIVE VALUE***",
        "dbtype": "pgsql",
        "version": "25.0.2.3",
        "dbname": "***REMOVED SENSITIVE VALUE***",
        "dbhost": "***REMOVED SENSITIVE VALUE***",
        "dbport": "",
        "dbtableprefix": "oc_",
        "dbuser": "***REMOVED SENSITIVE VALUE***",
        "dbpassword": "***REMOVED SENSITIVE VALUE***",
        "installed": true,
        "forcessl": true,
        "forceSSLforSubdomains": true,
        "overwrite.cli.url": "https:\/\/cloud.domain.de\/",
        "overwritehost": "cloud.domain.de",
        "overwriteprotocol": "https",
        "overwritewebroot": "\/",
        "htaccess.RewriteBase": "\/",
        "trusted_proxies": "***REMOVED SENSITIVE VALUE***",
        "mail_smtpmode": "smtp",
        "mail_smtpsecure": "ssl",
        "mail_sendmailmode": "smtp",
        "mail_from_address": "***REMOVED SENSITIVE VALUE***",
        "mail_domain": "***REMOVED SENSITIVE VALUE***",
        "mail_smtpauthtype": "LOGIN",
        "mail_smtpauth": true,
        "mail_smtphost": "***REMOVED SENSITIVE VALUE***",
        "mail_smtpport": "465",
        "mail_smtpname": "***REMOVED SENSITIVE VALUE***",
        "mail_smtppassword": "***REMOVED SENSITIVE VALUE***",
        "theme": "",
        "logtimezone": "Europe\/Berlin",
        "loglevel": 3,
        "log_rotate_size": 104857600,
        "maintenance": false,
        "trashbin_retention_obligation": "30, 180",
        "memcache.local": "\\OC\\Memcache\\APCu",
        "memcache.distributed": "\\OC\\Memcache\\Redis",
        "memcache.locking": "\\OC\\Memcache\\Redis",
        "redis": {
            "host": "***REMOVED SENSITIVE VALUE***",
            "port": 0,
            "dbindex": 0,
            "password": "***REMOVED SENSITIVE VALUE***",
            "timeout": 1.5
        },
        "enabledPreviewProviders": [
            "OC\\Preview\\PNG",
            "OC\\Preview\\JPEG",
            "OC\\Preview\\GIF",
            "OC\\Preview\\HEIC",
            "OC\\Preview\\BMP",
            "OC\\Preview\\XBitmap",
            "OC\\Preview\\MP3",
            "OC\\Preview\\TXT",
            "OC\\Preview\\MarkDown",
            "OC\\Preview\\Movie",
            "OC\\Preview\\MKV",
            "OC\\Preview\\MP4",
            "OC\\Preview\\AVI"
        ],
        "preview_max_x": "2048",
        "preview_max_y": "2048",
        "jpeg_quality": "60",
        "updater.release.channel": "stable",
        "app_install_overwrite": [
            "fulltextsearch",
            "fulltextsearch_elasticsearch",
            "files_fulltextsearch",
            "files_fulltextsearch_tesseract"
        ],
        "default_language": "de",
        "default_locale": "de_DE",
        "default_phone_region": "DE",
        "allow_local_remote_servers": true,
        "activity_use_cached_mountpoints": true
    }
}

List of activated Apps

Enabled:
  - activity: 2.17.0
  - admin_audit: 1.15.0
  - bookmarks: 12.0.0
  - bruteforcesettings: 2.5.0
  - calendar: 4.1.0
  - circles: 25.0.0
  - cloud_federation_api: 1.8.0
  - collectives: 2.1.1
  - comments: 1.15.0
  - contacts: 5.0.1
  - contactsinteraction: 1.6.0
  - dashboard: 7.5.0
  - dav: 1.24.0
  - deck: 1.8.2
  - drop_account: 2.1.0
  - federatedfilesharing: 1.15.0
  - federation: 1.15.0
  - files: 1.20.1
  - files_accesscontrol: 1.15.0
  - files_automatedtagging: 1.15.0
  - files_external: 1.17.0
  - files_fulltextsearch: 25.0.0
  - files_fulltextsearch_tesseract: 25.0.0
  - files_pdfviewer: 2.6.0
  - files_retention: 1.14.0
  - files_rightclick: 1.4.0
  - files_sharing: 1.17.0
  - files_trashbin: 1.15.0
  - files_versions: 1.18.0
  - firstrunwizard: 2.14.0
  - fulltextsearch: 25.0.0
  - fulltextsearch_elasticsearch: 25.0.0
  - groupfolders: 13.1.0
  - keeweb: 0.6.10
  - logreader: 2.10.0
  - lookup_server_connector: 1.13.0
  - mail: 2.2.1
  - nextcloud_announcements: 1.14.0
  - notifications: 2.13.1
  - notify_push: 0.5.0
  - oauth2: 1.13.0
  - password_policy: 1.15.0
  - photos: 2.0.1
  - previewgenerator: 5.1.1
  - privacy: 1.9.0
  - provisioning_api: 1.15.0
  - recommendations: 1.4.0
  - related_resources: 1.0.3
  - richdocuments: 7.0.2
  - serverinfo: 1.15.0
  - settings: 1.7.0
  - sharebymail: 1.15.0
  - spreed: 15.0.2
  - support: 1.8.0
  - survey_client: 1.13.0
  - systemtags: 1.15.0
  - text: 3.6.0
  - theming: 2.0.1
  - twofactor_backupcodes: 1.14.0
  - twofactor_nextcloud_notification: 3.5.0
  - twofactor_totp: 7.0.0
  - updatenotification: 1.15.0
  - user_status: 1.5.0
  - viewer: 1.9.0
  - weather_status: 1.5.0
  - welcome: 1.0.2
  - workflowengine: 2.7.0
Disabled:
  - emlviewer: 1.0.4
  - encryption: 2.8.1
  - suspicious_login
  - user_ldap

Nextcloud Signing status

No errors have been found.

Nextcloud Logs

No response

Additional info

Postgres log shows slow search query for files and tags: slow_search.log

Mer0me commented 1 year ago

We have 600k lines on oc_filecache table and file search is almost unusable, taking more than 30 seconds to show results. On top of that, Nextcloud unified search is looking for exact words, in the same order as user input which is not very efficient.

So, we have decided to write our own search form, outside of Nextcloud (added to the app menu using external sites app, as we are not able to make a real Nextcloud App), but using the same database (mariadb here). We've added some extra search functionalities :

Search is based on the user profile, so only the files shared to, or owned by the connected user appear on the results.

Surprisingly, our search engine is significantly faster than the official nextcloud one. It takes about 2 seconds to display results. There is certainly a lot of possible optimizations to do.

Feel free to comment, use, adapt this bunch of code as we are not going to make an app or whatever with it. If it may helps someone...

<?php

/* 
    Nextcloud alternative search engine offering :
    - search by file id
    - search by user name
    - search between 2 dates
    - search all the input terms in various order
    - search is accent insensitive 
    - export results in CSV

    search results are based on the user in $_SESSION["username"] variable
    you have to write your own user authentication to make it works...

*/

session_start();

// function returning the icon corresponding to a mime type
function getImgExtension($fileName, $mime)
{
    $file = explode('.', $fileName);
    switch ($file[sizeof($file) - 1]) {
        case 'doc':
            return 'icons/doc.png';
        case 'docx':
            return 'icons/doc.png';
        case 'jpg':
            return 'icons/jpg.png';
        case 'jpeg':
            return 'icons/jpg.png';
        case 'mp3':
            return 'icons/mp3.png';
        case 'pdf':
            return 'icons/pdf.png';
        case 'png':
            return 'icons/png.png';
        case 'ppt':
            return 'icons/ppt.png';
        case 'pptx':
            return 'icons/ppt.png';
        case 'csv':
            return 'icons/csv.png';
        case 'csv':
            return 'icons/csv.png';
        case 'csvx':
            return 'icons/csv.png';
        case 'zip':
            return 'icons/zip.png';
        case '7z':
            return 'icons/zip.png';
        case 'eml':
            return 'icons/mail.png';
        case 'rar':
            return 'icons/rar.png';
        case 'km':
            return 'icons/mind.png';
        default:
            if ($mime == 2) {
                return 'icons/folder.png';
            } else {
                return 'icons/unknown.png';
            }
    }
}

// Get rid of the accented letters
function stripAccents($str) {
    return strtr(utf8_decode($str), utf8_decode("àáâãäçèéêëìíîïñòóôõöùúûüýÿÀÁÂÃÄÇÈÉÊËÌÍÎÏÑÒÓÔÕÖÙÚÛÜÝ'"), "aaaaaceeeeiiiinooooouuuuyyAAAAACEEEEIIIINOOOOOUUUUY ");
}

$bdd = mysqli_connect("host", "user", "password", "database");
if (!$bdd) {
    die('Connection error (' . mysqli_connect_errno() . ') '
        . mysqli_connect_error());
}

if (isset($_POST['documentName']) || isset($_POST['documentOwnerUsername'])) {
    // Constructing the query used to retrieve all visible mount points to the user, excluding trash_bin
    $sql = "SELECT `storage_id`, `root_id`, `user_id`, `mount_point`, `mount_id`, `f`.`path`, `mount_provider_class`,id FROM `oc_mounts` `m` INNER JOIN `oc_filecache` `f` ON `m`.`root_id` = `f`.`fileid` WHERE `user_id` = '" . $_SESSION["username"] . "' and path not like '%trashbin%' order by path asc";
    $mounts = mysqli_query($bdd, $sql);
    $paths = "(1=0";
    $storage="";
    while ($mount = mysqli_fetch_array($mounts)) {
        if ($storage != $mount["storage_id"] || strpos($mount["path"], $path) === false) {
            $paths .= " OR (path like '" . addslashes($mount["path"]) . "%' and storage=" . $mount["storage_id"] . " and oc_mounts.id=" . $mount["id"] . ")\n";
            $storage = $mount["storage_id"];
            $path = $mount["path"];
        }
    }
    $paths .= ")";

    // Constructing the query used to retrieve all files matching the search query (all terms must be present, in any order)
    $criteres = "((1=1";
    $debug = false;
    if ($_POST['documentName'] != "") {

        $termes = explode(" ", stripAccents(strtolower($_POST['documentName'])));

        foreach ($termes as $terme) {
            // lots of chained sql "replace" here because of normalization of filename
            $criteres .= " and (concat(' ',replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(lower(name),'é','e'),'è','e'),'ë','e'),'ê','e'),'ô','o'),'à','a'),'ï','i'),'î','i'),'â','a'),'ù','u'),'\'',' '),'.',' '),' ') like '% " . addslashes($terme) . " %'\n";
            // eventually removing trailing 's' to search term without plural
            if (substr($terme, -1) == "s") {
                $criteres .= " or concat(' ',replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(lower(name),'é','e'),'è','e'),'ë','e'),'ê','e'),'ô','o'),'à','a'),'ï','i'),'î','i'),'â','a'),'ù','u'),'\'',' '),'.',' '),' ') like '% " . addslashes(substr($terme, 0, -1)) . " %')\n";
            } else {
                // replacing trailing "aux" by "al", as plural of most word finishing by "al" is "aux" in french (cheval => chevaux)
                if (substr($terme, -3) == "aux") {
                    $criteres .= " or concat(' ',replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(lower(name),'é','e'),'è','e'),'ë','e'),'ê','e'),'ô','o'),'à','a'),'ï','i'),'î','i'),'â','a'),'ù','u'),'\'',' '),'.',' '),' ') like '% " . addslashes(substr($terme, 0, -3)) . "al %')\n";
                } 
                // adding a trailing 's' to search plural too
                else {
                    $criteres .= " or concat(' ',replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(lower(name),'é','e'),'è','e'),'ë','e'),'ê','e'),'ô','o'),'à','a'),'ï','i'),'î','i'),'â','a'),'ù','u'),'\'',' '),'.',' '),' ') like '% " . addslashes($terme) . "_ %')\n";
                }
            }
        }
    }
    // searching by fileid
    $criteres .= ") or fileid='" . addslashes($_POST["documentName"]) . "')";

    // base search query, excluding file versions and trashbin
    $sqlid = "select fileid from oc_filecache where " . $criteres . " and path not like 'files_versions%' and path not like 'files_trashbin%'";;
    // if dates have been specified, search only files modified between these dates
    if ($_POST["date"] != "") {
        $splDate = explode("&", $_POST['date']);
        $dStart = DateTime::createFromFormat('d-m-Y', $splDate[0]);
        $dEnd = DateTime::createFromFormat('d-m-Y', $splDate[1]);
        $sqlid .= " AND mtime BETWEEN " . $dStart->getTimestamp() . " AND " . $dEnd->getTimestamp();
    }
    $resid = mysqli_query($bdd, $sqlid);
    $listeid = "-1";
    // Constructing the list of matching id, regardless of the user right access for the moment
    while ($id = mysqli_fetch_array($resid)) {
        $listeid .= "," . $id["fileid"];
    }

    // Filtering results with visible mount points
    $sql = "select fileid,name,mtime,ldap_dn,mimetype,mount_point,path,oc_storages.id as proprio,parent from oc_filecache,oc_storages,oc_ldap_user_mapping,oc_mounts
              where storage=numeric_id and (owncloud_name=substring_index(oc_storages.id,'::',-1) or (storage = 192 and owncloud_name='adminsi')) and oc_mounts.user_id='" . $_SESSION["username"] . "' and
                " . $paths . " AND fileid in (" . $listeid . ")";

    // If a username has been specified, filtering by username
    if ($_POST["documentOwner"] != "") {
        $sql .= " AND owncloud_name='" . $_POST["documentOwner"] . "' ";
    }

    // ordering results by modification time
    $sql .= " order by mtime +0 desc";
    $res = mysqli_query($bdd, $sql);
    $nbdoc = mysqli_num_rows($res);
}

?>

<!doctype html>
<html lang="fr">

<head>
    <!-- Required meta tags -->
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">

    <!-- Bootstrap CSS -->
    <link href="bootstrap/css/bootstrap.min.css" rel="stylesheet">
    <link href="transition.css" rel="stylesheet">
    <link rel="stylesheet" type="text/css" href="https://cdn.jsdelivr.net/npm/daterangepicker/daterangepicker.css" />
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/@tarekraafat/autocomplete.js@10.2.7/dist/css/autoComplete.min.css">
    <title>Moteur de recherche cloud PMA</title>
</head>

<body id="body">
    <?php

    if (isset($_SESSION['username'])) { ?>
        <div class="container">
            <div class="row">
                <a href="logout.php" style="text-align: right; color: grey;">Disconnect</a>
            </div>
            <div class="row">
                <h1>Advanced search</h1>
            </div>
            <form method="POST" autocomplete="off">
                <div class="row mt-2">
                    <div class="row">
                        <div class="col mt-3">
                            <label for="documentName" class="form-label" id="documentNameLabel">Filename or file id</label>
                            <input type="text" class="form-control" id="documentName" placeholder="space separated terms" name="documentName" value="<?php echo (isset($_POST['documentName']) ? $_POST['documentName'] : '') ?>">
                        </div>
                        <div class="col mt-3">
                            <label for="documentOwner" class="form-label">Owner</label>
                            <input type="text" name="documentOwner" id="documentOwner" class="form-control">
                            <span id="search_result"></span>
                        </div>
                        <div class="col m-3">
                            <div class="row">
                                <div class="col">
                                    <label for="date" class="form-label">Modification time</label>
                                    <div id="reportrange" style="background: #fff; cursor: pointer; padding: 5px 10px; border: 1px solid #ccc; width: 100%">
                                        <i class="fa fa-calendar"></i>&nbsp;
                                        <span></span> <i class="fa fa-caret-down"></i>
                                    </div>
                                    <input type="hidden" value="" name="date" id="date">
                                </div>
                            </div>
                        </div>
                    </div>
                    <div class="row">
                        <div class="col">
                            <input class="btn btn-primary" type="submit" value="Rechercher">
                        </div>
                        <div class="col-10"></div>
                    </div>
            </form>
            <div class="row mt-4">
                <hr style="color:#932D30">
            </div>
            <?php if (isset($nbdoc)) { ?>
                <div class="row mt-3">
                    <?= $nbdoc ?> result(s)
                </div>
                <div class="row mt-5 align-center">
                    <div class="col">
                        <?php
                        $_SESSION["csv"] = "CID;Filename;Owner;Modification time;Path\n";
                        // We use a LDAP backend, if you don't, you should modify this
                        foreach ($res as $r) {
                            $owner = explode(",", $r['ldap_dn'], 2);
                            $owner = str_replace("cn=", "", $owner[0]);
                        ?>
                            <div class="card mt-2 mb-2">
                                <a href="<?= 'https://yourcloud.com/index.php/f/' . $r['fileid'] ?>" target="_blank">
                                    <div class="card-body">
                                        <h5 class="card-title"><img src="<?= getImgExtension($r['name'], $r['mimetype']) ?>" alt="" width="30" height="30">&nbsp;<?= $r['name'] ?></h5>
                                        <p class="card-text"><span style="color: 808080;">
                                                <?php

                                                // Computing the file path for this user when possible
                                                $tchemin = explode("/", $r["path"]);
                                                $chemin = "";
                                                if (strpos($r["path"], "groupfolders") !== false) {
                                                    for ($i = 2; $i < sizeof($tchemin) - 1; $i++) {
                                                        $chemin .= $tchemin[$i] . "/";
                                                    }
                                                } else {
                                                    $correspondance = false;
                                                    for ($i = 1; $i < sizeof($tchemin) - 1; $i++) {
                                                        $chemin .= $tchemin[$i] . "/";
                                                        if ($p[1] != $_SESSION["username"] && $tchemin[$i] == substr($r['mount_point'], strlen($_SESSION["username"]) + 8, -1)) {
                                                            $chemin = "";
                                                            $correspondance = true;
                                                        }
                                                    }
                                                }
                                                // If the shared file have been renamed, we cannot find the real path
                                                if ($chemin == "") $chemin = "/";
                                                if ($r["name"] == substr($r['mount_point'], strlen($_SESSION["username"]) + 8, -1)) {
                                                    $dans = " - <a href=\"https://yourcloud.com/index.php/apps/files/\">Root folder</a>";
                                                    $danscsv = "Root folder";
                                                } else {
                                                    $dans = " - <a href=\"https://yourcloud.com/index.php/f/" . $r['parent'] . "\" target=\"_blank\">In the folder " . substr($r['mount_point'], strlen($_SESSION["username"]) + 8) . $chemin . "</a>";
                                                    $danscsv = "In the folder " . substr($r['mount_point'], strlen($_SESSION["username"]) + 8) . $chemin;
                                                }
                                                if ($p[1] != $_SESSION["username"] && !$correspondance && strpos($r["path"], "groupfolders") === false) {
                                                    $dans = " - <a href=\"https://yourcloud.com/index.php/f/" . $r['parent'] . "\" target=\"_blank\">In the folder " . substr($r['mount_point'], strlen($_SESSION["username"]) + 8) . ".../" . $tchemin[sizeof($tchemin) - 2] . "/</a>";
                                                    $danscsv = "In the folder " . substr($r['mount_point'], strlen($_SESSION["username"]) + 8) . ".../" . $tchemin[sizeof($tchemin) - 2] . "/";
                                                }
                                                echo ($owner == "admin") ? "Group folder" : mb_strtoupper($owner) ?> - Last modification : <?= date('d/m/Y H:i', $r['mtime']); ?> <?= $dans ?></span> </p>
                                    </div>
                                </a>
                            </div>
                        <?php
                            $_SESSION["csv"] .= $r['fileid'] . ";" . $r['name'] . ";" . (($owner == "admin") ? "Group folder" : mb_strtoupper($owner)) . ";" . date('d/m/Y H:i', $r['mtime']) . ";" . $danscsv . "\n";
                        }
                        // you may need to write this script
                        echo ("<a href=\"csv.php\">Click here to get this list in CSV format</a>");

                        ?>
                    </div>
                </div>
            <?php } ?>
        </div>
        </div>
    <?php } else {

            // You should write here your own login form
            $_SESSION["username"]="sample_user";

     } ?>

    <script type="text/javascript" src="js/jquery.min.js"></script>
    <script type="text/javascript" src="js/moment.min.js"></script>
    <script type="text/javascript" src="js/daterangepicker.min.js"></script>
    <script type="text/javascript" src="js/autoComplete.min.js"></script>
    <script type="text/javascript">
        $(function() {

            // double date picker

            var start = moment().subtract(365, 'days');
            var end = moment();

            function cb(start, end) {
                $('#reportrange span').html(start.format('DD/MM/YYYY') + ' - ' + end.format('DD/MM/YYYY'));
                document.getElementById('date').value = start.format('DD-MM-YYYY') + '&' + end.format('DD-MM-YYYY');
            }

            $('#reportrange').daterangepicker({
                startDate: start,
                endDate: end,
                ranges: {
                    "Today": [moment(), moment()],
                    'Last 30 days': [moment().subtract(29, 'days'), moment()],
                    'This month': [moment().startOf('month'), moment().endOf('month')],
                    'Last month': [moment().subtract(1, 'month').startOf('month'), moment().subtract(1, 'month').endOf('month')],
                    '3 last months': [moment().subtract(3, 'month'), moment()],
                    'This year': [moment().startOf('year'), moment().endOf('year')],
                    'Last year': [moment().subtract(365, 'days'), moment()],
                    'Whenever': [moment().subtract(100, 'years'), moment()]
                }
            }, cb);

            cb(start, end);

        });
    </script>
    </div>
</body>

</html>
Mer0me commented 1 year ago

Here is a search example with terms "numérique collectivité" performed in 2 seconds. You can see in the results list some filenames where terms are not in the same order than in the user input. You can see also that a filename containing "collectivités" with a trailing 's' is retrieved as well :

image

With nextcloud official unified search, these files are not retrieved, and it takes about a minute to finish...

XueSheng-GIT commented 1 year ago

Thanks a lot @Mer0me for sharing your investigations and your approach to bypass the issue. Search is really unusable if the users share a lot of files. Is there any specific reason why you are not interested in "fixing" the issue of unified search itself?

Mer0me commented 1 year ago

Thank you for your comment.

We were able to write a quick and dirty search engine to address our particular needs, but I'm not sure (read : I'm sure of the opposite) this approach is scalable and good enough to be used as a Nextcloud unified search replacement.

I'm glad to help the Nextcloud community to find solutions but :

But if this contribution can help to find why Nextcloud search engine is so slow, and if I can personally help to improve it, I will certainly do.

gennaios commented 1 year ago

@Mer0me I’m very happy that you have commented about current issues of search and have done something about it. Of another thread with users asking for similar things, see #29614.

Agree with you on various points such as specifying search terms in any order, searching multiple fields, near instant results, etc. I think one should also be able to sort such by relevance, file name, date, etc. and have search re-sort near instantly.

Perhaps you wish to look into MySQL FTS. Such offers exactly what you wish for. Plus diacritic insensitive search so there is no need to remove accents, diacritics, or manage such. I think the main reason for such not being added yet is perhaps the files portion of Nextcloud has been somewhat neglected as more features are added to increase market share, profit, and merely keeping the business open, plus likely responding to competitors features by adding similar. Perhaps too that since Nextcloud supports multiple databases, such would have to be added for all DBs: SQLite, MySQL/MariaDB, and PostgreSQL. All offer such full-text search, not to be confused with FTS plugins and searching text contents, yet the code for each is different and perhaps no one on the team has done such before, is interested, or considers such important enough. My impression is that perhaps Nextcloud has no one on their team with decent enough database experience.

See ownCloud OCIS for an example of decent search implementation. It shows up in a full window, etc. It’s as if no one in the Nextcloud team uses file search on their own desktop, being able to search an entire file system with instant results, and wants the same on Nextcloud. I store 100+ GB of files and searching now really is what do I even say, for how long is it going to remain so poor? :)

https://dev.mysql.com/doc/refman/8.0/en/fulltext-boolean.html https://dev.mysql.com/doc/refman/8.0/en/fulltext-natural-language.html

Mer0me commented 1 year ago

Unified search of the last versions of Nextcloud (since 25.0 maybe) is now very quick. We always need to search terms in the same order, but the results are near instantly displayed, so it makes sense again.

gennaios commented 1 year ago

Unified search of the last versions of Nextcloud (since 25.0 maybe) is now very quick. We always need to search terms in the same order, but the results are near instantly displayed, so it makes sense again.

Perhaps their way to make it seem quick is merely to show the first 5 results, and then require the user to select 'Load more results' each time to see another five.

Doesn't anyone have tens of thousands of files or more, and with a particular search string may potentially have dozens, hundreds, or more search results, and wants to see them all at once? Imagine Google, Amazon, your own file system, your contacts list on computer or phone, if all of them acted in the same way, showing five results, and then requiring you to select more each and everytime. Really doesn't anyone see the madness?

XueSheng-GIT commented 1 year ago

I'm currently testing https://github.com/nextcloud/server/pull/37061. This does really speed up things. I've posted my webdav-search results over there. Also for unified search the results are amazing. On my test instance, search timed out (>60s, gateway time-out). After applying the patch, search takes less than 2 seconds! I also did some tests on my productive system with an speed up from 42s to less than 2s.

icewind1991 commented 11 months ago

I would appreciate it if people can also test https://github.com/nextcloud/server/pull/40555, it should achieve similar effects to https://github.com/nextcloud/server/pull/37061

XueSheng-GIT commented 11 months ago

@icewind1991 thanks for your efforts!

I did a quick test on one of my instances (updated to NC 27.1.1) and created 1,000 empty text files and shared them via talk. https://github.com/nextcloud/server/pull/40555 was not applied 100% because SearchBuilderTest.php doesn't seem to exist on 27.1.1.

Patch Search time
Default (no patch): 37,594 ms
PR https://github.com/nextcloud/server/pull/37061 447 ms
PR https://github.com/nextcloud/server/pull/40555 887 ms

https://github.com/nextcloud/server/pull/37061 is finally faster in this scenario, but https://github.com/nextcloud/server/pull/40555 is still a major improvement compared to the default installation (42 times faster).

UPDATE: Forgot to add the logs to the search times above... query pattern_no patch (NC2711).log query pattern_PR37061.log query pattern_PR40555.log

icewind1991 commented 11 months ago

@XueSheng-GIT thanks for the testing.

The resulting query isn't quite what I expected (I would expect to see path IN (...) instead of path_hash IN (...)) which could explain the difference in performance between the two patches.

Can you try applying the latest commit I pushed to the PR and see if that makes a difference

XueSheng-GIT commented 11 months ago

@icewind1991 I'm just only on my mobile. Thus, a bit limited with testing. But here you go with the updated PR (I did a new run for all variants):

Patch Search time
Default (no patch): 38,295 ms
PR https://github.com/nextcloud/server/pull/37061 543 ms
PR https://github.com/nextcloud/server/pull/40555 (updated) 537 ms

Seems to be a quite good speedup for the updated PR https://github.com/nextcloud/server/pull/40555

Logs to the search times above... query pattern_no-patch-2.log query pattern_PR37061-2.log query pattern_PR40555-2.log

Does the query now look like intended?

icewind1991 commented 11 months ago

Yes, this looks as expected. Thanks again for the testing

XueSheng-GIT commented 11 months ago

@icewind1991 I did some further testing on one of my productive systems (NC27.1.1) which was always slow on search before using https://github.com/nextcloud/server/pull/37061 and wasn't able to notice any search speedup using https://github.com/nextcloud/server/pull/40555 (same patch version as used for previous test).

Patch Search time
Default (no patch): 33,260 ms
PR https://github.com/nextcloud/server/pull/37061 467 ms
PR https://github.com/nextcloud/server/pull/40555 (updated) 30,047 ms

Logs to the search times above... query pattern_no-patch.log query pattern_PR37061.log query pattern_PR40555.log

Any idea why there's no speedup in this case?

XueSheng-GIT commented 10 months ago

@icewind1991 Any idea why search is slow in my latest test https://github.com/nextcloud/server/issues/35776#issuecomment-1732223031?

XueSheng-GIT commented 8 months ago

Unfortunately on NC28 the patches provided in https://github.com/nextcloud/server/pull/37061 or https://github.com/nextcloud/server/pull/40555 are not compatible anymore. Default search of NC28 is still slow if a lot of shares are present.

Whereas I used https://github.com/nextcloud/server/pull/37061 in production because https://github.com/nextcloud/server/pull/40555 was still slow in some cases (see comment above https://github.com/nextcloud/server/issues/35776#issuecomment-1732223031)

@icewind1991 @starypatyk any plans to update the query optimization for NC28?

starypatyk commented 8 months ago

@XueSheng-GIT - I put my PR on hold in favor of #40555. Now, I do not know what to do next, as #40555 did not progress since September. :disappointed:

XueSheng-GIT commented 5 months ago

Now that https://github.com/nextcloud/server/pull/40555 was merged into master, I did some further testing on NC28.0.3rc2. Long story short, as already observed above (https://github.com/nextcloud/server/issues/35776#issuecomment-1732223031), https://github.com/nextcloud/server/pull/40555 does not improve things on my production instance. Whereas https://github.com/nextcloud/server/pull/37061 does.

Patch Search time
Default (no patch): 33,689 ms
PR https://github.com/nextcloud/server/pull/37061 (updated to NC28) 520 ms
PR https://github.com/nextcloud/server/pull/40555 38,547 ms

Logs to the search times above... query pattern_nc28_no-patch.log query pattern_nc28_PR37061.log query pattern_nc28_PR40555.log

@icewind1991 Thanks a lot for taking care of this matter and merging https://github.com/nextcloud/server/pull/40555 into master! I'm not really into detail about the approach of https://github.com/nextcloud/server/pull/40555, but on the first view https://github.com/nextcloud/server/pull/37061 does include the following pattern which may be the reason for the difference:

(("storage" = $25) AND ("path_hash" IN ($26, $27, ...)))

CC @starypatyk because it was his approach. Maybe he has some additional idea why https://github.com/nextcloud/server/pull/37061 is so much faster than https://github.com/nextcloud/server/pull/40555.

starypatyk commented 5 months ago

@XueSheng-GIT Thanks for your tests. :+1:

Indeed the query should be optimized as you describe, but for some reason this does not happen.

@icewind1991 I created a few simple tests that show the issue - please see my branch https://github.com/nextcloud/server/commits/query-optimizer-search-issue/. I am not sure, if I should create a PR from this branch, as it contains failing tests only. Feel free to use these tests, if you think they are valuable.

One of the problems is shown by a pair of tests: testComplexSearchPattern1 and testComplexSearchPattern2. The query condition in the first one gets optimized correctly. The condition in the second one is logically equivalent, but the first clause storage eq 1 is simplified - it is not wrapped in a (redundant) AND operator. This one is not optimized as expected.

Apparently the code in https://github.com/nextcloud/server/blob/34c9bfc767e1bafa7c708318e4050a5f4c9a772b/lib/private/Files/Search/QueryOptimizer/MergeDistributiveOperations.php#L23 is not prepared for such cases.

Two additional tests mimic the query condition created in the QuerySearchHelper::applySearchConstraints method.

The testApplySearchConstraints1 method tries to behave exactly like the QuerySearchHelper code.

In the second one testApplySearchConstraints2 I tried to wrap the storage eq 2 clause in an AND operator, to make all elements of the higher-level OR homogenous. Unfortunately this did not help - the query is still not optimized as expected.

icewind1991 commented 5 months ago

Thanks for the testing, I'll try to look into things further.

icewind1991 commented 5 months ago

https://github.com/nextcloud/server/pull/43975 fixes those tests

XueSheng-GIT commented 5 months ago

@icewind1991 Thanks for following up! Just did some further testing and your additional pull https://github.com/nextcloud/server/pull/43975 seems to do the trick on my instances!

For the sake of completeness, here are the new results on NC28.0.3.

Patch Search time
Default (no patch): 35,697 ms
PR https://github.com/nextcloud/server/pull/37061 (updated to NC28) 345 ms
PR https://github.com/nextcloud/server/pull/40555 + https://github.com/nextcloud/server/pull/43975 359 ms

Logs to the search times above... query pattern_nc28_no-patch.log query pattern_nc28_PR37061.log query pattern_nc28_PR40555+43975.log

I would say nothing to complain anymore. Seems we can close this issue once https://github.com/nextcloud/server/pull/43975 is merged. A backport to NC28 would be welcome (although I'm already used to the manual patches 😉).

Thanks again @starypatyk and @icewind1991 for your great work!