nextcloud / files_antivirus

👾 Antivirus app for Nextcloud Files
https://apps.nextcloud.com/apps/files_antivirus
GNU Affero General Public License v3.0
84 stars 35 forks source link

Slow SQL query - proposal for improvement #177

Closed bjoernwuest closed 1 year ago

bjoernwuest commented 3 years ago

Hello,

I am running Nextcloud 19.0.4 on local machine with PHP7.4 and PostgreSQL 12.5

Both, NC and PG with data base files are hosted on RAID1 SSD connected via PCIe3.0 .

When running the cron.php, there is one very long running SQL query that takes approx. 17 hours to complete.

SELECT "fileid", "storage" FROM "oc_filecache" "fc" WHERE ("fileid" NOT IN (SELECT "fileid" FROM "oc_files_antivirus")) AND ("mimetype" <> '2') AND ("path" LIKE 'files/%') AND ("fc"."size" <> '0') LIMIT 1000;

If I understand this query correctly, I herewith propose to replace it with the following query, which does return in approx. 2 seconds:

SELECT "fc"."fileid", "fc"."storage" FROM "oc_filecache" "fc" LEFT JOIN "oc_files_antivirus" "fa" ON "fc"."fileid" = "fa"."fileid" WHERE "fa"."fileid" IS NULL AND "fc"."mimetype" <> 2 AND "fc"."path" LIKE 'files/%' AND "fc"."size" <> '0' LIMIT 1000;

kesselb commented 3 years ago

cc @rullzer

rullzer commented 3 years ago

I moved this to the files_antivirus. As it is originating from here.

Would you be up to try to do a PR that fixes this?

xkill commented 3 years ago

Hi, I was checking it and it should not be a big difference between NOT IN () and LEFT JOIN NULL:

> select count(*) from oc_filecache;
+----------+
| count(*) |
+----------+
|  1160393 |
+----------+
1 row in set (0.205 sec)

> select count(*) from oc_files_antivirus;
+----------+
| count(*) |
+----------+
|    24983 |
+----------+
1 row in set (0.004 sec)

MariaDB [owncloud]> SHOW VARIABLES LIKE 'query_cache_type'; +------------------+-------+ | Variable_name | Value | +------------------+-------+ | query_cache_type | OFF | +------------------+-------+ 1 row in set (0.001 sec)


* NOT IN:

MariaDB [owncloud]> SELECT fileid, storage FROM oc_filecache fc WHERE (fileid NOT IN (SELECT fileid FROM oc_files_antivirus)) AND (mimetype <> '2') AND (path LIKE 'files/%') AND (fc.size <> '0') LIMIT 1000; [...] 1000 rows in set (0.068 sec)

MariaDB [owncloud]> EXPLAIN EXTENDED SELECT fileid, storage FROM oc_filecache fc WHERE (fileid NOT IN (SELECT fileid FROM oc_files_antivirus)) AND (mimetype <> '2') AND (path LIKE 'files/%') AND (fc.size <> '0') LIMIT 1000; +------+--------------+--------------------+-------+---------------+------------+---------+------+---------+----------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +------+--------------+--------------------+-------+---------------+------------+---------+------+---------+----------+-------------+ | 1 | PRIMARY | fc | ALL | NULL | NULL | NULL | NULL | 1114737 | 100.00 | Using where | | 2 | MATERIALIZED | oc_files_antivirus | index | PRIMARY | check_time | 4 | NULL | 22798 | 100.00 | Using index | +------+--------------+--------------------+-------+---------------+------------+---------+------+---------+----------+-------------+ 2 rows in set, 1 warning (0.000 sec)


* JOIN:

MariaDB [owncloud]> SELECT fc.fileid, fc.storage FROM oc_filecache fc LEFT JOIN oc_files_antivirus fa ON fc.fileid = fa.fileid WHERE fa.fileid IS NULL AND fc.mimetype <> 2 AND fc.path LIKE 'files/%' AND fc.size <> '0' LIMIT 1000; [...] 1000 rows in set (0.081 sec)

MariaDB [owncloud]> EXPLAIN EXTENDED SELECT fc.fileid, fc.storage FROM oc_filecache fc LEFT JOIN oc_files_antivirus fa ON fc.fileid = fa.fileid WHERE fa.fileid IS NULL AND fc.mimetype <> 2 AND fc.path LIKE 'files/%' AND fc.size <> '0' LIMIT 1000; +------+-------------+-------+--------+---------------+---------+---------+--------------------+---------+----------+--------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +------+-------------+-------+--------+---------------+---------+---------+--------------------+---------+----------+--------------------------------------+ | 1 | SIMPLE | fc | ALL | NULL | NULL | NULL | NULL | 1114737 | 100.00 | Using where | | 1 | SIMPLE | fa | eq_ref | PRIMARY | PRIMARY | 8 | owncloud.fc.fileid | 1 | 100.00 | Using where; Using index; Not exists | +------+-------------+-------+--------+---------------+---------+---------+--------------------+---------+----------+--------------------------------------+ 2 rows in set, 1 warning (0.000 sec)



I ran the queries several times to ensure that the times are ok, and all the results are very close the the results above:
* `NOT IN` is a little bit faster than `LEFT JOIN NULL`

More info:
https://explainextended.com/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql/

@champli can you run the queries again disabling the query cache? Can you run the queries without the cache disabled and provide the result of the `EXPLAIN query`?
(I'm not sure, but maybe I added a index into the antivirus files you don't have)

EDIT: NOTE:
I'm using MariaDB 10.5.8.
On a shared hosting with several different databases running on a container into an i7 with RAID 10 SATA. note that the VM with more IO is a "Suricata -> Elastic search" and a graylog instance collecting logs from >20 VM on a live environment.
bjoernwuest commented 3 years ago

Hi,

I have done the steps as you have done, but with the difference that my NC runs on postgres. To my rare knowledge, postgresql does not has a query cache to disable / flush.

Here are the results.

SELECT count(*) FROM oc_filecache;

=> 569491

SELECT count(*) FROM oc_files_antivirus;

=> 444047

(the JOIN variant I had proposed) EXPLAIN(buffers, analyse, verbose) SELECT fc.fileid, fc.storage FROM oc_filecache AS fc LEFT JOIN oc_files_antivirus AS fa ON fc.fileid = fa.fileid WHERE fa.fileid IS NULL AND fc.mimetype <> 2 AND fc.path LIKE 'files/%' AND fc.size <> '0' LIMIT 1000;

Limit (cost=0.85..1494.80 rows=1000 width=16) (actual time=120.139..631.497 rows=504 loops=1) " Output: fc.fileid, fc.storage" Buffers: shared hit=197445 read=37489 written=46 -> Merge Anti Join (cost=0.85..130587.41 rows=87410 width=16) (actual time=120.138..631.442 rows=504 loops=1) " Output: fc.fileid, fc.storage" Merge Cond: (fc.fileid = fa.fileid) Buffers: shared hit=197445 read=37489 written=46 -> Index Scan using oc_filecache_pkey on public.oc_filecache fc (cost=0.42..104908.14 rows=396847 width=16) (actual time=0.094..368.406 rows=434402 loops=1) " Output: fc.fileid, fc.storage, fc.path, fc.path_hash, fc.parent, fc.name, fc.mimetype, fc.mimepart, fc.size, fc.mtime, fc.storage_mtime, fc.encrypted, fc.unencrypted_size, fc.etag, fc.permissions, fc.checksum" Filter: ((fc.mimetype <> 2) AND ((fc.path)::text ~~ 'files/%'::text) AND (fc.size <> '0'::bigint)) Rows Removed by Filter: 135089 Buffers: shared hit=155812 read=35880 written=44 -> Index Only Scan using oc_files_antivirus_pkey on public.oc_files_antivirus fa (cost=0.42..20483.06 rows=443890 width=4) (actual time=0.017..156.265 rows=444047 loops=1) Output: fa.fileid Heap Fetches: 444047 Buffers: shared hit=41633 read=1609 written=2 Planning Time: 0.494 ms Execution Time: 631.577 ms

(a slightly more advanced version of the JOIN variant that shall save some bytes of memory but not runtime) EXPLAIN(buffers, analyse, verbose) SELECT fc.fileid, fc.storage FROM (SELECT fileid, storage FROM oc_filecache WHERE mimetype <> 2 AND path LIKE 'files/%' AND size <> '0') AS fc LEFT JOIN oc_files_antivirus AS fa ON fc.fileid = fa.fileid WHERE fa.fileid IS NULL LIMIT 1000;

Limit (cost=0.85..1494.80 rows=1000 width=16) (actual time=117.159..629.489 rows=504 loops=1) " Output: oc_filecache.fileid, oc_filecache.storage" Buffers: shared hit=195126 read=39808 written=7 -> Merge Anti Join (cost=0.85..130587.41 rows=87410 width=16) (actual time=117.158..629.436 rows=504 loops=1) " Output: oc_filecache.fileid, oc_filecache.storage" Merge Cond: (oc_filecache.fileid = fa.fileid) Buffers: shared hit=195126 read=39808 written=7 -> Index Scan using oc_filecache_pkey on public.oc_filecache (cost=0.42..104908.14 rows=396847 width=16) (actual time=0.082..366.308 rows=434402 loops=1) " Output: oc_filecache.fileid, oc_filecache.storage, oc_filecache.path, oc_filecache.path_hash, oc_filecache.parent, oc_filecache.name, oc_filecache.mimetype, oc_filecache.mimepart, oc_filecache.size, oc_filecache.mtime, oc_filecache.storage_mtime, oc_filecache.encrypted, oc_filecache.unencrypted_size, oc_filecache.etag, oc_filecache.permissions, oc_filecache.checksum" Filter: ((oc_filecache.mimetype <> 2) AND ((oc_filecache.path)::text ~~ 'files/%'::text) AND (oc_filecache.size <> '0'::bigint)) Rows Removed by Filter: 135089 Buffers: shared hit=153707 read=37985 written=7 -> Index Only Scan using oc_files_antivirus_pkey on public.oc_files_antivirus fa (cost=0.42..20483.06 rows=443890 width=4) (actual time=0.010..155.933 rows=444047 loops=1) Output: fa.fileid Heap Fetches: 444047 Buffers: shared hit=41419 read=1823 Planning Time: 0.267 ms Execution Time: 629.556 ms

(the original from NC) EXPLAIN(buffers, analyse, verbose) SELECT fileid, storage FROM oc_filecache AS fc WHERE (fileid NOT IN (SELECT fileid FROM oc_files_antivirus)) AND (mimetype <> '2') AND (path LIKE 'files/%') AND (fc.size <> '0') LIMIT 1000;

sql1 sql2 (see attached images sql1.png and sql2.png since pgadmin did not permit copy&paste)

Since you had added a link to a comparison of those solutions in MySQL / MariaDB, I have done the same for postgres: https://explainextended.com/2009/09/16/not-in-vs-not-exists-vs-left-join-is-null-postgresql/

There, in the NOT IN section in the last two paragraphes, you can find:

"Since PostgreSQL cannot flush a hashed subplan onto the disk, it will estimate the subquery size, and if it decides that it will not fit into work_mem, it will resort to using a mere subplan which will be either executed for each row from t_left, or materialized and the rows will be searched for in a loop. This can be very bad since the optimizer will just change the plan as t_right will overgrow a certain limit and one day the query will just become very slow without any apparent reason."

Exactly this is what happens in my situation: the "NOT IN SELECT ..." gets executed ~225.000 times!

Thus, your query seems absolutely fine for MySQL/MariaDB but unfortunately not for postgres.

kesselb commented 3 years ago

Here is the code: https://github.com/nextcloud/files_antivirus/blob/master/lib/BackgroundJob/BackgroundScanner.php#L266

The getToRescanFiles method is already using a join. The query builder has a leftJoin method you can use. Looks fine to me to use a left join here.