markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0
689 stars 75 forks source link

`--dedupe-options=partial`: avoid quadratic slowdown on extent count #324

Closed trofi closed 8 months ago

trofi commented 8 months ago

The idea of the change is to substitute linear scan of extents table for lookup in it in block dedupe phase.

Here are the query explanations by sqlite:

$ sqlite3 /tmp/foo.db
sqlite> .eqp on

Before the change:

sqlite> SELECT extents.loff, len, poff, flags
       FROM extents JOIN (
         SELECT digest
         FROM extents
         GROUP BY digest
         HAVING count(*) = 1) AS nondupe_extents
       on extents.digest =  nondupe_extents.digest
       where extents.ino = ?1 and extents.subvol = ?2;

QUERY PLAN
|--CO-ROUTINE nondupe_extents
|  `--SCAN extents USING COVERING INDEX idx_extent_digest
|--SEARCH extents USING INDEX idx_extents_inosub (ino=? AND subvol=?)
|--BLOOM FILTER ON nondupe_extents (digest=?)
`--SEARCH nondupe_extents USING AUTOMATIC COVERING INDEX (digest=?)

Note the SCAN extents part that does full database scan.

After the change:

sqlite> SELECT extents.loff, len, poff, flags
        FROM extents
        where extents.ino = ?1 and extents.subvol = ?2 and (
          1 = (SELECT COUNT(*)
              FROM extents as e
              where e.digest = extents.digest));

QUERY PLAN
|--SEARCH extents USING INDEX idx_extents_inosub (ino=? AND subvol=?)
`--CORRELATED SCALAR SUBQUERY 1
   `--SEARCH e USING COVERING INDEX idx_extent_digest (digest=?)

Note that there are no scans any more and all we do are index lookups here.

This turns the --dedupe-options=partial mode from quadratic complexity down to something closer to n*log(n).

The benchmark on real input:

$ SQLITE_TMPDIR=/tmp \
./duperemove \
    -rd \
    \
    --batchsize=100000000 \
    --dedupe-options=partial,same \
    --hashfile=/tmp/foo.db
/nix/store/.links

/nix/store/.links contains 2 million files (177G on disk).

After duperemove run ot has the following stats:

$ sqlite3 /tmp/foo.db
sqlite> select count(*) from extents;
1867387
sqlite> select count(*) from files;
1962135

Before: real 154m21,916s (after the interruption)

After: real 11m45,184s user 16m54,377s sys 34m56,864s

I had no patience to wait for the unpatched run to finish and interrupted it after a few hours.

JackSlateur commented 8 months ago

Thank you!