The idea of the change is to substitute linear scan of extents table for lookup in it in block dedupe phase.
Here are the query explanations by sqlite:
$ sqlite3 /tmp/foo.db
sqlite> .eqp on
Before the change:
sqlite> SELECT extents.loff, len, poff, flags
FROM extents JOIN (
SELECT digest
FROM extents
GROUP BY digest
HAVING count(*) = 1) AS nondupe_extents
on extents.digest = nondupe_extents.digest
where extents.ino = ?1 and extents.subvol = ?2;
QUERY PLAN
|--CO-ROUTINE nondupe_extents
| `--SCAN extents USING COVERING INDEX idx_extent_digest
|--SEARCH extents USING INDEX idx_extents_inosub (ino=? AND subvol=?)
|--BLOOM FILTER ON nondupe_extents (digest=?)
`--SEARCH nondupe_extents USING AUTOMATIC COVERING INDEX (digest=?)
Note the SCAN extents part that does full database scan.
After the change:
sqlite> SELECT extents.loff, len, poff, flags
FROM extents
where extents.ino = ?1 and extents.subvol = ?2 and (
1 = (SELECT COUNT(*)
FROM extents as e
where e.digest = extents.digest));
QUERY PLAN
|--SEARCH extents USING INDEX idx_extents_inosub (ino=? AND subvol=?)
`--CORRELATED SCALAR SUBQUERY 1
`--SEARCH e USING COVERING INDEX idx_extent_digest (digest=?)
Note that there are no scans any more and all we do are index lookups here.
This turns the --dedupe-options=partial mode from quadratic complexity down to something closer to n*log(n).
The idea of the change is to substitute linear scan of
extents
table for lookup in it in block dedupe phase.Here are the query explanations by
sqlite
:Before the change:
Note the
SCAN extents
part that does full database scan.After the change:
Note that there are no scans any more and all we do are index lookups here.
This turns the
--dedupe-options=partial
mode from quadratic complexity down to something closer ton*log(n)
.The benchmark on real input:
/nix/store/.links
contains 2 million files (177G on disk).After
duperemove
run ot has the following stats:Before: real 154m21,916s (after the interruption)
After: real 11m45,184s user 16m54,377s sys 34m56,864s
I had no patience to wait for the unpatched run to finish and interrupted it after a few hours.