mirage / irmin

Irmin is a distributed database that follows the same design principles as Git
https://irmin.org
ISC License

Collection of IO and GC improvements #2039

Open Ngoguey42 opened 2 years ago

Ngoguey42 commented 2 years ago

@metanivek and I brainstormed areas of improvement for GC and the new IO


Benchmark the impact of GC on main process performance

benchmark status: ongoing, @Ngoguey42

Bench 1: Evaluate the impact of a GC'ed pack store on replay speed. Could be done by comparing a gc-less replay starting from a non-GC'ed store versus a gc-less replay starting from a GC'ed store.

Bench 2: Evaluate the impact of the GC worker on the main store. Could be done by comparing a gc-less replay versus a replay that constantly has GCs running but never swaps.

Bench 1 will tell us whether "Add lower layer" is worth it (ignoring the fact that the upper/lower could live on different disks).

Bench 2 will tell us whether "Use a sequential traverse that visits pages only once" is worth it.

New stats for stats trace

benchmark status: ongoing, @Ngoguey42

Benchmark the impact of fsync on other processes depending on filesystem

benchmark status: blocked, needs GC benchmark

Some filesystems may block all operations from all processes during an fsync


Review uses of finalise

code-correctness status: TODO

Need to close FDs on error

https://github.com/mirage/irmin/issues/1957
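
A minimal sketch of the "always close on error" pattern, using only the stdlib. `with_resource` and `read_first_byte` are hypothetical names, not irmin-pack functions, and the example uses a stdlib channel (which owns a file descriptor) rather than the actual IO layer. With the `unix` library, the same shape would presumably wrap `Unix.openfile` and `Unix.close`.

```ocaml
(* Hypothetical helper, not part of irmin-pack.
   [release] always runs, whether [f] returns normally or raises. *)
let with_resource ~acquire ~release f =
  let r = acquire () in
  Fun.protect ~finally:(fun () -> release r) (fun () -> f r)

(* Example: the channel (and its underlying FD) is closed even when
   [input_byte] raises [End_of_file] on an empty file. *)
let read_first_byte path =
  with_resource
    ~acquire:(fun () -> open_in_bin path)
    ~release:close_in_noerr
    (fun ic -> input_byte ic)
```

`Fun.protect` re-raises the exception from `f` after running `finally`, so callers still see the original error.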

Improve crash consistency

code-correctness status: large unscheduled work

https://github.com/mirage/irmin/issues/2082

Catch decoding errors to reraise/return clean exceptions/errors

code-correctness status: TODO

Some irmin-pack functions are expected not to raise, but when they use repr's decode_bin we don't catch its exceptions.

Let's check the whole file stack + GC code for such errors.

See: https://github.com/mirage/irmin/blob/main/src/irmin-pack/unix/traverse_pack_file.ml#L209
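
A rough sketch of catching a decoder's exception at the boundary and returning a clean error instead. Everything here is a stand-in: `decode_raw` plays the role of a raising decoder, `decode_entry` and the `` `Decoding_error `` tag are hypothetical names, and the set of exceptions that repr's decoders can actually raise would need to be checked before choosing what to catch.

```ocaml
type error = [ `Decoding_error of string ]

(* Stand-in for a raising decoder: [String.get_int32_be] raises
   [Invalid_argument] when fewer than 4 bytes are available at [off]. *)
let decode_raw buf off = String.get_int32_be buf off

(* Catch the decoder's exception here so that callers only see a result. *)
let decode_entry buf off : (int32, [> error ]) result =
  match decode_raw buf off with
  | v -> Ok v
  | exception Invalid_argument msg -> Error (`Decoding_error msg)
```

The `match ... with exception` form keeps the success path outside the handler, so only the decoder call itself is covered.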


Log to disk all the important activities on a pack store

forensic status: unscheduled and low priority

We could add a parameter to irmin-pack's repo that could default to journaling=false. We set journaling=true in Tezos.

Maybe using logs.

https://github.com/mirage/irmin/issues/1856


Change GC algorithm to perform graph traversal from high to low offsets

reduce-gc-impact-on-main-process, improve-gc-worker-runtime status: Ongoing, @art-w

https://github.com/mirage/irmin/pull/2085
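
To illustrate the idea only (this is not the code from the PR): assuming every object references only objects at strictly lower offsets, always processing the highest pending offset next visits each reachable object exactly once and in decreasing offset order. `preds` is a hypothetical function returning the offsets referenced by the object at a given offset.

```ocaml
module Int_set = Set.Make (Int)

(* [preds off] (hypothetical) lists the offsets referenced by the object at
   [off]; they are assumed to be strictly lower than [off]. *)
let traverse_high_to_low ~preds ~roots visit =
  let rec loop pending =
    match Int_set.max_elt_opt pending with
    | None -> ()
    | Some off ->
        visit off;
        let pending = Int_set.remove off pending in
        (* Newly discovered offsets are all below [off], so [off] can never
           be queued again. *)
        loop
          (List.fold_left (fun acc p -> Int_set.add p acc) pending (preds off))
  in
  loop (Int_set.of_list roots)
```

The pending set doubles as the deduplication structure, so no separate "visited" table is needed under that assumption.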

Change GC algorithm to visit disk pages at most once and disable page-cache for it

reduce-gc-impact-on-main-process, improve-gc-worker-runtime status: large unscheduled work

See initial work: https://github.com/Ngoguey42/segment_hangzhou/blob/34e300b94e1dbf01ab3f04e7667bbef604ae21e4//traverse.ml

Avoid suffix copies in GC

improve-disk-usage status: large work, scheduled for Q4

How to not block finalize with unlink

reduce-gc-impact-on-main-process status: unscheduled

https://github.com/mirage/irmin/issues/2091

Remove dead files on store opening

improve-disk-usage status: Ongoing, @art-w

Filter the LRU instead of completely clearing it.

reduce-gc-impact-on-main-process status: unscheduled, need new benchmark to investigate further

First attempt wasn't conclusive: https://github.com/mirage/irmin/pull/1993
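
For reference, the difference between clearing and filtering can be sketched on a plain `Hashtbl` standing in for the LRU (no eviction order is modelled). `filter_cache` and the `keep` predicate are hypothetical, and how to decide that an entry survives a GC is exactly the part that would need care in the real code.

```ocaml
(* Stand-in cache: offset -> cached value. [keep off] (hypothetical) says
   whether the entry at [off] is still valid after the GC. *)
let filter_cache cache ~keep =
  Hashtbl.filter_map_inplace
    (fun off v -> if keep off then Some v else None)
    cache
```

Compared to `Hashtbl.reset`, this walks every entry once, which may be part of why filtering was not a clear win over clearing.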


Add back a non-forking GC

new-gc-use-case status: unscheduled

https://github.com/mirage/irmin/issues/2000

Add lower layer

new-gc-use-case status: large work, scheduled for Q4/Q1

For archive nodes.


Remove Lwt from the low-level (start/finalise) GC code

improve-code-quality status: experimental implementation

Keep Lwt in the high-level API.

https://github.com/mirage/irmin/pull/2064

Remove exceptions from the low-level (start/finalise) GC code

improve-code-quality status: partial implementation

Keep exceptions in the high-level API.

Done for GC worker: https://github.com/mirage/irmin/pull/2065

Rename gc.ml to gc_worker.ml and move GC code out of ext.ml to a new gc.ml

improve-code-quality status: done

https://github.com/mirage/irmin/pull/2063

Remove disk-specific functions from irmin-pack/s.ml

improve-code-quality status: done

https://github.com/mirage/irmin/pull/2081 https://github.com/mirage/irmin/pull/2084

Move dead header size handling logic from append_only to dispatcher

improve-code-quality status: to brainstorm

Evaluate where code documentation is missing

improve-code-quality status: Partially done

E.g. #1960

First batch: https://github.com/mirage/irmin/pull/2051

Remove mapping_consumers.

improve-code-quality status: done

#2062

Improve error handling in new code

improve-code-quality

Make offsets abstract or private

improve-code-quality status: TODO

https://github.com/mirage/irmin/issues/1954
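
One possible shape for this, as a sketch only: a `private int` type lets callers read an offset as an `int` through a coercion but forces construction through a checked function. The module name `Offset`, its functions and the `` `Negative_offset `` error are hypothetical and not taken from the codebase.

```ocaml
module Offset : sig
  type t = private int

  val of_int : int -> (t, [> `Negative_offset of int ]) result
  val compare : t -> t -> int
end = struct
  type t = int

  (* The only way to build an offset from outside the module. *)
  let of_int i = if i < 0 then Error (`Negative_offset i) else Ok i
  let compare = Int.compare
end

(* Reading is free through a coercion; building an [Offset.t] from a bare
   [int] without [of_int] does not type-check. *)
let to_int (off : Offset.t) = (off :> int)
```

A fully abstract type would additionally hide the representation from readers, at the cost of explicit accessors everywhere.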

Make auto-flushes type safe

improve-code-quality status: done

https://github.com/mirage/irmin/pull/2051#discussion_r953942846 https://github.com/mirage/irmin/pull/2088

zshipko commented 2 years ago

I believe that "Filter the LRU instead of completely clearing it" is what was tried here: https://github.com/mirage/irmin/pull/1993 but it wasn’t necessarily an improvement over just clearing the LRU.