lotusdblabs / lotusdb

Most advanced key-value database written in Go, extremely fast, compatible with LSM tree and B+ tree.
https://lotusdblabs.github.io
Apache License 2.0
2.07k stars 180 forks source link

AutoCompact & CompactWithDeprecateTable Support #165

Closed KANIOYH closed 3 weeks ago

KANIOYH commented 2 months ago

AutoCompact & CompactWithDeprecateable

Add autoCompact support.

AutoCompact can automatically occur in the background. There are not implementation for IO state, only reserve space for it.

This feature adds a memory-based deprecated table (dptable), which marks deprecated data in the vlog based on the UUID of each operation, without the need to retrieve deprecated data through the bptree.

This feature mainly reduces the overhead of accessing the bptree.Reduced by approximately one order of magnitude. bptreeTime:219.758243 ms > dptableTime:24.207176 ms

We need to further discuss the IO status. It is initially believed that disk-level monitoring is superior to API hooks.

disk-level monitoring | maybe useful:github.com/shirou/gopsutil

//output:
[{"device":"C:","mountpoint":"C:","fstype":"NTFS","opts":"rw.compress"} {"device":"D:","mountpoint":"D:","fstype":"NTFS","opts":"rw.compress"} {"device":"E:","mountpoint":"E:","fstype":"NTFS","opts":"rw.compress"} ]
{"path":"E:","fstype":"","total":107380965376,"free":46790828032,"used":60590137344,"usedPercent":56.425398236866755,"inodesTotal":0,"inodesUsed":0,"inodesFree":0,"inodesUsedPercent":0}
map[C::{"readCount":0,"mergedReadCount":0,"writeCount":0,"mergedWriteCount":0,"readBytes":0,"writeBytes":4096,"readTime":0,"writeTime":0,"iopsInProgress":0,"ioTime":0,"weightedIO":0,"name":"C:","serialNumber":"","label":""} ]

Changes

Test Reuslt

According to a mixed read-write ratio of 5:1.

Compare

The results of comparing Compact and CompactWithDeprecatedable are as follows:

=== RUN   TestDBCompact/test_compaction
[Compact data]
(shard:0) bptreeTime:219.758243ms       readerTIme:331.78238ms  rewriteTime:96.310796ms all:820.391944ms
(shard:2) bptreeTime:232.646786ms       readerTIme:344.188357ms rewriteTime:91.904951ms all:858.445579ms
(shard:1) bptreeTime:257.153977ms       readerTIme:357.054894ms rewriteTime:82.546513ms all:886.131432ms
--- PASS: TestDBCompact (19.06s) # <-- ignore, it include data preload -->
    --- PASS: TestDBCompact/test_compaction (0.89s)

=== RUN   TestDBCompactWitchDeprecateable/test_compaction
[CompactWithDeprecatedable data]
(shard:2) dptableTime:24.207176ms       reader:297.85245ms      rewriteTime:89.254724ms all:562.317783ms
(shard:0) dptableTime:25.151011ms       reader:314.874621ms     rewriteTime:92.429257ms all:586.549001ms
(shard:1) dptableTime:28.363133ms       reader:368.892013ms     rewriteTime:88.781865ms all:647.749456ms
--- PASS: TestDBCompactWitchDeprecateable (18.73s) # <-- ignore, it include data preload -->
    --- PASS: TestDBCompactWitchDeprecateable/test_compaction (0.65s)

We can easily observe that bptreeTime is significantly higher than dptableTime, as it has a greater access overhead. Notably, the original compact has slightly lower overhead when modifying the B+ tree information a second time. This is likely because after the first read, there is a high probability that the cache entry exists in memory, which is a common characteristic of B+ trees.


TestDBAutoCompact

=== RUN   TestDBAutoCompact
[Compact data]
(shard:2) bptreeTime:  0s       readerTIme:26.373µs     rewriteTime:  0s        all:111.478µs
(shard:1) bptreeTime:  0s       readerTIme:23.372µs     rewriteTime:  0s        all:123.234µs
(shard:0) bptreeTime:  0s       readerTIme:150.454µs    rewriteTime:  0s        all:346.099µs
=== RUN   TestDBAutoCompact/test_compaction
[data in flush] deprecatedNumber: 0 LowerThreshold: 204800 UpperThreshold: 409600
[data in flush] deprecatedNumber: 42444 LowerThreshold: 204800 UpperThreshold: 409600
[data in flush] deprecatedNumber: 87332 LowerThreshold: 204800 UpperThreshold: 409600
[data in flush] deprecatedNumber: 134664 LowerThreshold: 204800 UpperThreshold: 409600
[data in flush] deprecatedNumber: 184440 LowerThreshold: 204800 UpperThreshold: 409600
[data in flush] deprecatedNumber: 226654 LowerThreshold: 204800 UpperThreshold: 409600
ArriveLowerThreshold
[CompactWithDeprecatedable data]
(shard:2) dptableTime:18.524262ms       reader:237.848375ms     rewriteTime:106.529464ms        all:468.550733ms
(shard:0) dptableTime:22.412305ms       reader:296.664449ms     rewriteTime:109.612276ms        all:546.467484ms
(shard:1) dptableTime:25.431771ms       reader:318.471562ms     rewriteTime:120.71244ms all:617.222454ms
[data in flush] deprecatedNumber: 44658 LowerThreshold: 204800 UpperThreshold: 409600
[data in flush] deprecatedNumber: 91759 LowerThreshold: 204800 UpperThreshold: 409600
--- PASS: TestDBAutoCompact (17.19s)
    --- PASS: TestDBAutoCompact/test_compaction (17.15s)
PASS
ok      github.com/lotusdblabs/lotusdb/v2       17.228s
KANIOYH commented 1 month ago

A Monitoring Scheme Based on I/O Time: Leveraging gopsutil, we can obtaindisk I/O information. We introduce a background listening coroutine that periodically retrieves the current I/O information. Users can input an I/O time proportion threshold rate to determine whether the current disk is busy. We sample I/O Time at a fixed interval,samplingInterval, where I/O Time represents the disk I/O execution time since the system was turned on. Two interval samples can provide the IOTime for that period. We then compare the size of rate * samplingInterval with IOTime to judge whether the current I/O is busy.

roseduan commented 1 month ago

感谢 PR,请修复对应的 comment,并且 rebase 一下代码。