tatsuya6502 opened this issue 10 years ago
I'll replace the local log files and live-hunk location files with a single metadata DB. I'll use an embedded DB such as HanoiDB or LevelDB for it, so that some portion of the metadata can be unloaded from RAM. I'll continue to use long-term log files to store the value part of key-values. I believe these embedded DBs are not good at handling large binary values, so I want to keep that part as is.
The scavenger will be merged into the write-back process, and the checkpoint process will no longer exist. Both write-back and scavenger (aka compactor) will be executed sequentially by a single process to avoid race conditions between them (like the one causing hibari#33).
Note that the compactor will no longer update the (in-memory) ETS table for the metadata after it moves live hunks to a long-term log file. Instead, it will only update the (on-disk) metadata DB. This will reduce the performance impact the current compactor has.
When a get request fails to locate the value because the value has been moved by the compactor and the ETS table still has the stale location info, the brick server will read the updated location from the metadata DB, refresh the ETS entry, and finally read the value and return it to the client.
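The fallback path described above can be sketched as follows. This is a minimal Python sketch for illustration only; the class and field names are my assumptions, not Hibari's actual Erlang API.

```python
# Hypothetical sketch of the read path: consult the in-memory location
# cache (stand-in for the ETS table) first; if the location is stale
# because the compactor moved the hunk and only updated the on-disk
# metadata DB, re-read the location from the DB and refresh the cache.

class BrickReadPath:
    def __init__(self):
        self.ets_cache = {}      # stand-in for the in-memory ETS table
        self.metadata_db = {}    # stand-in for the on-disk metadata DB
        self.hlog = {}           # location -> value blob

    def get(self, key):
        loc = self.ets_cache.get(key)
        if loc is None or loc not in self.hlog:
            # Stale or missing location: read the updated location
            # from the metadata DB and refresh the ETS entry.
            loc = self.metadata_db[key]
            self.ets_cache[key] = loc
        return self.hlog[loc]
```

The point of the design is visible here: the compactor only writes to `metadata_db`, and readers lazily repair `ets_cache` on a miss, so compaction never has to touch the hot in-memory table.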
Issues
- Rename operation broke the scavenger process (hibari#33)
- The scavenger process has a major performance impact on read/write operations
- Metadata for all key-values has to be kept in memory
As for the upcoming release (v0.3.0), only issue #1 above will be addressed. However, a major rework of storage is being done for v0.3.0, and this will help future releases address the other issues above. Also, v0.3.0 will no longer have the checkpoint operation, and the scavenger steps are reorganized for efficiency.
Disk Storage for v0.3.0
{key(), reversed_timestamp()}

encoded by sext. (v0.3.0 will use 0 for the reversed timestamp, which is older than any actual time.) Key-value versioning will work well for wide-area replication with a causal+ consistency level [*1].

*1: Paper: Don't Settle For Eventual: Scalable Causal Consistency For Wide-Area Storage With COPS
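To illustrate why a reversed timestamp is useful here (a Python sketch of the ordering property only, not the actual sext wire encoding): sext preserves Erlang term order in the encoded bytes, so sorting {key, reversed_timestamp} pairs puts the newest version of each key first in a forward scan. `MAX_TS` and `db_key` below are my own illustrative names.

```python
# Illustrative sketch: with reversed_ts = MAX_TS - ts, the newest
# version of each key sorts first, so a forward scan over the metadata
# DB finds the latest version immediately.

MAX_TS = 2**63 - 1

def db_key(key: bytes, ts: int) -> tuple:
    return (key, MAX_TS - ts)   # reversed timestamp

entries = [db_key(b"apple", 100), db_key(b"apple", 300), db_key(b"banana", 200)]
entries.sort()
# For b"apple", the ts=300 (newest) version sorts before the ts=100 version.
assert entries[0] == (b"apple", MAX_TS - 300)
```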
Status as of January 13th, 2014:
Almost finished the metadata DB part. Once finished, I will work on the brick private value blob store. Diff between the dev HEAD and the topic branch HEAD: https://github.com/hibari/gdss-brick/compare/6eec70727b...5901f65499
(Note: scavenger related code has been moved to a new module, brick_hlog_scavenger.)
Started to work on the following items. Added various modules in this commit to implement them: https://github.com/hibari/gdss-brick/commit/b5fba54a03
gdss-brick >> gh17 Redesign disk storage:
- Introduce new hunk log format (brick_hlog_hunk module):
- Unlike the generic gmt_hlog format, the new format is specialized for Hibari's usage. It has four hunk types:
- metadata (for WAL files, many metadata blobs in one hunk)
- blob_wal (for WAL files, one value blob in one hunk)
- blob_single (for brick private files, one value blob in one hunk)
- blob_multi (for brick private files, many value blobs in one hunk)
- To make WAL (Write Ahead Log) write-back process simpler, these WAL hunks have a dedicated 'brick_name' field.
- Add API to read a value blob from a blob_* hunk without parsing the hunk header.
- Introduce new WAL and store file modules that will replace the old gmt_hlog* modules. (in progress)
- brick_hlog_wal - provides access to the WAL files including group committing.
- brick_hlog_writeback - does write back from a WAL file to the metadata and blob stores.
- brick_hlog_scavenger - (was introduced in an earlier commit) reclaims unused space from hunk log based store files.
- brick_metadata_store - is the common interface of metadata store.
- brick_blob_store - is the common interface of value blob store.
- brick_metadata_store_leveldb - is a LevelDB implementation of the metadata store.
- brick_blob_store_hlog - is a hunk log implementation of the value blob store.
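As a rough illustration of why the dedicated brick_name field simplifies write-back (a hypothetical Python sketch; the hunk fields and the routing helper below are my assumptions, not the actual module API): the write-back process can group WAL hunks by brick and apply each group to that brick's own metadata and blob stores, with no extra bookkeeping about which brick wrote what.

```python
# Hypothetical sketch: route WAL hunks to per-brick stores using the
# brick_name field carried by every metadata and blob_wal hunk.

from collections import defaultdict

def write_back(wal_hunks):
    """Group WAL hunks per brick, ready to apply to that brick's stores."""
    per_brick = defaultdict(lambda: {"metadata": [], "blob_wal": []})
    for hunk in wal_hunks:
        per_brick[hunk["brick_name"]][hunk["type"]].append(hunk["payload"])
    return per_brick

wal = [
    {"brick_name": "perf1_ch1_b1", "type": "metadata", "payload": b"m1"},
    {"brick_name": "perf1_ch1_b1", "type": "blob_wal", "payload": b"v1"},
    {"brick_name": "perf1_ch1_b2", "type": "metadata", "payload": b"m2"},
]
routed = write_back(wal)
assert routed["perf1_ch1_b1"]["blob_wal"] == [b"v1"]
```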
DETAILS: ...
- brick private, metadata DB (LevelDB)
  - Add metadata DB and write-back process for it. -- DONE
  - Remove brick private hlog files. -- DONE
  - Remove checkpoint process and shadow ETS table. -- DONE except updating test cases.
  - Update brick_ets to load metadata records (store tuples) from metadata DB. -- DONE
I replaced LevelDB with HyperLevelDB, which is a fork and drop-in replacement of LevelDB with two key improvements: finer-grained locking for better parallelism between concurrent writers, and an improved compaction implementation.

Hibari will not get much benefit from the first point because it uses a single writer (the WAL write-back process) per brick metadata DB, but it will get some benefit from the second. I loaded 1 million key-values into a Hibari table with 12 bricks and chain length 1, and so far, so good.
I open-sourced the Erlang binding to HyperLevelDB. I'll update brick server's code to utilize it. https://github.com/leveldb-erlang/h2leveldb
https://github.com/hibari/gdss-brick/issues/17#issuecomment-58850121

> I open-sourced the Erlang binding to HyperLevelDB. I'll update brick server's code to utilize it. https://github.com/leveldb-erlang/h2leveldb

Done.
https://github.com/hibari/gdss-brick/issues/17#issuecomment-32705058

> - common-log files (write ahead log)
>   - Change internal format for more efficient and controlled write-back process.
> - brick private, value blob store (long-term hlog files)
>   - Change long-term logs from shared to brick private.
Finished implementing the new hlog format with the following key enhancements:

- A value blob can be read with a single file:pread/3 call. (This eliminated ~50 lines of Erlang code for reading.)
- A value blob is returned as-is from the file:pread/3 call; no unpacking is needed. Note that, as the default setting, no MD5 checking will be performed on each read. Instead, the scavenger and a background scrub process will do it when scanning an entire hlog file.

There are four hunk types: metadata, blob_wal, blob_single and blob_multi. metadata and blob_wal are used in the WAL. blob_single and blob_multi are used in brick private blob storage; the former is for a larger blob (e.g. > 4KB) and the latter is for smaller blobs.

In January, I implemented a gen_server for writing the WAL from scratch. Next step will be to implement the write-back process from the WAL to the metadata DB (HyperLevelDB) and the value blob hlog file.
After that, I will update the brick_ets server to utilize the new hlog format for writing and reading. Then, finally, I will implement the scavenger (aka compaction) process from scratch.
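The size-based split between blob_single and blob_multi can be sketched like this. This is Python for illustration only; the 4 KB figure is the example from the thread, and the packing helper is my assumption, not the actual module code.

```python
# Sketch of the size-based hunk type choice: large blobs each get their
# own blob_single hunk, while smaller blobs are batched together into
# one blob_multi hunk.

SINGLE_THRESHOLD = 4 * 1024  # e.g. > 4KB, per the example in the thread

def pack_hunks(blobs):
    hunks, small = [], []
    for blob in blobs:
        if len(blob) > SINGLE_THRESHOLD:
            hunks.append(("blob_single", [blob]))   # one value blob per hunk
        else:
            small.append(blob)
    if small:
        hunks.append(("blob_multi", small))         # many value blobs per hunk
    return hunks

hunks = pack_hunks([b"a" * 10, b"b" * 8192, b"c" * 20])
assert hunks == [("blob_single", [b"b" * 8192]),
                 ("blob_multi", [b"a" * 10, b"c" * 20])]
```

Batching small blobs amortizes per-hunk header overhead, while keeping large blobs alone means each can still be fetched with a single positioned read.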
https://github.com/hibari/gdss-brick/issues/17#issuecomment-59618957

> Next step will be to implement write-back process from WAL to metadata DB (HyperLevelDB) and value blob hlog file.
> After that, I will update brick_ets server to utilize the new hlog format for writing and reading.
I started to work on the above. I actually started from the latter one, and Hibari can now bootstrap from the new hlog modules (but it's not very useful without the write-back process and scavenger).
Also, before I started to commit these work-in-progress changes, I created a git annotated tag 2014-10-metadatadb on the current branch head of the gdss_brick and gdss_admin projects. That revision was tested with basho_bench and had no errors in six-hour runs.
Merged recent changes on the dev branch (post v0.1.11) into gbrick-gh17-redesign-disk-storage branch.
https://github.com/hibari/gdss-brick/issues/17#issuecomment-59618957

> Next step will be to implement write-back process from WAL to metadata DB (HyperLevelDB) and value blob hlog file.
> After that, I will update brick_ets server to utilize the new hlog format for writing and reading.
>
> I started to work on the above. I actually started from the latter one and Hibari can now bootstrap from the new hlog modules. (but it's not very useful without the write-back process and scavenger.)
After a long pause (Oct 2014 -- May 2015), I resumed working on this topic (implementing the write-back process). I made a couple of commits on the topic branch of gdss_brick, and so far I have confirmed that all WAL hunks written to a WAL file can be parsed back into hunk records.
I'm trying to complete this task by the end of this month (May 2015).
https://github.com/hibari/gdss-brick/issues/17#issuecomment-32145192

> Status as of January 13th, 2014:
>
> - common-log files (write ahead log)
>   - Change internal format for more efficient and controlled write-back process. -- TODO
> - brick private, metadata DB (LevelDB)
>   - Add metadata DB and write-back process for it. -- DONE
>   - Remove brick private hlog files. -- DONE
>   - Remove checkpoint process and shadow ETS table. -- DONE except updating test cases.
>   - Update brick_ets to load metadata records (store tuples) from metadata DB. -- DONE
> - brick private, value blob store (long-term hlog files)
>   - Change long-term logs from shared to brick private. -- TODO
>   - Note: scavenger related code has been moved to a new module, brick_hlog_scavenger.
OK. I'm basically done with the above two TODO items. Now each key-value's metadata (key, timestamp, user-provided property list and expiration time) is stored in the brick private metadata DB (HyperLevelDB), and the value is stored in a brick private hlog file.
% find data/ -type f | sort
data/brick/bootstrap_copy1/blob/000000000001.BLOB
data/brick/bootstrap_copy1/metadata/leveldb/000003.sst
data/brick/bootstrap_copy1/metadata/leveldb/000006.sst
data/brick/bootstrap_copy1/metadata/leveldb/000008.log
data/brick/bootstrap_copy1/metadata/leveldb/CURRENT
data/brick/bootstrap_copy1/metadata/leveldb/LOCK
data/brick/bootstrap_copy1/metadata/leveldb/LOG
data/brick/bootstrap_copy1/metadata/leveldb/LOG.old
data/brick/bootstrap_copy1/metadata/leveldb/MANIFEST-000007
data/brick/bootstrap_copy1/metadata/leveldb/lost/...
data/brick/perf1_ch10_b1/blob/000000000001.BLOB
data/brick/perf1_ch10_b1/metadata/leveldb/000003.sst
data/brick/perf1_ch10_b1/metadata/leveldb/000006.sst
data/brick/perf1_ch10_b1/metadata/leveldb/000008.log
data/brick/perf1_ch10_b1/metadata/leveldb/CURRENT
data/brick/perf1_ch10_b1/metadata/leveldb/LOCK
data/brick/perf1_ch10_b1/metadata/leveldb/LOG
data/brick/perf1_ch10_b1/metadata/leveldb/LOG.old
data/brick/perf1_ch10_b1/metadata/leveldb/MANIFEST-000007
data/brick/perf1_ch10_b1/metadata/leveldb/lost/...
data/brick/perf1_ch10_b2/blob/000000000001.BLOB
data/brick/perf1_ch10_b2/metadata/leveldb/000003.sst
data/brick/perf1_ch10_b2/metadata/leveldb/000006.sst
data/brick/perf1_ch10_b2/metadata/leveldb/000008.log
data/brick/perf1_ch10_b2/metadata/leveldb/CURRENT
data/brick/perf1_ch10_b2/metadata/leveldb/LOCK
data/brick/perf1_ch10_b2/metadata/leveldb/LOG
data/brick/perf1_ch10_b2/metadata/leveldb/LOG.old
data/brick/perf1_ch10_b2/metadata/leveldb/MANIFEST-000007
data/brick/perf1_ch10_b2/metadata/leveldb/lost/...
...
data/wal_hlog/000000000002.HLOG
data/wal_hlog/000000000003.HLOG
Now the last big part will be re-implementing the scavenger (aka compaction process) from scratch. Hope I can finish it in two weeks.
I spent the last few days on the following:
The new compaction process should be much more efficient than the current scavenger implementation in v0.1 series. Here is the current design:
Also, I'm planning to store small values in the metadata DB (HyperLevelDB) rather than in the blob hlog files. HyperLevelDB has an efficient compaction implementation in C++, so I hope this design change will improve the overall compaction efficiency too.
As for the compaction process, I have implemented the following main functions:
-spec estimate_live_hunk_ratio(brickname(), seqnum()) -> {ok, float()} | {error, term()}.
-spec compact_hlog_file(brickname(), seqnum()) -> ok | {error, term()}.
The former estimates the ratio of live to dead blob hunks per hlog file by comparing randomly sampled keys against the metadata DB. The latter runs compaction on an hlog file to reclaim disk space and updates the storage locations of live hunks.
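The sampling-based estimation can be sketched in Python as follows (the helper below is my illustrative assumption; the real implementation works against HyperLevelDB and hlog files, and the thread later mentions sampling about 5% of keys per file):

```python
# Sketch of estimate_live_hunk_ratio: randomly sample keys recorded in
# one hlog file and check, against the metadata DB, whether each sampled
# key's current location still points into that file (identified by its
# sequence number). The live ratio is the fraction that still do.

import random

def estimate_live_hunk_ratio(hlog_keys, metadata_db, seqnum,
                             sample_rate=0.05, seed=None):
    rng = random.Random(seed)
    n = max(1, int(len(hlog_keys) * sample_rate))
    sample = rng.sample(hlog_keys, n)
    live = sum(1 for k in sample if metadata_db.get(k) == seqnum)
    return live / n
```

With sample_rate=1.0 the estimate is exact; at ~5% it is only a statistical estimate, which is why (as shown below) bricks holding identical data report slightly different ratios.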
Next step will be to implement periodical task to estimate the live hunk ratio, score hlog files, and pick an hlog file to run a compaction.
> Next step will be to implement periodical task to estimate the live hunk ratio, score hlog files, and pick an hlog file to run a compaction.
As the first step, I created a temporary function to estimate live hunk ratios of all value blob hlog files on a node (commit: https://github.com/hibari/gdss-brick/commit/53a697eea8525f354a01f2c7e6d6a4a8800ab648#diff-8bd48a5a77dd1a285b7729b3766317e0R110).
Here is a sample run on a single-node Hibari with perf1 table (chain length = 3, number of chains = 8):
(hibari@127.0.0.1)43> F = fun() -> lists:foreach(
(hibari@127.0.0.1)43> fun({B, S, unknown}) ->
(hibari@127.0.0.1)43> io:format("~s (~w): unknown~n", [B, S]);
(hibari@127.0.0.1)43> ({B, S, R}) ->
(hibari@127.0.0.1)43> io:format("~s (~w): ~.2f%~n", [B, S, R * 100])
(hibari@127.0.0.1)43> end, brick_blob_store_hlog_compaction:list_hlog_files())
(hibari@127.0.0.1)43> end.
#Fun<erl_eval.20.90072148>
(hibari@127.0.0.1)44> F().
bootstrap_copy1 (1): 21.43%
perf1_ch1_b1 (1): 4.91%
perf1_ch1_b1 (2): 26.89%
perf1_ch1_b1 (3): 65.67%
perf1_ch1_b2 (1): 3.35%
...
Here are the numbers for perf1 chain 1, ordered by hlog sequence number and brick:
perf1_ch1_b1 (1): 4.91%
perf1_ch1_b2 (1): 3.35%
perf1_ch1_b3 (1): 5.84%
perf1_ch1_b1 (2): 26.89%
perf1_ch1_b2 (2): 25.12%
perf1_ch1_b3 (2): 25.17%
perf1_ch1_b1 (3): 65.82%
perf1_ch1_b2 (3): 82.28%
perf1_ch1_b3 (3): 72.60%
All bricks (b1, b2, b3) in a chain should have exactly the same contents for each hlog file with sequence numbers (1, 2, 3). However, the estimated live hunk ratios differ (for example, 4.91%, 3.35%, 5.84%, or 65.82%, 82.28%, 72.60%). This is because the estimation is done with randomly sampled keys; it currently uses about 5% of the keys in an hlog file for the estimation.
I think the current setting still provides estimated ratios with good enough precision.
> Next step will be to implement periodical task to estimate the live hunk ratio, score hlog files, and pick an hlog file to run a compaction.
I have implemented a very basic version of the above. (The last commit was: https://github.com/hibari/gdss-brick/commit/de7c990b408b36b6b765c936c1800d5c56249cf0) There are lots of places to improve, but the new disk storage format now has a complete set of functions.
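The thread doesn't spell out the scoring policy, so the following is only one plausible Python sketch (the function name, the score, and the threshold are all my assumptions): score each hlog file by its estimated dead fraction and compact the file with the most reclaimable space first, skipping files that are still mostly live.

```python
# Hypothetical sketch of the periodic task's selection step: given
# (brick, seqnum, live_ratio) triples from the estimator, pick the file
# with the largest dead fraction, ignoring files whose estimate is
# unavailable or whose live ratio is too high to be worth compacting.

def pick_compaction_target(files, max_live_ratio=0.8):
    """files: list of (brick, seqnum, live_ratio or None)."""
    candidates = [(1.0 - r, b, s) for (b, s, r) in files
                  if r is not None and r < max_live_ratio]
    if not candidates:
        return None
    dead, brick, seq = max(candidates)
    return (brick, seq)

files = [("perf1_ch1_b1", 1, 0.0491), ("perf1_ch1_b1", 2, 0.2689),
         ("perf1_ch1_b1", 3, 0.9), ("bootstrap_copy1", 1, None)]
assert pick_compaction_target(files) == ("perf1_ch1_b1", 1)
```

In the sample numbers above, the oldest file (sequence 1, ~5% live) is the natural first target: compacting it reclaims ~95% of its space while relocating very few live hunks.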
I'll shift my focus to other tasks but will continue improving this feature. I'm planning to release Hibari v0.3 with this feature sometime this fall (2015).
I found and fixed a bug in the write-back process that would miss some log hunks in the WAL when group commit is enabled:
Commit: https://github.com/hibari/gdss-brick/commit/32428e411fb54defbf7600e07a4fb0531547d75b
https://github.com/hibari/gdss-brick/issues/17#issuecomment-59352027

> I open-sourced the Erlang binding to HyperLevelDB. I'll update brick server's code to utilize it. https://github.com/leveldb-erlang/h2leveldb

Done.
It seems HyperLevelDB is no longer actively developed; the last commit was made in Sep 2014. I'm thinking of switching to RocksDB, another fork of LevelDB. It is very actively developed and has a large user base.
I found and fixed a couple of bugs related to node restart (https://github.com/hibari/gdss-brick/compare/91b62e049e...9f9efe9052).
I also removed obsolete gmt_hlog* and scavenger modules from v0.3 stream (https://github.com/hibari/gdss-brick/commit/44fd692ca89e4104bf7a500b2ee9ed8ec8006d9a), and updated the app.src of gdss_brick application (https://github.com/hibari/gdss-brick/commit/e0e71ead05543b4127373ce968263d97ad6eda70).
Redesign and re-implement disk storage and maintenance processes to address the issues Hibari is having right now (v0.3RC1).
Issues
Disk Storage
Maintenance Processes