hibari / gdss-brick

http://hibari.github.com/hibari-doc/

Redesign disk storage and checkpoint/scavenger processes #17

Open tatsuya6502 opened 10 years ago

tatsuya6502 commented 10 years ago

Redesign and re-implement disk storage and maintenance processes to address the issues Hibari is having right now (v0.3RC1).

Issues

tatsuya6502 commented 10 years ago

Draft Design

Disk Storage

I'll replace the local log files and live-hunk location files with a single metadata DB. I'll use an embedded DB such as HanoiDB or LevelDB, which will let me unload some portion of the metadata from RAM. I'll continue to use long-term log files to store the value part of key-values; I believe these embedded DBs are not good at handling large binary values, so I want to keep that part as is.
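For illustration, here is a minimal sketch of that split, using the well-known eleveldb binding as a stand-in for whichever embedded DB is chosen. The module and function names are hypothetical, not Hibari code; the point is that only the blob's location goes into the DB, never the blob itself.

```erlang
%% Hypothetical sketch, not Hibari code. Metadata (here just the blob
%% location) lives in an embedded DB; the value blob itself stays in a
%% long-term hunk log file.
-module(metadata_db_sketch).
-export([open/1, put_location/4, get_location/2]).

open(Dir) ->
    eleveldb:open(Dir, [{create_if_missing, true}]).

%% Record where the value blob lives (log sequence number and byte
%% offset) instead of storing the large binary in the embedded DB.
put_location(Db, Key, SeqNum, Offset) ->
    eleveldb:put(Db, Key, term_to_binary({SeqNum, Offset}), []).

get_location(Db, Key) ->
    case eleveldb:get(Db, Key, []) of
        {ok, Bin}        -> {ok, binary_to_term(Bin)};
        not_found        -> key_not_exist;
        {error, _} = Err -> Err
    end.
```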

Maintenance Processes

The scavenger will be merged into the write-back process, and the checkpoint process will no longer exist. Both write-back and the scavenger (aka compactor) will be executed sequentially by a single process to avoid race conditions between them (like the one causing hibari#33).
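A minimal sketch of that sequencing, with placeholder names (do_write_back/1 and maybe_compact/1 are not real Hibari APIs): because one process runs both steps in order, the compactor can never observe a half-applied write-back.

```erlang
%% Sketch only: a single maintenance process alternates write-back and
%% compaction, so the two can never race (cf. hibari#33).
maintenance_loop(State0) ->
    State1 = do_write_back(State0),   %% placeholder: apply WAL entries
    State2 = maybe_compact(State1),   %% placeholder: reclaim dead hunks
    timer:sleep(1000),                %% illustrative interval
    maintenance_loop(State2).
```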

tatsuya6502 commented 10 years ago

Note that the compactor will no longer update the (in-memory) ETS table for the metadata after it moves live hunks to a long-term log file. Instead, it will only update the (on-disk) metadata DB. This will reduce the performance impact that the current compactor has.

When a get request fails to locate a value because the compactor has moved it and the ETS table holds stale location info, the brick server will read the updated location from the metadata DB, refresh the ETS entry, and finally read the value and return it to the client.
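A sketch of that fallback path, assuming hypothetical helpers read_blob/1 (the hlog reader) and lookup_location/2 (the metadata DB lookup); none of these names are Hibari's actual API:

```erlang
%% Hypothetical sketch of the read path described above.
get_value(Key, EtsTab, MetaDb) ->
    case ets:lookup(EtsTab, Key) of
        [{Key, Location}] ->
            case read_blob(Location) of
                {ok, Value} ->
                    {ok, Value};
                {error, not_found} ->
                    %% The compactor moved the hunk: fetch the fresh
                    %% location from the metadata DB, repair the ETS
                    %% entry, then retry the read once.
                    {ok, NewLoc} = lookup_location(MetaDb, Key),
                    true = ets:insert(EtsTab, {Key, NewLoc}),
                    read_blob(NewLoc)
            end;
        [] ->
            key_not_exist
    end.
```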

tatsuya6502 commented 10 years ago

Issues

  1. The rename operation broke the scavenger process (hibari#33)
  2. The scavenger process has a major performance impact on read/write operations
  3. Metadata for all key-values has to be kept in memory

As for the upcoming release (v0.3.0), only issue #1 above will be addressed. However, a major rework of the storage layer is being done for v0.3.0, and this will help future releases address the other issues above. Also, v0.3.0 will no longer have the checkpoint operation, and the scavenger steps are reorganized for efficiency.

Disk Storage for v0.3.0

[diagram not preserved in this copy]

*1: Paper: Don’t Settle For Eventual: Scalable Causal Consistency For Wide-Area Storage With COPS

tatsuya6502 commented 10 years ago

Status as of January 13th, 2014:

I have almost finished the metadata DB part. Once that is done, I will work on the brick-private value blob store. Diff between the dev HEAD and the topic branch HEAD: https://github.com/hibari/gdss-brick/compare/6eec70727b...5901f65499

tatsuya6502 commented 10 years ago

Started to work on the following items:

Added various modules in commit https://github.com/hibari/gdss-brick/commit/b5fba54a03 to implement the items below.

gdss-brick >> gh17 Redesign disk storage:

  • Introduce new hunk log format (brick_hlog_hunk module):
    • Unlike the generic gmt_hlog format, the new format is specialized for Hibari's usage. It has four hunk types (sketched below):
      • metadata (for WAL files; many metadata blobs in one hunk)
      • blob_wal (for WAL files; one value blob in one hunk)
      • blob_single (for brick private files; one value blob in one hunk)
      • blob_multi (for brick private files; many value blobs in one hunk)
    • To make the WAL (Write Ahead Log) write-back process simpler, the WAL hunk types have a dedicated 'brick_name' field.
    • Add an API to read a value blob from a blob_* hunk without parsing the hunk header.
  • Introduce new WAL and store file modules that will replace the old gmt_hlog* modules. (in progress)
    • brick_hlog_wal - provides access to the WAL files, including group commit.
    • brick_hlog_writeback - writes back from a WAL file to the metadata and blob stores.
    • brick_hlog_scavenger - (introduced in an earlier commit) reclaims unused space from hunk-log-based store files.
    • brick_metadata_store - the common interface of the metadata store.
    • brick_blob_store - the common interface of the value blob store.
    • brick_metadata_store_leveldb - a LevelDB implementation of the metadata store.
    • brick_blob_store_hlog - a hunk log implementation of the value blob store.

DETAILS: ...
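As a rough illustration of what the list above implies, a hunk might be represented like this. The field names are guesses for illustration only; see brick_hlog_hunk for the real definition.

```erlang
-type hunk_type() :: metadata | blob_wal | blob_single | blob_multi.

-record(hunk,
        { type       :: hunk_type(),
          %% lets the write-back process route a WAL hunk to the
          %% owning brick without extra bookkeeping
          brick_name :: atom(),
          %% metadata and blob_multi hunks carry many blobs;
          %% blob_wal and blob_single carry exactly one
          blobs      :: [binary()]
        }).
```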

tatsuya6502 commented 10 years ago
  • brick private, metadata DB (LevelDB)
    • Add metadata DB and write-back process for it. -- DONE
    • Remove brick private hlog files. -- DONE
    • Remove checkpoint process and shadow ETS table. -- DONE except updating test cases.
    • Update brick_ets to load metadata records (store tuples) from metadata DB. -- DONE

I replaced LevelDB with HyperLevelDB, which is a fork and drop-in replacement of LevelDB with two key improvements:

  • improved parallelism -- finer-grained internal locking gives higher throughput with multiple writer threads
  • improved compaction -- a different compaction method that achieves higher throughput

Hibari will not get much benefit from the first point because it uses a single writer (the WAL write-back process) per brick metadata DB, but it will get some benefit from the second point. I loaded 1 million key-values into a Hibari table with 12 bricks and chain length 1, and so far, so good.

tatsuya6502 commented 9 years ago

I open-sourced the Erlang binding to HyperLevelDB. I'll update the brick server's code to utilize it. https://github.com/leveldb-erlang/h2leveldb

tatsuya6502 commented 9 years ago

https://github.com/hibari/gdss-brick/issues/17#issuecomment-58850121

> I open-sourced the Erlang binding to HyperLevelDB. I'll update the brick server's code to utilize it. https://github.com/leveldb-erlang/h2leveldb

Done.

tatsuya6502 commented 9 years ago

https://github.com/hibari/gdss-brick/issues/17#issuecomment-32705058

>   • common-log files (write ahead log)
>     • Change the internal format for a more efficient and controlled write-back process.
>   • brick private, value blob store (long-term hlog files)
>     • Change long-term logs from shared to brick private.

Finished implementing the new hlog format with the following key enhancements:

In January, I implemented a gen_server for writing the WAL from scratch. The next step will be to implement the write-back process from the WAL to the metadata DB (HyperLevelDB) and the value blob hlog files.

After that, I will update the brick_ets server to use the new hlog format for writing and reading. Finally, I will re-implement the scavenger (aka compaction) process from scratch.
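For a rough idea of what a group-committing WAL writer looks like, here is a minimal gen_server sketch. It is not brick_hlog_wal; the module name, flush interval, and message shapes are all invented. Appends are buffered, fsync'ed in one batch, and every caller is replied to only after the batch is durable.

```erlang
-module(wal_writer_sketch).
-behaviour(gen_server).

-export([start_link/1, append/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(FLUSH_INTERVAL_MS, 20).   %% illustrative batching window

start_link(Path) ->
    gen_server:start_link(?MODULE, Path, []).

%% Blocks until the hunk bytes are durable (fsync'ed) on disk.
append(Pid, HunkBytes) ->
    gen_server:call(Pid, {append, HunkBytes}).

init(Path) ->
    {ok, Fd} = file:open(Path, [append, raw, binary]),
    {ok, #{fd => Fd, pending => []}}.

handle_call({append, Bytes}, From, #{fd := Fd, pending := P} = S) ->
    ok = file:write(Fd, Bytes),
    case P of
        [] -> erlang:send_after(?FLUSH_INTERVAL_MS, self(), flush);
        _  -> ok
    end,
    %% Don't reply yet; the caller waits until the batch is fsync'ed.
    {noreply, S#{pending := [From | P]}}.

handle_info(flush, #{fd := Fd, pending := P} = S) ->
    ok = file:sync(Fd),
    _ = [gen_server:reply(From, ok) || From <- lists:reverse(P)],
    {noreply, S#{pending := []}}.

handle_cast(_Msg, S) ->
    {noreply, S}.
```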

tatsuya6502 commented 9 years ago

https://github.com/hibari/gdss-brick/issues/17#issuecomment-59618957

> The next step will be to implement the write-back process from the WAL to the metadata DB (HyperLevelDB) and the value blob hlog files.
>
> After that, I will update the brick_ets server to use the new hlog format for writing and reading.

I started working on the above. I actually began with the latter, and Hibari can now bootstrap from the new hlog modules (though it's not very useful yet without the write-back process and scavenger).

Also, before I start committing these work-in-progress changes, I created an annotated git tag 2014-10-metadatadb on the current branch heads of the gdss_brick and gdss_admin projects. That revision was tested with basho_bench and showed no errors in six-hour runs.

tatsuya6502 commented 9 years ago

Merged recent changes on the dev branch (post v0.1.11) into the gbrick-gh17-redesign-disk-storage branch.

tatsuya6502 commented 9 years ago

https://github.com/hibari/gdss-brick/issues/17#issuecomment-59618957

> The next step will be to implement the write-back process from the WAL to the metadata DB (HyperLevelDB) and the value blob hlog files.
>
> After that, I will update the brick_ets server to use the new hlog format for writing and reading.
>
> I started working on the above. I actually began with the latter, and Hibari can now bootstrap from the new hlog modules (though it's not very useful yet without the write-back process and scavenger).

After a long pause (Oct 2014 -- May 2015), I resumed working on this topic (to implement the write-back process). I made a couple of commits on the topic branch of gdss_brick, and so far I have confirmed that all WAL hunks written to a WAL file can be parsed back into hunk records.
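That check is essentially a round-trip property; a sketch of what it amounts to, with hypothetical encode_hunk/1 and parse_hunks/1 helpers standing in for the real hunk codec:

```erlang
%% Illustrative round-trip check: every hunk encoded and written to a
%% WAL file must parse back to an equal record.
roundtrip_ok(Hunks) ->
    Bin = iolist_to_binary([encode_hunk(H) || H <- Hunks]),
    {ok, Parsed} = parse_hunks(Bin),
    Parsed =:= Hunks.
```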

I'm trying to complete this task by the end of this month (May 2015).

tatsuya6502 commented 9 years ago

https://github.com/hibari/gdss-brick/issues/17#issuecomment-32145192

> Status as of January 13th, 2014:
>
>   • common-log files (write ahead log)
>     • Change the internal format for a more efficient and controlled write-back process. -- TODO
>   • brick private, metadata DB (LevelDB)
>     • Add metadata DB and write-back process for it. -- DONE
>     • Remove brick private hlog files. -- DONE
>     • Remove checkpoint process and shadow ETS table. -- DONE except updating test cases.
>     • Update brick_ets to load metadata records (store tuples) from metadata DB. -- DONE
>   • brick private, value blob store (long-term hlog files)
>     • Change long-term logs from shared to brick private. -- TODO
>     • Note: scavenger-related code has been moved to a new module, brick_hlog_scavenger

OK, I'm basically done with the two TODO items above. Now each key-value's metadata (key, timestamp, user-provided property list, and expiration time) is stored in a brick-private metadata DB (HyperLevelDB), and the value is stored in a brick-private hlog file.

```
% find data/ -type f | sort
data/brick/bootstrap_copy1/blob/000000000001.BLOB
data/brick/bootstrap_copy1/metadata/leveldb/000003.sst
data/brick/bootstrap_copy1/metadata/leveldb/000006.sst
data/brick/bootstrap_copy1/metadata/leveldb/000008.log
data/brick/bootstrap_copy1/metadata/leveldb/CURRENT
data/brick/bootstrap_copy1/metadata/leveldb/LOCK
data/brick/bootstrap_copy1/metadata/leveldb/LOG
data/brick/bootstrap_copy1/metadata/leveldb/LOG.old
data/brick/bootstrap_copy1/metadata/leveldb/MANIFEST-000007
data/brick/bootstrap_copy1/metadata/leveldb/lost/...
data/brick/perf1_ch10_b1/blob/000000000001.BLOB
data/brick/perf1_ch10_b1/metadata/leveldb/000003.sst
data/brick/perf1_ch10_b1/metadata/leveldb/000006.sst
data/brick/perf1_ch10_b1/metadata/leveldb/000008.log
data/brick/perf1_ch10_b1/metadata/leveldb/CURRENT
data/brick/perf1_ch10_b1/metadata/leveldb/LOCK
data/brick/perf1_ch10_b1/metadata/leveldb/LOG
data/brick/perf1_ch10_b1/metadata/leveldb/LOG.old
data/brick/perf1_ch10_b1/metadata/leveldb/MANIFEST-000007
data/brick/perf1_ch10_b1/metadata/leveldb/lost/...
data/brick/perf1_ch10_b2/blob/000000000001.BLOB
data/brick/perf1_ch10_b2/metadata/leveldb/000003.sst
data/brick/perf1_ch10_b2/metadata/leveldb/000006.sst
data/brick/perf1_ch10_b2/metadata/leveldb/000008.log
data/brick/perf1_ch10_b2/metadata/leveldb/CURRENT
data/brick/perf1_ch10_b2/metadata/leveldb/LOCK
data/brick/perf1_ch10_b2/metadata/leveldb/LOG
data/brick/perf1_ch10_b2/metadata/leveldb/LOG.old
data/brick/perf1_ch10_b2/metadata/leveldb/MANIFEST-000007
data/brick/perf1_ch10_b2/metadata/leveldb/lost/...
...
data/wal_hlog/000000000002.HLOG
data/wal_hlog/000000000003.HLOG
```

Now the last big part will be re-implementing the scavenger (aka the compaction process) from scratch. I hope I can finish it in two weeks.

tatsuya6502 commented 9 years ago

I spent the last few days on the following:

  1. Brushing up the write-back process. This was done in commit https://github.com/hibari/gdss-brick/commit/ad68753d58caa5711e74a6b7b5321e386d8a27f0
  2. Re-implementing the compaction process (aka the scavenger process) from scratch.

The new compaction process should be much more efficient than the scavenger implementation in the current v0.1 series. Here is the current design:

Also, I'm planning to store small values in the metadata DB (HyperLevelDB) rather than in the blob hlog files. HyperLevelDB has an efficient compaction implementation in C++, so I hope this design change will improve the overall compaction efficiency too.
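A sketch of that placement rule, with an invented size threshold and placeholder helpers (put_meta/3 and append_blob/2 are not real Hibari functions):

```erlang
-define(EMBED_THRESHOLD, 512).  %% bytes; illustrative value only

store_value(Key, Value, MetaDb, _BlobStore)
  when byte_size(Value) =< ?EMBED_THRESHOLD ->
    %% Small value: keep it inline in the metadata DB so that
    %% HyperLevelDB's C++ compaction manages its space.
    put_meta(MetaDb, Key, {inline, Value});
store_value(Key, Value, MetaDb, BlobStore) ->
    %% Large value: append to the brick-private blob hlog and store
    %% only its location {SeqNum, Offset} in the metadata DB.
    {ok, Location} = append_blob(BlobStore, Value),
    put_meta(MetaDb, Key, {location, Location}).
```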

tatsuya6502 commented 9 years ago

As for the compaction process, I have implemented the following main functions:

```erlang
-spec estimate_live_hunk_ratio(brickname(), seqnum()) -> {ok, float()} | {error, term()}.
-spec compact_hlog_file(brickname(), seqnum()) -> ok | {error, term()}.
```

The former estimates the live/dead blob hunk ratio per hlog file by comparing randomly sampled keys against the metadata DB. The latter runs compaction on an hlog file to reclaim disk space and updates the storage locations of the surviving live hunks.

The next step will be to implement a periodic task to estimate the live hunk ratio, score hlog files, and pick an hlog file to compact.
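The estimator itself might look roughly like this (an illustrative sketch, not the real estimate_live_hunk_ratio/2; lookup_location/2 stands in for the metadata DB lookup): sample a fraction of the keys recorded in an hlog file and count how many still point into that file.

```erlang
estimate_ratio(KeysInFile, SeqNum, MetaDb, SampleFraction) ->
    Sample = [K || K <- KeysInFile, rand:uniform() < SampleFraction],
    Live   = [K || K <- Sample, is_live(MetaDb, K, SeqNum)],
    case length(Sample) of
        0 -> unknown;              %% nothing sampled
        N -> length(Live) / N      %% estimated live hunk ratio
    end.

%% A key is live in file SeqNum if the metadata DB still locates its
%% current value there.
is_live(MetaDb, Key, SeqNum) ->
    case lookup_location(MetaDb, Key) of
        {ok, {SeqNum, _Offset}} -> true;
        _                       -> false
    end.
```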

tatsuya6502 commented 9 years ago

> The next step will be to implement a periodic task to estimate the live hunk ratio, score hlog files, and pick an hlog file to compact.

As the first step, I created a temporary function to estimate live hunk ratios of all value blob hlog files on a node (commit: https://github.com/hibari/gdss-brick/commit/53a697eea8525f354a01f2c7e6d6a4a8800ab648#diff-8bd48a5a77dd1a285b7729b3766317e0R110).

Here is a sample run on a single-node Hibari with the perf1 table (chain length = 3, number of chains = 8):

```
(hibari@127.0.0.1)43> F = fun() -> lists:foreach(
(hibari@127.0.0.1)43>                fun({B, S, unknown}) ->
(hibari@127.0.0.1)43>                        io:format("~s (~w): unknown~n", [B, S]);
(hibari@127.0.0.1)43>                   ({B, S, R}) ->
(hibari@127.0.0.1)43>                        io:format("~s (~w): ~.2f%~n", [B, S, R * 100])
(hibari@127.0.0.1)43>                end, brick_blob_store_hlog_compaction:list_hlog_files())
(hibari@127.0.0.1)43>     end.
#Fun<erl_eval.20.90072148>
(hibari@127.0.0.1)44> F().
bootstrap_copy1 (1): 21.43%
perf1_ch1_b1 (1): 4.91%
perf1_ch1_b1 (2): 26.89%
perf1_ch1_b1 (3): 65.67%
perf1_ch1_b2 (1): 3.35%
...
```

Here are the numbers for perf1 chain 1, grouped by hlog sequence number and ordered by brick:

```
perf1_ch1_b1 (1): 4.91%
perf1_ch1_b2 (1): 3.35%
perf1_ch1_b3 (1): 5.84%

perf1_ch1_b1 (2): 26.89%
perf1_ch1_b2 (2): 25.12%
perf1_ch1_b3 (2): 25.17%

perf1_ch1_b1 (3): 65.82%
perf1_ch1_b2 (3): 82.28%
perf1_ch1_b3 (3): 72.60%
```

All bricks (b1, b2, b3) in a chain should have exactly the same contents for each hlog file with a given sequence number (1, 2, 3). However, the estimated live hunk ratios differ (for example, 4.91%, 3.35%, 5.84%, or 65.82%, 82.28%, 72.60%). This is because the estimation is based on randomly sampled keys; it currently uses about 5% of the keys in an hlog file.

I think the current setting still provides estimates with good enough precision.
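For a rough sense of that precision (numbers invented for illustration): an estimate from n sampled keys of a true live ratio p has a standard error of about sqrt(p(1-p)/n). Sampling n = 1,000 keys of a file whose true ratio is p = 0.25 gives sqrt(0.25 × 0.75 / 1000) ≈ 0.014, i.e. roughly ±1.4 percentage points at one standard error; files with fewer sampled keys will show proportionally wider spread, which is one reason identical files can produce noticeably different estimates.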

tatsuya6502 commented 9 years ago

> The next step will be to implement a periodic task to estimate the live hunk ratio, score hlog files, and pick an hlog file to compact.

I have implemented a very basic version of the above. (The last commit was: https://github.com/hibari/gdss-brick/commit/de7c990b408b36b6b765c936c1800d5c56249cf0) There are lots of places to improve, but the new disk storage format now has a complete set of functions.
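Scoring and picking can be as simple as sorting by estimated live ratio; here is an illustrative sketch over the {Brick, SeqNum, Ratio} tuples that list_hlog_files() was shown returning above (the function name is hypothetical):

```erlang
%% Pick the file with the lowest estimated live ratio, i.e. the most
%% reclaimable space; files whose ratio is still unknown are skipped.
pick_compaction_target(Files) ->
    Scored = [{Ratio, Brick, Seq} || {Brick, Seq, Ratio} <- Files,
                                     Ratio =/= unknown],
    case lists:sort(Scored) of
        []                        -> none;
        [{Ratio, Brick, Seq} | _] -> {Brick, Seq, Ratio}
    end.
```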

I'll shift my focus to other tasks but will continue improving this feature too. I'm planning to release Hibari v0.3 with this feature sometime this fall (2015).

tatsuya6502 commented 9 years ago

I found and fixed a bug in the write-back process that could miss some log hunks in the WAL when group commit is enabled:

Commit: https://github.com/hibari/gdss-brick/commit/32428e411fb54defbf7600e07a4fb0531547d75b

tatsuya6502 commented 8 years ago

https://github.com/hibari/gdss-brick/issues/17#issuecomment-59352027

> I open-sourced the Erlang binding to HyperLevelDB. I'll update the brick server's code to utilize it. https://github.com/leveldb-erlang/h2leveldb
>
> Done.

It seems HyperLevelDB is no longer actively developed; the last commit was made in Sep 2014. I'm thinking of switching to RocksDB, another fork of LevelDB; it is very actively developed and has a large user base.

tatsuya6502 commented 8 years ago

I found and fixed a couple of bugs related to node restart (https://github.com/hibari/gdss-brick/compare/91b62e049e...9f9efe9052).

I also removed the obsolete gmt_hlog* and scavenger modules from the v0.3 stream (https://github.com/hibari/gdss-brick/commit/44fd692ca89e4104bf7a500b2ee9ed8ec8006d9a) and updated the app.src of the gdss_brick application (https://github.com/hibari/gdss-brick/commit/e0e71ead05543b4127373ce968263d97ad6eda70).
