Closed by tatsuya6502 9 years ago
hibari1 - console.log
2013-03-30 03:01:05.421 [info] <0.5250.46>@brick_ets:make_info_log_fun:3558 SCAV: Updating locations for sequence 1 (true)
2013-03-30 03:01:06.144 [info] <0.5250.46>@brick_ets:make_info_log_fun:3558 SCAV: commonLogServer sequence 1: UpRes ok, 12623 updates
2013-03-30 03:01:15.782 [info] <0.5250.46>@brick_ets:make_info_log_fun:3558 SCAV: Updating locations for sequence 3 (true)
2013-03-30 03:01:16.386 [info] <0.5250.46>@brick_ets:make_info_log_fun:3558 SCAV: commonLogServer sequence 3: UpRes ok, 10017 updates
2013-03-30 03:01:26.235 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330030126 [{0,4},{282,282},{900824,2357540}]
2013-03-30 03:01:29.376 [info] <0.25906.46>@gmt_hlog:spawn_sync_asyncly:1357 commonLogServer sync was 213 msec for 1 callers
2013-03-30 03:01:46.622 [info] <0.5250.46>@brick_ets:make_info_log_fun:3558 SCAV: Updating locations for sequence 4 (true)
2013-03-30 03:01:49.194 [info] <0.5250.46>@brick_ets:make_info_log_fun:3558 SCAV: commonLogServer sequence 4: UpRes ok, 38232 updates
2013-03-30 03:01:54.519 [info] <0.28395.46>@gmt_hlog:spawn_sync_asyncly:1357 commonLogServer sync was 312 msec for 1 callers
...
2013-03-30 03:11:14.029 [info] <0.5250.46>@brick_ets:make_info_log_fun:3558 Scavenger finished:
Options: [destructive,{skip_live_percentage_greater_than,90},{work_dir,"/scav/hibari1_scavenger"}]
Logs deleted without scavenging: 2 (24 bytes)
Logs scavenged: 21
Copied: hunks bytes errs = 763583 781908992 0
Reclaimed bytes = 308037249
...
2013-03-30 04:00:26.236 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330040026 [{1,4},{277,285},{900824,2357540}]
2013-03-30 04:00:57.401 [warning] <0.31010.58>@brick_ets:bigdata_dir_get_val:2922 stability_test_ch3_b3: error {error,enoent} at 1 36210172
2013-03-30 04:00:57.403 [warning] <0.610.0>@brick_ets:sequence_file_is_bad_common:3022 Brick commonLogServer bad sequence file: seq 1 offset 36210172: rename data/brick/hlog.commonLogServer/s/000000000001.HLOG or data/brick/hlog.commonLogServer/2/2/-000000000001.HLOG to data/brick/hlog.commonLogServer/BAD-CHECKSUM/1.at.offset.36210172 failed: enoent
2013-03-30 04:00:57.405 [info] <0.769.0>@brick_ets:purge_recs_by_seqnum:2006 stability_test_ch3_b3: purging keys with sequence 1, size 259805
2013-03-30 04:00:57.407 [info] <0.783.0>@brick_ets:purge_recs_by_seqnum:2006 stability_test_ch1_b1: purging keys with sequence 1, size 368467
2013-03-30 04:00:57.408 [info] <0.758.0>@brick_ets:purge_recs_by_seqnum:2006 stability_test_ch4_b2: purging keys with sequence 1, size 347070
2013-03-30 04:00:57.818 [info] <0.769.0>@brick_server:do_common_log_sequence_file_is_bad:5555 common_log_sequence_file_is_bad: stability_test_ch3_b3: 1 purged keys for log 1
2013-03-30 04:00:57.897 [info] <0.758.0>@brick_server:do_common_log_sequence_file_is_bad:5555 common_log_sequence_file_is_bad: stability_test_ch4_b2: 0 purged keys for log 1
2013-03-30 04:00:57.900 [info] <0.783.0>@brick_server:do_common_log_sequence_file_is_bad:5555 common_log_sequence_file_is_bad: stability_test_ch1_b1: 1 purged keys for log 1
2013-03-30 04:00:58.340 [info] <0.615.0>@brick_mboxmon:handle_info:152 node 'basho_bench@127.0.0.1' down Extra [{nodedown_reason,connection_closed},{node_type,visible}]
2013-03-30 04:00:58.853 [error] <0.690.0>@brick_admin:run_client_monitor_procs:2171 Node 'basho_bench@127.0.0.1' application gdss_client is stopped
2013-03-30 04:00:58.855 [info] <0.530.0> alarm_handler: {set,{{client_node_down,'basho_bench@127.0.0.1'},"GDSS application is not running on this node."}}
2013-03-30 04:00:59.510 [info] <0.696.0>@brick_chainmon:set_all_chain_roles:1307 set_all_chain_roles: stability_test_ch1: top
2013-03-30 04:00:59.510 [info] <0.530.0>@brick_admin_event_h:handle_event:120 handle_event: {set_alarm,{{disk_error,{stability_test_ch1_b1,'hibari1@127.0.0.1'}},"Administrator intervention is required."}}
2013-03-30 04:00:59.510 [info] <0.530.0> alarm_handler: {set,{{disk_error,{stability_test_ch1_b1,'hibari1@127.0.0.1'}},"Administrator intervention is required."}}
2013-03-30 04:00:59.511 [info] <0.696.0>@brick_chainmon:set_all_chain_roles:1309 set_all_chain_roles: stability_test_ch1: brick read-only 1
2013-03-30 04:00:59.514 [info] <0.696.0>@brick_chainmon:set_all_chain_roles:1352 New head {stability_test_ch1_b2,'hibari2@127.0.0.1'}: ChainDownSerial 368037 /= ChainDownAcked -4242, reflushing log downstream
2013-03-30 04:00:59.516 [info] <0.696.0>@brick_chainmon:set_all_chain_roles:1360 New head {stability_test_ch1_b2,'hibari2@127.0.0.1'}: flush was ok
2013-03-30 04:00:59.516 [info] <0.645.0>@brick_admin:do_chain_status_change:1410 status_change: Chain stability_test_ch1 status degraded belongs to tab stability_test
2013-03-30 04:00:59.528 [info] <0.645.0>@brick_admin:do_spam_gh_to_all_nodes:1478 do_spam_gh_to_all_nodes: top
2013-03-30 04:00:59.529 [info] <0.645.0>@brick_admin:do_spam_gh_to_all_nodes:1488 do_spam_gh_to_all_nodes: minor_rev=74
2013-03-30 04:00:59.529 [info] <0.645.0>@brick_admin:do_chain_status_change:1410 status_change: Chain stability_test_ch1 status degraded belongs to tab stability_test
2013-03-30 04:00:59.542 [info] <0.645.0>@brick_admin:do_spam_gh_to_all_nodes:1478 do_spam_gh_to_all_nodes: top
2013-03-30 04:00:59.543 [info] <0.645.0>@brick_admin:do_spam_gh_to_all_nodes:1488 do_spam_gh_to_all_nodes: minor_rev=75
2013-03-30 04:00:59.697 [info] <0.698.0>@brick_chainmon:set_all_chain_roles:1307 set_all_chain_roles: stability_test_ch3: top
2013-03-30 04:00:59.697 [info] <0.530.0>@brick_admin_event_h:handle_event:120 handle_event: {set_alarm,{{disk_error,{stability_test_ch3_b3,'hibari1@127.0.0.1'}},"Administrator intervention is required."}}
2013-03-30 04:00:59.697 [info] <0.530.0> alarm_handler: {set,{{disk_error,{stability_test_ch3_b3,'hibari1@127.0.0.1'}},"Administrator intervention is required."}}
2013-03-30 04:00:59.697 [info] <0.698.0>@brick_chainmon:set_all_chain_roles:1309 set_all_chain_roles: stability_test_ch3: brick read-only 1
2013-03-30 04:00:59.701 [info] <0.698.0>@brick_chainmon:set_all_chain_roles:1352 New head {stability_test_ch3_b1,'hibari3@127.0.0.1'}: ChainDownSerial 259439 /= ChainDownAcked -4242, reflushing log downstream
2013-03-30 04:00:59.703 [info] <0.698.0>@brick_chainmon:set_all_chain_roles:1360 New head {stability_test_ch3_b1,'hibari3@127.0.0.1'}: flush was ok
2013-03-30 04:00:59.703 [info] <0.645.0>@brick_admin:do_chain_status_change:1410 status_change: Chain stability_test_ch3 status degraded belongs to tab stability_test
2013-03-30 04:00:59.715 [info] <0.645.0>@brick_admin:do_spam_gh_to_all_nodes:1478 do_spam_gh_to_all_nodes: top
2013-03-30 04:00:59.715 [info] <0.645.0>@brick_admin:do_spam_gh_to_all_nodes:1488 do_spam_gh_to_all_nodes: minor_rev=76
2013-03-30 04:00:59.716 [info] <0.645.0>@brick_admin:do_chain_status_change:1410 status_change: Chain stability_test_ch3 status degraded belongs to tab stability_test
Basho Bench - console.log.0
I'll update the driver so that it can handle those errors rather than crashing.
2013-03-29 23:07:44.851 [info] <0.184.0>@basho_bench_worker:worker_idle_loop:227 Starting 100.0 ms/req fixed rate worker: <0.184.0>
2013-03-29 23:07:44.851 [info] <0.182.0>@basho_bench_worker:worker_idle_loop:227 Starting 100.0 ms/req fixed rate worker: <0.182.0>
2013-03-30 04:00:57.820 [debug] <0.205.0>@basho_bench_worker:worker_next_op:265 Driver basho_bench_driver_hibari_native crashed: {{case_clause,{error,current_repair_state,disk_error}},[{basho_bench_driver_hibari_native,run,4,[{file,"src/basho_bench_driver_hibari_native.erl"},{line,78}]},{basho_bench_worker,worker_next_op2,2,[{file,"src/basho_bench_worker.erl"},{line,233}]},{basho_bench_worker,worker_next_op,1,[{file,"src/basho_bench_worker.erl"},{line,239}]},{basho_bench_worker,rate_worker_run_loop,2,[{file,"src/basho_bench_worker.erl"},{line,308}]}]}
2013-03-30 04:00:57.820 [debug] <0.215.0>@basho_bench_worker:worker_next_op:265 Driver basho_bench_driver_hibari_native crashed: {{case_clause,{error,current_repair_state,disk_error}},[{basho_bench_driver_hibari_native,run,4,[{file,"src/basho_bench_driver_hibari_native.erl"},{line,78}]},{basho_bench_worker,worker_next_op2,2,[{file,"src/basho_bench_worker.erl"},{line,233}]},{basho_bench_worker,worker_next_op,1,[{file,"src/basho_bench_worker.erl"},{line,239}]},{basho_bench_worker,rate_worker_run_loop,2,[{file,"src/basho_bench_worker.erl"},{line,308}]}]}
2013-03-30 04:00:57.820 [error] <0.204.0>@basho_bench_worker:handle_info:149 Worker <0.205.0> exited with crash
2013-03-30 04:00:57.820 [error] <0.214.0>@basho_bench_worker:handle_info:149 Worker <0.215.0> exited with crash
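For reference, here is a minimal sketch of how the driver's run/4 callback could map that reply to a driver-level error instead of hitting a case_clause crash. It assumes the standard basho_bench driver callback shape; do_get/2 and the reply patterns other than the 3-tuple seen in the log above are illustrative assumptions, not the actual basho_bench_driver_hibari_native code.

```erlang
%% Hypothetical sketch, not the actual basho_bench_driver_hibari_native code.
%% basho_bench expects run/4 to return {ok, State} or {error, Reason, State};
%% mapping the unexpected 3-tuple reply to {error, ...} lets basho_bench count
%% a failed operation instead of crashing the worker on a case_clause.
run(get, KeyGen, _ValueGen, State) ->
    Key = KeyGen(),
    case do_get(State, Key) of                         %% do_get/2 is illustrative
        {ok, _TStamp, _Value} ->                       %% assumed success shape
            {ok, State};
        key_not_exist ->                               %% assumed miss shape
            {ok, State};
        {error, current_repair_state, disk_error} ->   %% the reply seen in the log
            {error, disk_error, State};
        {error, Reason} ->
            {error, Reason, State}
    end.
```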
SmartOS machine info
# uname -a
SunOS 4f80810f-1aa5-4df3-9043-a625c327094d.local 5.11 joyent_20130111T180733Z i86pc i386 i86pc Solaris
# erl
Erlang R15B02 (erts-5.9.2) [source] [smp:24:24] [async-threads:0] [kernel-poll:false] [dtrace]
Eshell V5.9.2 (abort with ^G)
1>
User switch command
--> q
# sm-summary
* Gathering SmartMachine summary..
SM UUID ***
SM ID ***
Hostname ***
SmartOS build joyent_20130111T180733Z
Image base 1.9.1
Base Image NA
Documentation http://wiki.joyent.com/jpc2/SmartMachine+Base
Pkgsrc http://pkgsrc.joyent.com/packages/SmartOS/2012Q4/i386/All
Processes 81
Memory (RSS) Cap 4096M
Memory (RSS) Used 260M
Memory (RSS) Free 3836M
Swap Cap 8192M
Swap Used 125M
/tmp Used 4.0K
Disk Quota 121G
% Disk Used 8%
# find /data/hibari1_data
/data/hibari1_data
/data/hibari1_data/brick
/data/hibari1_data/brick/hlog.stability_test_ch3_b3
/data/hibari1_data/brick/hlog.stability_test_ch3_b3/sequence_number
/data/hibari1_data/brick/hlog.stability_test_ch3_b3/s
/data/hibari1_data/brick/hlog.stability_test_ch3_b3/s/000000000003.HLOG
/data/hibari1_data/brick/hlog.stability_test_ch3_b3/s/000000000002.HLOG
/data/hibari1_data/brick/hlog.stability_test_ch3_b3/bad-sequence
/data/hibari1_data/brick/hlog.bootstrap_copy1
/data/hibari1_data/brick/hlog.bootstrap_copy1/s
/data/hibari1_data/brick/hlog.bootstrap_copy1/s/000000000001.HLOG
/data/hibari1_data/brick/hlog.bootstrap_copy1/s/000000000002.HLOG
/data/hibari1_data/brick/hlog.bootstrap_copy1/sequence_number
/data/hibari1_data/brick/hlog.bootstrap_copy1/bad-sequence
/data/hibari1_data/brick/hlog.stability_test_ch4_b2
/data/hibari1_data/brick/hlog.stability_test_ch4_b2/sequence_number
/data/hibari1_data/brick/hlog.stability_test_ch4_b2/s
/data/hibari1_data/brick/hlog.stability_test_ch4_b2/s/000000000002.HLOG
/data/hibari1_data/brick/hlog.stability_test_ch4_b2/s/000000000003.HLOG
/data/hibari1_data/brick/hlog.stability_test_ch4_b2/bad-sequence
/data/hibari1_data/brick/hlog.tab1_ch3_b1
/data/hibari1_data/brick/hlog.tab1_ch3_b1/bad-sequence
/data/hibari1_data/brick/hlog.tab1_ch3_b1/sequence_number
/data/hibari1_data/brick/hlog.tab1_ch3_b1/s
/data/hibari1_data/brick/hlog.tab1_ch2_b1
/data/hibari1_data/brick/hlog.tab1_ch2_b1/sequence_number
/data/hibari1_data/brick/hlog.tab1_ch2_b1/bad-sequence
/data/hibari1_data/brick/hlog.tab1_ch2_b1/s
/data/hibari1_data/brick/hlog.commonLogServer
/data/hibari1_data/brick/hlog.commonLogServer/flush
/data/hibari1_data/brick/hlog.commonLogServer/1
/data/hibari1_data/brick/hlog.commonLogServer/1/3
/data/hibari1_data/brick/hlog.commonLogServer/1/3/-000000000030.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/1/4
/data/hibari1_data/brick/hlog.commonLogServer/1/4/-000000000024.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/1/7
/data/hibari1_data/brick/hlog.commonLogServer/1/7/-000000000027.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/1/5
/data/hibari1_data/brick/hlog.commonLogServer/1/2
/data/hibari1_data/brick/hlog.commonLogServer/1/2/-000000000036.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/1/6
/data/hibari1_data/brick/hlog.commonLogServer/1/6/-000000000033.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/1/1
/data/hibari1_data/brick/hlog.commonLogServer/BAD-CHECKSUM
/data/hibari1_data/brick/hlog.commonLogServer/2
/data/hibari1_data/brick/hlog.commonLogServer/2/6
/data/hibari1_data/brick/hlog.commonLogServer/2/1
/data/hibari1_data/brick/hlog.commonLogServer/2/1/-000000000028.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/2/5
/data/hibari1_data/brick/hlog.commonLogServer/2/5/-000000000025.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/2/2
/data/hibari1_data/brick/hlog.commonLogServer/2/7
/data/hibari1_data/brick/hlog.commonLogServer/2/7/-000000000034.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/2/3
/data/hibari1_data/brick/hlog.commonLogServer/2/3/-000000000037.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/2/4
/data/hibari1_data/brick/hlog.commonLogServer/2/4/-000000000031.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/s
/data/hibari1_data/brick/hlog.commonLogServer/s/000000000039.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/s/Config
/data/hibari1_data/brick/hlog.commonLogServer/3
/data/hibari1_data/brick/hlog.commonLogServer/3/1
/data/hibari1_data/brick/hlog.commonLogServer/3/1/-000000000035.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/3/6
/data/hibari1_data/brick/hlog.commonLogServer/3/6/-000000000026.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/3/2
/data/hibari1_data/brick/hlog.commonLogServer/3/2/-000000000029.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/3/5
/data/hibari1_data/brick/hlog.commonLogServer/3/5/-000000000032.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/3/7
/data/hibari1_data/brick/hlog.commonLogServer/3/4
/data/hibari1_data/brick/hlog.commonLogServer/3/4/-000000000038.HLOG
/data/hibari1_data/brick/hlog.commonLogServer/3/3
/data/hibari1_data/brick/hlog.commonLogServer/register
/data/hibari1_data/brick/hlog.commonLogServer/register/stability_test_ch3_b3
/data/hibari1_data/brick/hlog.commonLogServer/register/bootstrap_copy1
/data/hibari1_data/brick/hlog.commonLogServer/register/tab1_ch3_b1
/data/hibari1_data/brick/hlog.commonLogServer/register/tab1_ch2_b1
/data/hibari1_data/brick/hlog.commonLogServer/register/tab1_ch1_b1
/data/hibari1_data/brick/hlog.commonLogServer/register/stability_test_ch1_b1
/data/hibari1_data/brick/hlog.commonLogServer/register/stability_test_ch4_b2
/data/hibari1_data/brick/hlog.stability_test_ch1_b1
/data/hibari1_data/brick/hlog.stability_test_ch1_b1/s
/data/hibari1_data/brick/hlog.stability_test_ch1_b1/s/000000000002.HLOG
/data/hibari1_data/brick/hlog.stability_test_ch1_b1/s/000000000003.HLOG
/data/hibari1_data/brick/hlog.stability_test_ch1_b1/bad-sequence
/data/hibari1_data/brick/hlog.stability_test_ch1_b1/sequence_number
/data/hibari1_data/brick/hlog.tab1_ch1_b1
/data/hibari1_data/brick/hlog.tab1_ch1_b1/s
/data/hibari1_data/brick/hlog.tab1_ch1_b1/bad-sequence
/data/hibari1_data/brick/hlog.tab1_ch1_b1/sequence_number
# df -h
Filesystem Size Used Avail Use% Mounted on
zones/**** 121G 9.1G 112G 8% /
/lib 269M 246M 24M 92% /lib
/lib/svc/manifest 2.3T 567K 2.3T 1% /lib/svc/manifest
/lib/svc/manifest/site 121G 9.1G 112G 8% /lib/svc/manifest/site
/sbin 269M 246M 24M 92% /sbin
/usr 377M 353M 24M 94% /usr
/usr/ccs 121G 9.1G 112G 8% /usr/ccs
/usr/local 121G 9.1G 112G 8% /usr/local
swap 8.0G 86M 8.0G 2% /etc/svc/volatile
/usr/lib/libc/libc_hwcap1.so.1 377M 353M 24M 94% /lib/libc.so.1
swap 8.0G 86M 8.0G 2% /tmp
swap 8.0G 86M 8.0G 2% /var/run
Some thoughts
One suspect is brick_simple:rename/6, which internally uses ?KEY_SWITCHAROO. There is a chance that the scavenger is not aware of this operation.
By the way, I wanted to try Joyent Cloud just to play with dtrace-based sub-second offset heat maps, which are cool: http://dtrace.org/blogs/brendan/2012/03/26/subsecond-offset-heat-maps/
This wouldn't be related to this GitHub issue, but I'm curious whether having smp:24:24 is OK while the number of vCPUs in the Solaris zone is capped at 4.
Hibari - erlang.log.1
=====
===== LOGGING STARTED Fri Mar 29 23:06:25 CST 2013
=====
Exec: /opt/hibari/hibari1/hibari/erts-5.9.2/bin/erlexec -boot /opt/hibari/hibari1/hibari/releases/0.3.0/hibari -mode embedded -config /opt/hibari/hibari1/hibari/releases/0.3.0/sys.config -args_file /opt/hibari/hibari1/hibari/releases/0.3.0/vm.args -- console
Root: /opt/hibari/hibari1/hibari
Erlang R15B02 (erts-5.9.2) [source] [smp:24:24] [async-threads:64] [kernel-poll:true] [dtrace]
I'm still investigating this issue, and have updated gdss-brick to make the logging more informative.
Commit: https://github.com/hibari/gdss-brick/commit/4dc196ea53ac8bf377d007f3fa787750d3789ae9
hibari >> GH33 (bad sequence file) related:
- Revise and add log messages to record purged keys in console.log.
- Raise logging level from [info] to [notice].
log messages - before
[warning] ... tab_ch3_b3: error {error,enoent} at 1 36210172
[info] ... tab_ch3_b3: purging keys with sequence 1, size 259805
[info] ... tab_ch3_b3: N purged keys for log 1
log messages - after
[warning] ... tab_ch3_b3: read error {error,enoent} at seq 1 offset 36210172
for the value of key <<"...">>
[notice] ... tab_ch3_b3: purging keys with sequence 1, size 259805
[notice] ... tab_ch3_b3: purged a key from log. key <<"...">>
...
[notice] ... tab_ch3_b3: purged a key from shadowtab. key <<"...">>
...
[notice] ... tab_ch3_b3: purged N keys for log 1
I rebuilt Hibari 0.3.0 RC with the above logging change, deleted the table, and restarted the same 8-hour stability test on the same SmartMachine at Joyent Cloud. I hope this will reproduce the issue.
It did reproduce the same issue, on the same node hibari1@127.0.0.1. That is the only node running the admin server.
I have run the same stability tests on the following machines before trying Joyent Cloud:
- Ran twice and found no issue on a zone-based SmartOS machine (2 GB RAM, 8 vCPUs) on my Mac mini (Late 2012, Core i7 at 2.6 GHz) + VMware Fusion. Pkgsrc is 2012Q4/x86_64, not 2012Q4/i386.
- Ran once and found no issue on a CentOS 6.4 virtual machine (6 GB RAM, 4 vCPUs) on my Mac mini + VMware Fusion.
I actually selected the 32-bit version of SmartOS on Joyent Cloud by mistake; I used 64-bit versions of SmartOS and CentOS in the local environments described above. Perhaps this issue is related to the 32-bit Erlang VM, maybe only on the illumos kernel?
To confirm this, I'm running the same test on the following machines now:
1. 32-bit SmartOS on Joyent Cloud
2. 64-bit SmartOS on Joyent Cloud
3. 64-bit SmartOS on my Mac mini + VMware Fusion
Note that the Erlang/OTP version is R15B02 for SmartOS and R15B03-1 for CentOS.
Unfortunately, I was able to reproduce the same issue on #1 and #3. I wonder why my earlier attempts on #3 didn't fail. I don't have a record, but I think I ran the scavenger only on the first node. (Today, I'm running the scavenger on all 4 nodes at the same time.) Or maybe it happens more often on the 32-bit version than on the 64-bit version; #2 is still running without the issue.
On #1, it happened again on hibari1@127.0.0.1, which had the admin server running. On #3, it happened on hibari4@127.0.0.1.
Log from #3
2013-03-30 19:58:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330195823 [{2,3},{197,201},{1113712,1113712}]
2013-03-30 19:59:23.194 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330195923 [{1,3},{199,201},{1113712,1113712}]
2013-03-30 20:00:02.354 [info] <0.610.0>@gmt_hlog_common:schedule_next_daily_scavenger:1496 Scheduling next scavenger 86398 seconds from now.
2013-03-30 20:00:02.354 [warning] <0.610.0>@gmt_hlog_common:handle_info:348 async writeback proc <0.7652.27> hasn't finished yet
2013-03-30 20:00:02.356 [info] <0.7651.27>@brick_ets:make_info_log_fun:3561 SCAV: start with [stability_test_ch2_b3,stability_test_ch4_b1,stability_test_ch3_b2]
2013-03-30 20:00:02.358 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 Scavenger <0.7654.27> starting now.
2013-03-30 20:00:04.017 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.7654.27>,long_gc,[{timeout,52},{old_heap_block_size,514229},{heap_block_size,121393},{mbuf_size,0},{stack_size,72},{old_heap_size,441054},{heap_size,84042}]} pinfo: [{registered_name,the_scavenger_proc},{initial_call,{erlang,apply,2}},{current_function,{brick_ets,storetuple_flags,1}}]
2013-03-30 20:00:06.032 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.7654.27>,long_gc,[{timeout,54},{old_heap_block_size,317811},{heap_block_size,121393},{mbuf_size,0},{stack_size,74},{old_heap_size,94604},{heap_size,84042}]} pinfo: [{registered_name,the_scavenger_proc},{initial_call,{erlang,apply,2}},{current_function,{timer,sleep,1}}]
2013-03-30 20:00:06.329 [info] <0.8774.27>@gmt_hlog:spawn_sync_asyncly:1357 commonLogServer sync was 297 msec for 1 callers
2013-03-30 20:00:07.486 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.7654.27>,long_gc,[{timeout,68},{old_heap_block_size,317811},{heap_block_size,121393},{mbuf_size,0},{stack_size,74},{old_heap_size,94614},{heap_size,84042}]} pinfo: [{registered_name,the_scavenger_proc},{initial_call,{erlang,apply,2}},{current_function,{timer,sleep,1}}]
2013-03-30 20:00:08.174 [info] <0.580.0>@gmt_sysmon_server:handle_info:122 1 extra: {monitor,<0.7654.27>,long_gc,info_filtered}
2013-03-30 20:00:10.275 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.7654.27>,long_gc,[{timeout,51},{old_heap_block_size,317811},{heap_block_size,121393},{mbuf_size,0},{stack_size,74},{old_heap_size,293923},{heap_size,84042}]} pinfo: [{registered_name,the_scavenger_proc},{initial_call,{erlang,apply,2}},{current_function,{brick_ets,'-prepend_rs/3-lc$^0/1-0-',3}}]
2013-03-30 20:00:13.994 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.7654.27>,long_gc,[{timeout,51},{old_heap_block_size,317811},{heap_block_size,121393},{mbuf_size,0},{stack_size,62},{old_heap_size,59717},{heap_size,35049}]} pinfo: [{registered_name,the_scavenger_proc},{initial_call,{erlang,apply,2}},{current_function,{dict,get_slot,2}}]
2013-03-30 20:00:16.101 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.7654.27>,long_gc,[{timeout,52},{old_heap_block_size,0},{heap_block_size,196418},{mbuf_size,0},{stack_size,76},{old_heap_size,0},{heap_size,96281}]} pinfo: [{registered_name,the_scavenger_proc},{initial_call,{erlang,apply,2}},{current_function,{timer,sleep,1}}]
2013-03-30 20:00:16.466 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.7654.27>,long_gc,[{timeout,51},{old_heap_block_size,317811},{heap_block_size,121393},{mbuf_size,0},{stack_size,62},{old_heap_size,59717},{heap_size,35056}]} pinfo: [{registered_name,the_scavenger_proc},{initial_call,{erlang,apply,2}},{current_function,{dict,store,3}}]
2013-03-30 20:00:16.746 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer finished step 1
2013-03-30 20:00:16.924 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer finished step 3
2013-03-30 20:00:17.460 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,84},{old_heap_block_size,0},{heap_block_size,832040},{mbuf_size,0},{stack_size,55},{old_heap_size,0},{heap_size,460242}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{file_sorter,to_bin,2}}]
2013-03-30 20:00:18.188 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,354},{old_heap_block_size,1682835},{heap_block_size,832040},{mbuf_size,0},{stack_size,50},{old_heap_size,639591},{heap_size,189660}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{file_sorter,fun_loop,6}}]
2013-03-30 20:00:19.176 [info] <0.580.0>@gmt_sysmon_server:handle_info:122 2 extra: {monitor,<0.12598.27>,long_gc,info_filtered}
2013-03-30 20:00:19.200 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,70},{old_heap_block_size,1682835},{heap_block_size,832040},{mbuf_size,0},{stack_size,48},{old_heap_size,1561208},{heap_size,370568}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{file_sorter,to_term,2}}]
2013-03-30 20:00:20.176 [info] <0.580.0>@gmt_sysmon_server:handle_info:122 1 extra: {monitor,<0.12598.27>,long_gc,info_filtered}
2013-03-30 20:00:20.446 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,135},{old_heap_block_size,514229},{heap_block_size,196418},{mbuf_size,0},{stack_size,50},{old_heap_size,205485},{heap_size,51143}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{disk_log,monitor_request,2}}]
2013-03-30 20:00:21.182 [info] <0.580.0>@gmt_sysmon_server:handle_info:122 3 extra: {monitor,<0.12598.27>,long_gc,info_filtered}
2013-03-30 20:00:21.186 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,147},{old_heap_block_size,0},{heap_block_size,1346269},{mbuf_size,0},{stack_size,55},{old_heap_size,0},{heap_size,914833}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{file_sorter,to_bin,2}}]
2013-03-30 20:00:22.176 [info] <0.580.0>@gmt_sysmon_server:handle_info:122 1 extra: {monitor,<0.12598.27>,long_gc,info_filtered}
2013-03-30 20:00:22.251 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,218},{old_heap_block_size,1682835},{heap_block_size,832040},{mbuf_size,0},{stack_size,48},{old_heap_size,1392494},{heap_size,514196}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{file_sorter,to_term,2}}]
2013-03-30 20:00:22.810 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12838.27>,long_gc,[{timeout,89},{old_heap_block_size,832040},{heap_block_size,514229},{mbuf_size,0},{stack_size,23},{old_heap_size,318069},{heap_size,196134}]} pinfo: [{initial_call,{proc_lib,init_p,5}},{current_function,{disk_log_1,logl,3}}]
2013-03-30 20:00:23.195 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330200023 [{1,3},{201,201},{1113712,1113712}]
2013-03-30 20:00:23.565 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,75},{old_heap_block_size,0},{heap_block_size,1346269},{mbuf_size,0},{stack_size,51},{old_heap_size,0},{heap_size,681155}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{disk_log,monitor_request,2}}]
2013-03-30 20:00:24.174 [info] <0.580.0>@gmt_sysmon_server:handle_info:122 1 extra: {monitor,<0.12598.27>,long_gc,info_filtered}
2013-03-30 20:00:24.254 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,51},{old_heap_block_size,0},{heap_block_size,2103540},{mbuf_size,0},{stack_size,43},{old_heap_size,0},{heap_size,1403643}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{lists,merge3_12_3,6}}]
2013-03-30 20:00:25.629 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,353},{old_heap_block_size,0},{heap_block_size,1682835},{mbuf_size,0},{stack_size,43},{old_heap_size,0},{heap_size,1417808}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{lists,rmerge3_1,6}}]
2013-03-30 20:00:26.179 [info] <0.580.0>@gmt_sysmon_server:handle_info:122 1 extra: {monitor,<0.12598.27>,long_gc,info_filtered}
2013-03-30 20:00:26.546 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,171},{old_heap_block_size,2629425},{heap_block_size,832040},{mbuf_size,0},{stack_size,27763},{old_heap_size,725412},{heap_size,234623}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{disk_log,terms2bins,1}}]
2013-03-30 20:00:27.182 [info] <0.580.0>@gmt_sysmon_server:handle_info:122 1 extra: {monitor,<0.12598.27>,long_gc,info_filtered}
2013-03-30 20:00:27.664 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,76},{old_heap_block_size,2629425},{heap_block_size,514229},{mbuf_size,0},{stack_size,42},{old_heap_size,1331163},{heap_size,78432}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{lists,mergel,2}}]
2013-03-30 20:00:28.176 [info] <0.580.0>@gmt_sysmon_server:handle_info:122 2 extra: {monitor,<0.12598.27>,long_gc,info_filtered}
2013-03-30 20:00:30.196 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.12598.27>,long_gc,[{timeout,121},{old_heap_block_size,0},{heap_block_size,2629425},{mbuf_size,0},{stack_size,58521},{old_heap_size,0},{heap_size,704062}]} pinfo: [{initial_call,{erlang,apply,2}},{current_function,{disk_log,terms2bins,1}}]
2013-03-30 20:00:30.421 [info] <0.580.0>@gmt_sysmon_server:handle_info:113 {monitor,<0.13593.27>,long_gc,[{timeout,53},{old_heap_block_size,987},{heap_block_size,317811},{mbuf_size,0},{stack_size,2},{old_heap_size,304},{heap_size,312279}]} pinfo: [{initial_call,{proc_lib,init_p,5}},{current_function,{disk_log,log_end,4}}]
2013-03-30 20:00:34.318 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer finished step 4: external child proc, sizes 65 and 98
2013-03-30 20:00:34.319 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer finished step 5
2013-03-30 20:00:34.324 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer: seq 1 percent 75.8263236019884
2013-03-30 20:00:35.332 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer: seq 3 percent 75.91776818694566
2013-03-30 20:00:35.332 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer: seq 4 percent 75.94176434380742
2013-03-30 20:00:35.333 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer: seq 5 percent 76.15638196194216
2013-03-30 20:00:35.333 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer: seq 6 percent 76.25145809103331
2013-03-30 20:00:35.333 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer: seq 7 percent 76.29444069407096
2013-03-30 20:00:35.333 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer: seq 8 percent 76.47620704197605
2013-03-30 20:00:35.334 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer: seq 9 percent 76.54912658858676
2013-03-30 20:00:35.334 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer: seq 10 percent 76.68048296028171
2013-03-30 20:00:35.334 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer: seq 11 percent 76.77855474674008
2013-03-30 20:00:35.334 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer: seq 12 percent 76.88504257958091
2013-03-30 20:00:35.335 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer finished step 6
2013-03-30 20:00:35.337 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer finished step 7
2013-03-30 20:00:37.691 [info] <0.14180.27>@gmt_hlog:spawn_sync_asyncly:1357 commonLogServer sync was 390 msec for 1 callers
2013-03-30 20:00:37.740 [info] <0.611.0>@gmt_hlog:handle_call:526 Write to "data/brick/hlog.commonLogServer" took 89 ms
2013-03-30 20:00:37.921 [info] <0.611.0>@gmt_hlog:handle_call:526 Write to "data/brick/hlog.commonLogServer" took 59 ms
2013-03-30 20:00:38.241 [info] <0.611.0>@gmt_hlog:handle_call:526 Write to "data/brick/hlog.commonLogServer" took 80 ms
2013-03-30 20:00:38.341 [info] <0.611.0>@gmt_hlog:handle_call:526 Write to "data/brick/hlog.commonLogServer" took 80 ms
2013-03-30 20:00:39.008 [info] <0.611.0>@gmt_hlog:handle_call:526 Write to "data/brick/hlog.commonLogServer" took 556 ms
2013-03-30 20:00:39.035 [info] <0.14204.27>@gmt_hlog:spawn_sync_asyncly:1357 commonLogServer sync was 654 msec for 1 callers
2013-03-30 20:00:39.375 [info] <0.14228.27>@gmt_hlog:spawn_sync_asyncly:1357 commonLogServer sync was 205 msec for 1 callers
2013-03-30 20:00:55.542 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: Updating locations for sequence 1 (true)
2013-03-30 20:00:56.777 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer sequence 1: UpRes ok, 38825 updates
2013-03-30 20:01:15.154 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: Updating locations for sequence 3 (true)
2013-03-30 20:01:17.499 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer sequence 3: UpRes ok, 38902 updates
2013-03-30 20:01:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330200123 [{1,3},{206,206},{2914352,2914352}]
2013-03-30 20:01:36.159 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: Updating locations for sequence 4 (true)
2013-03-30 20:01:38.169 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer sequence 4: UpRes ok, 38895 updates
2013-03-30 20:01:56.828 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: Updating locations for sequence 5 (true)
2013-03-30 20:01:58.315 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer sequence 5: UpRes ok, 39020 updates
2013-03-30 20:02:16.874 [info] <0.611.0>@gmt_hlog:handle_call:526 Write to "data/brick/hlog.commonLogServer" took 147 ms
2013-03-30 20:02:17.186 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: Updating locations for sequence 6 (true)
2013-03-30 20:02:18.744 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer sequence 6: UpRes ok, 39046 updates
2013-03-30 20:02:22.096 [info] <0.22524.27>@gmt_hlog:spawn_sync_asyncly:1357 commonLogServer sync was 3351 msec for 1 callers
2013-03-30 20:02:23.193 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330200223 [{1,3},{207,207},{1113712,2914352}]
2013-03-30 20:02:40.995 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: Updating locations for sequence 7 (true)
2013-03-30 20:02:42.385 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer sequence 7: UpRes ok, 39065 updates
2013-03-30 20:03:01.528 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: Updating locations for sequence 8 (true)
2013-03-30 20:03:02.753 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer sequence 8: UpRes ok, 39188 updates
2013-03-30 20:03:09.928 [info] <0.611.0>@gmt_hlog:handle_call:526 Write to "data/brick/hlog.commonLogServer" took 65 ms
2013-03-30 20:03:10.962 [info] <0.611.0>@gmt_hlog:handle_call:526 Write to "data/brick/hlog.commonLogServer" took 176 ms
2013-03-30 20:03:21.257 [info] <0.27351.27>@gmt_hlog:spawn_sync_asyncly:1357 commonLogServer sync was 241 msec for 1 callers
2013-03-30 20:03:21.446 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: Updating locations for sequence 9 (true)
2013-03-30 20:03:22.895 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer sequence 9: UpRes ok, 39232 updates
2013-03-30 20:03:23.193 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330200323 [{0,3},{207,207},{1113712,2914352}]
2013-03-30 20:03:40.726 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: Updating locations for sequence 10 (true)
2013-03-30 20:03:42.312 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer sequence 10: UpRes ok, 39267 updates
2013-03-30 20:03:59.104 [info] <0.611.0>@gmt_hlog:handle_call:526 Write to "data/brick/hlog.commonLogServer" took 62 ms
2013-03-30 20:04:00.083 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: Updating locations for sequence 11 (true)
2013-03-30 20:04:01.613 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer sequence 11: UpRes ok, 39334 updates
2013-03-30 20:04:02.303 [info] <0.30860.27>@gmt_hlog:spawn_sync_asyncly:1357 commonLogServer sync was 446 msec for 1 callers
2013-03-30 20:04:21.356 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: Updating locations for sequence 12 (true)
2013-03-30 20:04:21.751 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330200423 [{1,3},{205,207},{1114896,2914352}]
2013-03-30 20:04:22.715 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer sequence 12: UpRes ok, 39366 updates
2013-03-30 20:04:22.716 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 SCAV: commonLogServer finished step 10
2013-03-30 20:04:22.716 [info] <0.7654.27>@brick_ets:make_info_log_fun:3561 Scavenger finished:
Options: [destructive,{skip_live_percentage_greater_than,90},{work_dir,"/scav/hibari4_scavenger"}]
Logs deleted without scavenging: 1 (12 bytes)
Logs scavenged: 11
Copied: hunks bytes errs = 430140 440463360 0
Reclaimed bytes = 136500433
2013-03-30 20:05:22.357 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330200523 [{1,3},{199,207},{1572376,2914352}]
2013-03-30 20:06:22.970 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330200623 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 20:07:23.200 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330200723 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:08:23.193 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330200823 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 20:09:23.201 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330200923 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:10:23.197 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330201023 [{1,3},{203,207},{1113712,2914352}]
2013-03-30 20:11:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330201123 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:12:23.195 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330201223 [{0,3},{200,207},{1113712,2914352}]
2013-03-30 20:13:23.201 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330201323 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:14:23.194 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330201423 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:15:23.199 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330201523 [{0,3},{202,207},{1113712,2914352}]
2013-03-30 20:16:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330201623 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:17:23.195 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330201723 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 20:18:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330201823 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 20:19:23.200 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330201923 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:20:23.197 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330202023 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 20:21:23.195 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330202123 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:22:23.195 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330202223 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 20:23:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330202323 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:24:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330202423 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:25:23.195 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330202523 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 20:26:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330202623 [{1,3},{198,207},{1113712,2914352}]
2013-03-30 20:27:23.196 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330202723 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:28:23.197 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330202823 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:29:23.197 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330202923 [{0,3},{200,207},{1113712,2914352}]
2013-03-30 20:30:23.197 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330203023 [{0,3},{202,207},{1113712,2914352}]
2013-03-30 20:31:23.195 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330203123 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:32:23.195 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330203223 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:33:23.196 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330203323 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:34:23.200 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330203423 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:35:23.195 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330203523 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:36:23.198 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330203623 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:36:48.518 [info] <0.27121.32>@gmt_hlog:spawn_sync_asyncly:1357 commonLogServer sync was 1113 msec for 1 callers
2013-03-30 20:37:23.193 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330203723 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 20:38:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330203823 [{0,3},{200,207},{1113712,2914352}]
2013-03-30 20:39:23.198 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330203923 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:40:23.198 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330204023 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:41:23.196 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330204123 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:42:23.194 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330204223 [{0,3},{200,207},{1113712,2914352}]
2013-03-30 20:43:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330204323 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:44:23.199 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330204423 [{0,3},{203,207},{1113712,2914352}]
2013-03-30 20:45:23.194 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330204523 [{0,3},{200,207},{1113712,2914352}]
2013-03-30 20:46:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330204623 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:47:23.194 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330204723 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:48:23.193 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330204823 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 20:49:23.199 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330204923 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:50:23.196 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330205023 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 20:51:23.198 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330205123 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:52:23.199 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330205223 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:53:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330205323 [{0,3},{200,207},{1113712,2914352}]
2013-03-30 20:54:23.196 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330205423 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:55:23.194 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330205523 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:56:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330205623 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 20:57:23.201 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330205723 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 20:58:12.566 [info] <0.1358.36>@gmt_hlog:spawn_sync_asyncly:1357 commonLogServer sync was 453 msec for 1 callers
2013-03-30 20:58:23.193 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330205823 [{0,3},{200,207},{1113712,2914352}]
2013-03-30 20:59:15.233 [info] <0.611.0>@gmt_hlog:handle_call:526 Write to "data/brick/hlog.commonLogServer" took 79 ms
2013-03-30 20:59:23.200 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330205923 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 21:00:23.196 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330210023 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 21:01:23.192 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330210123 [{1,3},{198,207},{1113712,2914352}]
2013-03-30 21:02:23.198 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330210223 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 21:03:23.200 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330210323 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 21:04:23.194 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330210423 [{1,3},{202,207},{1113712,2914352}]
2013-03-30 21:05:23.201 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330210523 [{1,3},{200,207},{1113712,2914352}]
2013-03-30 21:05:48.045 [warning] <0.6250.37>@brick_ets:bigdata_dir_get_val:2925 stability_test_ch2_b3: read error {error,enoent} at seq 12 offset 12913308 the stored value of key <<51,51,47,0,11,75,39>>
2013-03-30 21:05:48.046 [warning] <0.610.0>@brick_ets:sequence_file_is_bad_common:3025 Brick commonLogServer bad sequence file: seq 12 offset 12913308: rename data/brick/hlog.commonLogServer/s/000000000012.HLOG or data/brick/hlog.commonLogServer/1/6/-000000000012.HLOG to data/brick/hlog.commonLogServer/BAD-CHECKSUM/12.at.offset.12913308 failed: enoent
2013-03-30 21:05:48.048 [notice] <0.744.0>@brick_ets:purge_recs_by_seqnum:2006 stability_test_ch2_b3: purging keys with sequence 12, size 92306
2013-03-30 21:05:48.049 [notice] <0.716.0>@brick_ets:purge_recs_by_seqnum:2006 stability_test_ch4_b1: purging keys with sequence 12, size 294398
2013-03-30 21:05:48.051 [notice] <0.730.0>@brick_ets:purge_recs_by_seqnum:2006 stability_test_ch3_b2: purging keys with sequence 12, size 220454
2013-03-30 21:05:48.091 [notice] <0.744.0>@brick_ets:purge_rec_from_log:2029 stability_test_ch2_b3: purged a key from log. key <<51,51,47,0,11,75,39>>
2013-03-30 21:05:48.127 [notice] <0.744.0>@brick_server:do_common_log_sequence_file_is_bad:5555 common_log_sequence_file_is_bad: stability_test_ch2_b3: purged 1 keys for log 12
2013-03-30 21:05:48.151 [notice] <0.730.0>@brick_ets:purge_rec_from_log:2029 stability_test_ch3_b2: purged a key from log. key <<52,48,47,0,1,140,130>>
2013-03-30 21:05:48.247 [notice] <0.730.0>@brick_server:do_common_log_sequence_file_is_bad:5555 common_log_sequence_file_is_bad: stability_test_ch3_b2: purged 1 keys for log 12
2013-03-30 21:05:48.324 [notice] <0.716.0>@brick_server:do_common_log_sequence_file_is_bad:5555 common_log_sequence_file_is_bad: stability_test_ch4_b1: purged 0 keys for log 12
2013-03-30 21:05:48.708 [info] <0.615.0>@brick_mboxmon:handle_info:152 node 'basho_bench@127.0.0.1' down Extra [{nodedown_reason,connection_closed},{node_type,visible}]
2013-03-30 21:05:50.402 [info] <0.530.0> alarm_handler: {set,{{disk_error,{stability_test_ch3_b2,'hibari4@127.0.0.1'}},"Administrator intervention is required."}}
2013-03-30 21:05:50.487 [info] <0.530.0> alarm_handler: {set,{{disk_error,{stability_test_ch2_b3,'hibari4@127.0.0.1'}},"Administrator intervention is required."}}
2013-03-30 21:05:50.626 [info] <0.624.0>@brick_simple_client:handle_call:290 brick_simple(set_gh): minor_rev=38
2013-03-30 21:05:50.640 [info] <0.624.0>@brick_simple_client:handle_call:290 brick_simple(set_gh): minor_rev=39
2013-03-30 21:05:50.717 [info] <0.624.0>@brick_simple_client:handle_call:290 brick_simple(set_gh): minor_rev=40
2013-03-30 21:05:50.730 [info] <0.624.0>@brick_simple_client:handle_call:290 brick_simple(set_gh): minor_rev=41
2013-03-30 21:06:23.196 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330210623 [{1,3},{166,207},{1113712,2914352}]
2013-03-30 21:07:23.196 [info] <0.596.0>@congestion_watcher_server:do_mark:233 mark: 20130330210723 [{1,3},{139,207},{1113712,2914352}]
hibari-smartos1% sm-summary
* Gathering SmartMachine summary..
SM UUID 78c849af-be16-4463-a90a-4e85bbdef5eb
SM ID 1
Hostname hibari-smartos1
SmartOS build joyent_20130321T213641Z
Image base64 1.9.1
Base Image NA
Documentation http://wiki.joyent.com/jpc2/SmartMachine+Base
Pkgsrc http://pkgsrc.joyent.com/packages/SmartOS/2012Q4/x86_64/All
Processes 71
Memory (RSS) Cap 2048M
Memory (RSS) Used 1160M
Memory (RSS) Free 888M
Swap Cap 2048M
Swap Used 1113M
/tmp Used 16K
Disk Quota 21G
% Disk Used 24%
hibari-smartos1% cat /opt/hibari/hibari1/hibari/log/erlang.log.1
=====
===== LOGGING STARTED Sat Mar 30 16:54:16 CST 2013
=====
Exec: /opt/hibari/hibari1/hibari/erts-5.9.2/bin/erlexec -boot /opt/hibari/hibari1/hibari/releases/0.3.0/hibari -mode embedded -config /opt/hibari/hibari1/hibari/releases/0.3.0/sys.config -args_file /opt/hibari/hibari1/hibari/releases/0.3.0/vm.args -- console
Root: /opt/hibari/hibari1/hibari
Erlang R15B02 (erts-5.9.2) [source] [64-bit] [smp:8:8] [async-threads:64] [kernel-poll:true] [dtrace]
hibari-smartos1% prstat -Z
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
41322 hibari 94M 79M sleep 59 0 0:10:15 1.5% beam.smp/30
40844 hibari 67M 51M sleep 59 0 0:06:10 0.8% beam.smp/93
40881 hibari 62M 45M sleep 58 0 0:04:25 0.6% beam.smp/94
40920 hibari 56M 42M sleep 55 0 0:04:02 0.6% beam.smp/94
40960 hibari 58M 42M sleep 51 0 0:03:44 0.5% beam.smp/93
41466 root 4472K 3728K cpu22 59 0 0:00:01 0.0% prstat/1
40842 hibari 4084K 2064K sleep 59 0 0:00:00 0.0% run_erl/1
40982 hibari 3972K 2548K sleep 1 0 0:00:00 0.0% inet_gethost/1
40964 hibari 3972K 2548K sleep 55 0 0:00:00 0.0% inet_gethost/1
40959 hibari 4084K 2064K sleep 56 0 0:00:00 0.0% run_erl/1
41222 hibari 4000K 2484K sleep 59 0 0:00:00 0.0% gtail/1
76138 hibari 5048K 3404K sleep 59 0 0:00:00 0.0% ksh93/1
65581 root 2476K 1572K sleep 59 0 0:00:00 0.0% pfexecd/3
65719 root 6612K 3292K sleep 59 0 0:00:00 0.0% inetd/3
76187 root 5524K 3180K sleep 59 0 0:00:00 0.0% sshd/1
40928 hibari 4036K 2000K sleep 59 0 0:00:00 0.0% inet_gethost/1
76134 hibari 7696K 4464K sleep 59 0 0:00:00 0.0% sshd/1
41230 hibari 4000K 2484K sleep 59 0 0:00:00 0.0% gtail/1
77086 hibari 5048K 3436K sleep 59 0 0:00:00 0.0% ksh93/1
76194 hibari 5048K 3404K sleep 59 0 0:00:00 0.0% ksh93/1
65705 root 1864K 1296K sleep 59 0 0:00:00 0.0% cron/1
40879 hibari 4084K 2064K sleep 59 0 0:00:00 0.0% run_erl/1
76863 hibari 5048K 3404K sleep 59 0 0:00:00 0.0% ksh93/1
41190 hibari 4000K 2484K sleep 59 0 0:00:00 0.0% gtail/1
65538 netadm 4568K 3276K sleep 59 0 0:00:00 0.0% ipmgmtd/3
76050 hibari 3332K 2452K sleep 59 0 0:00:00 0.0% bash/1
75788 root 7696K 4464K sleep 59 0 0:00:00 0.0% sshd/1
65717 root 2160K 1312K sleep 59 0 0:00:00 0.0% sac/1
75813 root 3340K 2456K sleep 59 0 0:00:00 0.0% bash/1
76145 hibari 3332K 2452K sleep 59 0 0:00:00 0.0% bash/1
41343 hibari 4036K 2000K sleep 1 0 0:00:00 0.0% inet_gethost/1
75787 root 5524K 3180K sleep 59 0 0:00:00 0.0% sshd/1
77066 hibari 6700K 4932K sleep 1 0 0:00:09 0.0% tmux/1
65456 root 13M 12M sleep 59 0 0:00:05 0.0% svc.configd/14
77099 hibari 3340K 2460K sleep 59 0 0:00:00 0.0% bash/1
76030 root 5524K 3180K sleep 59 0 0:00:00 0.0% sshd/1
76040 hibari 5048K 3404K sleep 59 0 0:00:00 0.0% ksh93/1
76795 hibari 7696K 4464K sleep 59 0 0:00:00 0.0% sshd/1
65722 root 1624K 944K sleep 59 0 0:00:00 0.0% utmpd/1
ZONEID NPROC SWAP RSS MEMORY TIME CPU ZONE
139 71 646M 456M 0.4% 0:28:57 3.9% 4f80810f-1aa5-4df3-9043-a625c327094d
Total: 71 processes, 540 lwps, load averages: 0.94, 0.91, 0.82
Approach 2. Track rename operations during the scavenger process. While the scavenger is running, record rename info (key: OldKey, value: NewKey) in a new ETS table (e.g. key <<"a/1">>, value <<"a/2">>). If the scavenger fails step 3 for key <<"a/1">>, it reads the ETS table to get key <<"a/2">> and retries step 3. This seems the most efficient approach, as all additional steps are in-memory operations. However, there is a chance that the scavenger will never catch up with the rename operations if <<"a/2">> is renamed to <<"a/3">>, then to <<"a/4">>, and so on very quickly.
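A minimal sketch of Approach 2, under the assumptions above. The rename_track module, record_rename/2, and resolve_key/1 are illustrative names, not existing Hibari functions:

```erlang
%% Illustrative sketch of Approach 2 (not actual Hibari code).
%% While the scavenger runs, the rename path records OldKey -> NewKey in an
%% ETS table; when the scavenger fails step 3 for a key, it follows the chain
%% of renames and retries with the latest key.
-module(rename_track).
-export([init/0, record_rename/2, resolve_key/1]).

init() ->
    ets:new(?MODULE, [named_table, public, set]).

%% Called from the rename path while the scavenger is running.
record_rename(OldKey, NewKey) ->
    ets:insert(?MODULE, {OldKey, NewKey}).

%% Called by the scavenger when step 3 fails for Key.
%% Follows OldKey -> NewKey chains; gives up after 10 hops so a rapidly
%% renamed key cannot make the scavenger loop forever.
resolve_key(Key) ->
    resolve_key(Key, 10).

resolve_key(Key, 0) ->
    {error, too_many_renames, Key};
resolve_key(Key, HopsLeft) ->
    case ets:lookup(?MODULE, Key) of
        [{Key, NewKey}] -> resolve_key(NewKey, HopsLeft - 1);
        []              -> {ok, Key}
    end.
```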
Thinking about two cases: (1) the brick server goes down while the scavenger is running, and (2) the scavenger is suspended while the brick server keeps running.
Case 1 should be okay, because the scavenger will be restarted from the first step after the brick server is recovered. Case 2 will be a bit problematic; the brick server would have to keep recording rename operations to the ETS table while the scavenger is suspended.
Perhaps a tombstone-like approach on the old key would be more robust? I'll explain it in the next comment.
Approach 4. Tombstone-like moved marker on the old key.
Currently, the rename/6 operation runs delete/4 with testset on the old key at the end. Instead, this approach makes rename/6 run set/6 with testset and a custom property (and empty data) on the old key, to indicate that this key has been renamed to the new key. This write will be recorded in the write-ahead log just like other write operations, and rename/6 can take this behavior regardless of whether the scavenger is running or suspended, to be sure there are no race conditions.
All read (get/6 and get_many/4) and write (set/6 etc.) operations should be aware of this property and handle the old key as deleted (as should the brick repair process). The scavenger will use the custom property to track rename operations, and it runs delete/4 with testset in the last step of scavenging.
This approach will address the "suspending/resuming the scavenger" issue, but it still has the "will never catch up with rename operations" issue, so the scavenger should have a max retry count for its attempts.
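A minimal sketch of the read-path idea behind Approach 4. The moved_to property name and classify_read/2 are assumptions for illustration, not existing Hibari code:

```erlang
%% Illustrative sketch of Approach 4 (not actual Hibari code).
%% rename/6 would leave the old key in place with empty data and a custom
%% property such as {moved_to, NewKey}.  Readers treat such a key as deleted,
%% while the scavenger follows the property to find where the data went before
%% finally deleting the old key with delete/4 + testset.
-spec classify_read(Value :: binary(), Props :: [{atom(), term()}]) ->
          {ok, binary()} | {moved, binary()}.
classify_read(Value, Props) ->
    case lists:keyfind(moved_to, 1, Props) of
        {moved_to, NewKey} ->
            {moved, NewKey};   %% clients see the old key as deleted
        false ->
            {ok, Value}
    end.
```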
Getting ready for fix; refactoring etc.
gdss-admin (on a topic branch) - commit https://github.com/hibari/gdss-admin/commit/512bc92cc3d32ad494e3349ea557a4b4f7ec49ab
hibari >> gh33 (bad sequence on common log deleted by the scavenger):
- Update brick_admin:fast_sync_scav/6 to use the renamed bottom_fun field of #scav{}
gdss-brick (on a topic branch) - commit https://github.com/hibari/gdss-brick/commit/90b3a70de1fc2a09666a35de6cc17b7edf056dd5
hibari >> gh33 (bad sequence on common log deleted by the scavenger):
- Temporary changes
- For debugging, temporarily make the scavenger run every two minutes
- Refactoring the scavenger code
- Break down the gigantic gmt_hlog_common:scavenger_commonlog/1 function into five smaller functions
- Rename gmt_hlog_common:scavenger_commonlog_bottom10/7 to scavenger_commonlog_bottom/2
- Update #scav{}; add dead_paths, dead_seq_bytes, and live_seq_bytes fields, and rename phase10_fun field to bottom_fun
- Remove parameters like TmpPath and Finfolog from various scavenger functions because they are also in the first parameter #scav{}
- Change variable names, for example, from cryptic Tmp4List to LiveHunkSizesOrderBySeq
- Delete code used only for debugging, e.g. os:cmd("find -name ...") and writing a file with debug info. Instead, add some debug-level logging
- brick_ets module: Remove unused functions make_info_log_fun/1 and scavenge_one_seq_file_fun/4
- Change some of "==" and "/=" to "=:=" and "=/="
- Change some of "if X == true" and "if X == false" to "if X" and "if not X" when X is guaranteed to be boolean type
- Enhancements on the scavenging process
- Move the step that gets all eligible sequence numbers to the very top, so that the scavenger can quit when there are no such sequences
- Reorganize the contents of the scavenger resume file. It is no longer compatible with earlier versions
- Add more info and debug level logs
- Add more info-level logs to each step of the scavenging process
- Add info-level logs to the gmt_hlog module: create_new_log_file/2, safe_log_close_short/1, and safe_log_close_long/2
- Add ?WAIT_BEFORE_EXIT to ensure error logs are written before exit
- Use ?E_INFO(Fmt, Args) instead of the logger from brick_ets:make_info_log_fun/1
- Change ?E_DGB(Cat, Fmt, Args) to ?E_DBG(Fmt, Args)
Getting ready for fix; refactoring etc.
gdss-admin (on a topic branch) - commit https://github.com/hibari/gdss-admin/commit/0f2c65ffcd0c4a4ee7d0767e9729a7c267894f11
hibari >> gh33 (bad sequence on common log deleted by the scavenger):
- Update brick_admin:fast_sync_scav/6 not to set log_fun in #scav{} record. (That function is not used in gdss-admin anyway)
gdss-brick (on a topic branch) - commit https://github.com/hibari/gdss-brick/commit/8cf9850fefc548277e5facee773cd83658fe7eba
hibari >> gh33 (bad sequence on common log deleted by the scavenger):
- Refactoring hlog, hlog_common, hlog_local, and brick_ets modules:
- Update #scav{} in brick.hrl: remove log_fun field, add type spec on each field
- Reorganize hlog deletion code in the checkpoint process
- Add specs to some functions
- Move count_live_bytes_in_log/1 from brick_ets module to gmt_hlog_common
- Delete delete_seq/2 from the brick_ets module. Add a similar function (but with a more informative info-level log) to the gmt_hlog_common module
- Change some of "if" expressions to "case"
- Change some of "==" and "/=" to "=:=" and "=/=" when possible
- Delete debug prints like io:format/2
- Add more info and debug level logs
- Add file size information to various info level logs
- Add info level logs for local hlog creation and deletion
- Add info and debug level logs to the checkpoint process
- Disable ?DBG_TLOG logs in write_back functions. Will update them later with more helpful debug messages
Getting ready for fix; scavenging process enhancement.
gdss-admin (on a topic branch) - commit https://github.com/hibari/gdss-admin/commit/a1d7b2fd674321426495f15e1272d94299757c97
hibari >> gh33 (bad sequence) and gh34 (scavenger enhancements):
- Update the brick_admin module to work with the #scav{} record now that the destructive field has been deleted
gdss-brick (on a topic branch) - commit https://github.com/hibari/gdss-brick/commit/70c13cb1f1bf488fcdb41358939441cf6aa31182
hibari >> gh33 (bad sequence) and gh34 (scavenger enhancements):
- Enhancements on the scavenging process:
- Check eligible sequence numbers before creating their work files, so that the scavenger can avoid creating and sorting the contents of unnecessary work files
- Optimize the steps for (1) copying live hunks to the latest hlog and (2) updating their locations, so that we can eliminate the resuming-scavenger feature (see the sketch after this list):
  - Before: run (1) on all live hunks (e.g. 100 thousand), then run (2) on all copied hunks
  - After: run (1) on a chunk of live hunks (e.g. 100), then run (2) on those hunks, go back to (1) and repeat. The chunk size is controlled by the disk_log module in Erlang/OTP
- Remove resume_scavenger_commonlog/1 and related functions
- Remove the 'destructive' and 'skip_reads' fields. They only appeared to be useful for debugging the scavenging process
- Refactoring hlog_common and brick_ets modules:
- Update #scav{}: remove the update_locations field; we never switch that function
- Rewrite "try catch" expression and their child "begin end" and "if" expressions in brick_ets:copy_one_hunk/6 to be a proper "try catch" syntax
- Add more info level logs:
- Add file size, copied hunks, and copied bytes information to some info level logs
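A rough sketch of the chunked copy-then-update loop described in the list above. The module name, function names, the fixed chunk size, and the fun arguments are all illustrative; in the real code the chunk size comes from the disk_log module rather than a constant:

%% chunked_scav_sketch.erl -- hypothetical illustration only
-module(chunked_scav_sketch).
-export([copy_and_update_in_chunks/3]).

%% (1) copy a chunk of live hunks, (2) update their key locations, repeat.
copy_and_update_in_chunks([], _CopyFun, _UpdateFun) ->
    ok;
copy_and_update_in_chunks(LiveHunks, CopyFun, UpdateFun) ->
    {Chunk, Rest} = take_chunk(LiveHunks, 100),
    Copied = [CopyFun(Hunk) || Hunk <- Chunk],  %% (1) copy this chunk to the latest hlog
    ok = UpdateFun(Copied),                     %% (2) update locations for this chunk only
    copy_and_update_in_chunks(Rest, CopyFun, UpdateFun).

take_chunk(List, N) when length(List) =< N ->
    {List, []};
take_chunk(List, N) ->
    lists:split(N, List).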
Getting ready for fix; scavenger process enhancement.
gdss-brick (on a topic branch) - commit https://github.com/hibari/gdss-brick/commit/7749814bffab67e297d77aa1d9dd2aa5caf7c027
hibari >> gh33 (bad sequence) and gh34 (scavenger enhancements):
- Enhancements on the scavenging process:
- The fun created by gmt_hlog_common:scavenge_one_seq_file_fun/1 now handles unexpected errors so that the scavenging process can continue with the other seq files (see the sketch after this list).
- gmt_hlog_common:scavenger_move_hunks1/8 now tries to continue moving live hunks even if some errors have occurred.
- gmt_hlog_common:scav_check_shutdown/1 now reports a summary of a canceled scavenging process (done by adding #scav_progress{} and scav_print_summary/2). It now deletes the work files too.
- scavenger_commonlog_bottom/1 now skips copying live hunks when there are no such hunks
- A bug fix in write_hunk_bytes of gmt_hlog_local:
- Fix a potential bug in handle_call/3 for 'write_hunk_bytes' with 'metadata', which caused it to return a wrong offset. (The caller doesn't use the returned value, so this bug had no actual impact)
- Refactoring hlog_common and brick_ets modules:
- Rename type eee() in gmt_hlog.hrl to metadata_tuple()
- Move the expression that creates a metadata_tuple() from gmt_hlog_local:handle_call/3 into make_metadata_tuple/7 for clarity
- Add live_hunk_sizes field to #scav{} record
- Add a type called scavenger_move_hunks_result
- Add type do_mod() and some specs to brick_ets module.
- Remove unused 'relocate_*' fields from #wb{} record in gmt_hlog_common module
- Replace some "if" expressions with pattern matches
- Misc
- gdss-brick >> GH9 (refactor #state{} record) - Add some comments to #state{} record in brick_ets.erl
- Replace double-space with single-space in the @doc in the gmt_hlog_local module.
- Replace "GDSS" with "Hibari" in the @doc too.
OK. I'm much more comfortable now working on the code related to hlog (the write-ahead log) and the scavenging process. I want to work on one more scavenger enhancement, then I can proceed to the solution for this issue.
%% @TODO Run step 5 before step 3
%% - Move the part that scans for live hunk bytes from step 3 to 5
%% - Update step 5 to delete the work files for the sequences that
%%   have greater live hunk percentages than the threshold
%% - Move step 3 under step 5
I accidentally deleted a comment from 9 months ago while I was trying to quote a part of it. I am re-posting a part of the comment at the bottom of this comment, but I lost other parts :(
Anyway, during the holidays, I developed prototypes for approaches 2 and 3 below for some experiments.
A quote of the comment:
I reviewed the code ... ... The scavenger works in the following steps:
1. Lists the hlog locations of all value-hunks in all bricks and sorts them by the hlog's sequence number.
2. Calculates each hlog's live percentage and, if it is lower than the setting, copies all live value-hunks to the latest hlog.
3. Performs do operations to update the keys with the new hlog locations of the copied value-hunks.
4. Deletes the old hlogs whose value-hunks have all been moved to the latest hlog.
If key <<"a/1">> is renamed to <<"a/2">> by ?KEY_SWITCHAROO in between steps 1 and 3, step 3 for key <<"a/1">> will fail with key_not_exist. The current implementation of the scavenger silently ignores this, because before introducing rename/6, this only happened when the key <<"a/1">> had been deleted. This doesn't work with rename/6, as the value-hunk in the old hlog is still referenced from <<"a/2">>.

To address this, I think there are three approaches:
1. Don't delete the old hlog file. If the scavenger fails step 3 for key <<"a/1">>, skip step 4 (deleting the hlog file) for the old hlog holding the value-hunk for <<"a/1">>. That hlog will eventually be deleted in another scavenger run on the next day or so. This is safe but not good for reclaiming disk space.
2. Track rename operations during the scavenger process. While the scavenger is running, record rename info (key: OldKey, value: NewKey) in a new ETS table (e.g. key: <<"a/1">>, value: <<"a/2">>). If the scavenger fails step 3 for key <<"a/1">>, it reads the ETS table to get key <<"a/2">> and retries step 3. This seems the most efficient approach, as all the additional steps are in-memory operations. However, there is a chance that the scavenger will never catch up with the rename operations if <<"a/2">> is renamed to <<"a/3">>, then to <<"a/4">>, and so on very quickly.
3. Don't use ?KEY_SWITCHAROO during the scavenger process. While the scavenger is running, rename/6 avoids doing ?KEY_SWITCHAROO and deleting the old key; instead, it gets the value of the old key, sets the value to the new key, and deletes the old key. However, doing get and set on a value in the hlog (on disk) is much slower than the in-memory ?KEY_SWITCHAROO and delete operations. Also, the value-hunk will be copied twice to the latest hlog: once by the scavenger, and once more by rename/6.

Approach 2 seems a better solution than the others, but it can't be done quickly, as extra investigation will be needed to avoid race conditions etc. Approaches 1 and 3 can be used as a quick temporary fix (3 can be a permanent fix too if 2 is technically difficult).
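To make approach 2 from the quoted comment above concrete, here is a small sketch of what tracking renames in an ETS table could look like. The table name and functions are hypothetical (not gdss-brick code), and the sketch ignores race conditions and rename cycles:

%% rename_tracker.erl -- hypothetical sketch of approach 2
-module(rename_tracker).
-export([start/0, note_rename/2, resolve/1]).

start() ->
    ets:new(scav_renames, [named_table, public, set]).

%% Called from the rename path while a scavenger run is in progress.
note_rename(OldKey, NewKey) ->
    true = ets:insert(scav_renames, {OldKey, NewKey}),
    ok.

%% When step 3 fails with key_not_exist, follow OldKey -> NewKey -> ...
%% until no further rename is recorded, then retry step 3 on that key.
resolve(Key) ->
    case ets:lookup(scav_renames, Key) of
        [{Key, NewKey}] -> resolve(NewKey);
        []              -> Key
    end.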
Closing this issue as I already addressed it for Hibari v0.1.11. rename/6 acts as if log files are always frozen. For more details, see: https://github.com/hibari/gdss-brick/issues/2

rename/6 will act as the approach 3 below. This will be done by hibari/gdss-brick #17 Redesign disk storage and checkpoint/scavenger processes.
- Approach 3 worked as expected, and I decided to take this approach. I introduced a frozen state for log files; if a log file is frozen, rename/6 avoids doing ?KEY_SWITCHAROO and deleting the old key; instead, it loads the value of the old key from disk, sets that value to the new key, and deletes the old key. My prototype dynamically injects a get op into the do-op list to trigger the squidflash primer to asynchronously load the value from disk into the OS disk cache, so it has minimal impact on performance.
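As a rough sketch of that frozen-log rename behaviour (everything below is a placeholder, not the actual brick_server/brick_ets API; how the frozen state of the relevant hlog file is looked up is elided and passed in as a flag):

%% frozen_rename_sketch.erl -- hypothetical illustration only
-module(frozen_rename_sketch).
-export([rename_key/3]).

%% Frozen is the frozen/not-frozen state of the hlog file that currently
%% holds OldKey's value-hunk.
rename_key(OldKey, NewKey, Frozen) ->
    case Frozen of
        true ->
            %% Frozen log: copy the value forward to the latest hlog instead of
            %% switching pointers, so the hunk in the old hlog is no longer needed.
            {ok, Val} = do_get(OldKey),
            ok = do_set(NewKey, Val),
            ok = do_delete(OldKey);
        false ->
            %% Normal path: in-memory key switch, no value copy.
            ok = do_switcharoo(OldKey, NewKey)
    end.

%% Placeholders standing in for the real brick do operations.
do_get(_Key) -> {ok, <<"value">>}.
do_set(_Key, _Val) -> ok.
do_delete(_Key) -> ok.
do_switcharoo(_OldKey, _NewKey) -> ok.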
This is a blocker issue.
This error occurred while running an 8-hour stability test against a 4-node Hibari 0.3.0 RC (Release Candidate) cluster. All Hibari nodes were running within one Solaris zone-based SmartOS machine at Joyent Cloud. (Machine type: smartmachine medium, 4 GB RAM, 4 vCPUs)
Summary:
At 04:00, node hibari1@127.0.0.1 got a bad sequence file error (enoent) on a common log file with sequence 1. That file had been deleted by the scavenger between 03:01 and 03:11.

hibari1@127.0.0.1 was hosting 3 logical bricks and reported that a total of 2 keys were purged due to the absence of log 1. It put the affected bricks in read-only mode, and other brick servers took over the roles of the affected bricks, so there was no data loss in this particular case.

The scavenger process started at 03:00 and finished at 03:11. It didn't report any problems, including while updating locations for log 1.
Conditions:
syncwrites = true
{brick_scavenger_start_time, "03:00"}
{brick_scavenger_temp_dir,"/scav/hibari1_scavenger"}
{brick_scavenger_throttle_bytes, 2097152}
{brick_skip_live_percentage_greater_than, 90}