gapisback closed this issue 1 year ago.
NOTE: With an attempted fix applied for the reported issue in that function, re-running the new test case hits this new BTree-pack failure:
(gdb) run
Starting program: /home/agurajada/Code/splinterdb/build/debug/bin/unit/large_inserts_stress_test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Running 1 CTests, suite name 'large_inserts_stress', test case 'all'.
TEST 1/1 large_inserts_stress:test_issue_458_mini_destroy_unused_debug_assert btree_pack exceeded output size limit: req->num_tuples=6291456, req->max_tuples=6291456
btree_pack exceeded output size limit: req->num_tuples=6291456, req->max_tuples=6291456
This error is triggered by this code in btree_pack():
2985 while (SUCCESS(iterator_at_end(req->itor, &at_end)) && !at_end) {
2986 iterator_get_curr(req->itor, &tuple_key, &data);
2987 if (!btree_pack_can_fit_tuple(req, tuple_key, data)) {
2988 platform_error_log("%s(): req->num_tuples=%lu exceeded output size "
2989 "limit: req->max_tuples=%lu\n",
2990 __func__,
2991 req->num_tuples,
2992 req->max_tuples);
Curiously, the checker function btree_pack_can_fit_tuple() does this:

return req->num_tuples < req->max_tuples;

This looks like an off-by-1 error in the condition check.
Attn: @ajhconway, @rtjohnso -- just flagging this long-winded series of test-stabilization fix attempts I am working through in the background. (We should talk in person for me to make progress on this item.)
After fixing the original issue ("((addr % page_size) == 0)"), and with an attempted fix for what seems to be an off-by-1 error in the btree_pack_can_fit_tuple() check [see previous comment], re-running this single test case runs into this error from btree_pack():
TEST 1/1 large_inserts_stress:test_issue_458_mini_destroy_unused_debug_assert btree_pack(): req->num_tuples=6291457 exceeded output size limit: req->max_tuples=6291456
btree_pack(): req->num_tuples=6291457 exceeded output size limit: req->max_tuples=6291456
Debugging this, I confirmed that we only run into this from this code-flow:
Breakpoint 2, btree_pack (req=0x7fffffffbc90) at src/btree.c:2988
2988 platform_error_log("%s(): req->num_tuples=%lu exceeded output size "
(gdb) bt
#0 btree_pack (req=0x7fffffffbc90) at src/btree.c:2988
#1 0x00007ffff7f8479d in trunk_compact_bundle (arg=0x555557531680, scratch_buf=0x7ffff7adb040) at src/trunk.c:4797
#2 0x00007ffff7f727f7 in task_group_run_task (group=0x55555556cd40, assigned_task=0x555555d3c600) at src/task.c:481
#3 0x00007ffff7f730eb in task_group_perform_one (group=0x55555556cd40, queue_scale_percent=0) at src/task.c:712
#4 0x00007ffff7f73172 in task_perform_one_if_needed (ts=0x55555556a580, queue_scale_percent=0) at src/task.c:731
#5 0x00007ffff7f8a384 in trunk_insert (spl=0x7fffe6feb040, tuple_key=..., data=...) at src/trunk.c:6177
#6 0x00007ffff7f70f9c in splinterdb_insert_message (kvs=0x555555561c80, user_key=..., msg=...)
at src/splinterdb.c:470
#7 0x00007ffff7f71044 in splinterdb_insert (kvsb=0x555555561c80, user_key=..., value=...) at src/splinterdb.c:478
#8 0x000055555555991d in ctest_large_inserts_stress_test_issue_458_mini_destroy_unused_debug_assert_run (
data=0x555555560240 <ctest_large_inserts_stress_test_issue_458_mini_destroy_unused_debug_assert_data>)
at tests/unit/large_inserts_stress_test.c:81
#9 0x0000555555557f8f in ctest_main (argc=1, argv=0x7fffffffe758) at tests/unit/main.c:320
#10 0x0000555555557a34 in main (argc=1, argv=0x7fffffffe758) at tests/unit/main.c:159
And it's a valid failure condition:
(gdb) p req->num_tuples
$9 = 6291457
(gdb) p req->max_tuples
$10 = 6291456
There is already code to handle this failure and clean up; i.e.,
4617 void
4618 trunk_compact_bundle(void *arg, void *scratch_buf)
[...]
4797 platform_status pack_status = btree_pack(&pack_req);
4798 if (!SUCCESS(pack_status)) {
4799 platform_default_log("btree_pack failed: %s\n",
4800 platform_status_to_string(pack_status));
4801 trunk_compact_bundle_cleanup_iterators(
4802 spl, &merge_itor, num_branches, skip_itor_arr);
4803 btree_pack_req_deinit(&pack_req, spl->heap_id);
4804 platform_free(spl->heap_id, req);
4805 goto out;
4806 }
It seems that from L4799 onwards we do a clean back-out, so no stale BTree-pack state should be left behind.
This happens twice in this test case's run. Then we run into this hard assertion:
OS-pid=11463, OS-tid=11463, Thread-ID=0, Assertion failed at src/trunk.c:3543:trunk_inc_filter(): "filter->addr != 0".
Program received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737349731584) at ./nptl/pthread_kill.c:44
44 ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737349731584) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=140737349731584) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140737349731584, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007ffff7d10476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007ffff7cf67f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x000055555555b495 in platform_assert_false (filename=0x7ffff7fa477b "src/trunk.c", linenumber=3543,
functionname=0x7ffff7fa8e90 <__FUNCTION__.66> "trunk_inc_filter", expr=0x7ffff7fa4d95 "filter->addr != 0",
message=0x7ffff7fa3fca "") at src/platform_linux/platform.c:377
#6 0x00007ffff7f7f594 in trunk_inc_filter (spl=0x7fffe6feb040, filter=0x7ffff43622d2) at src/trunk.c:3543
#7 0x00007ffff7f82042 in trunk_flush_into_bundle (spl=0x7fffe6feb040, parent=0x7fffffffbc30, child=0x7fffffffbaa0,
pdata=0x7ffff419458c, req=0x555557531a40) at src/trunk.c:4102
#8 0x00007ffff7f82786 in trunk_flush (spl=0x7fffe6feb040, parent=0x7fffffffbc30, pdata=0x7ffff419458c,
is_space_rec=0) at src/trunk.c:4214
#9 0x00007ffff7f82d2d in trunk_flush_fullest (spl=0x7fffe6feb040, node=0x7fffffffbc30) at src/trunk.c:4295
#10 0x00007ffff7f83cf8 in trunk_compact_bundle (arg=0x555557531680, scratch_buf=0x7ffff7adb040) at src/trunk.c:4642
#11 0x00007ffff7f727f7 in task_group_run_task (group=0x55555556cd40, assigned_task=0x555555d3c600) at src/task.c:481
#12 0x00007ffff7f730eb in task_group_perform_one (group=0x55555556cd40, queue_scale_percent=0) at src/task.c:712
#13 0x00007ffff7f73172 in task_perform_one_if_needed (ts=0x55555556a580, queue_scale_percent=0) at src/task.c:731
#14 0x00007ffff7f8a384 in trunk_insert (spl=0x7fffe6feb040, tuple_key=..., data=...) at src/trunk.c:6177
#15 0x00007ffff7f70f9c in splinterdb_insert_message (kvs=0x555555561c80, user_key=..., msg=...)
at src/splinterdb.c:470
#16 0x00007ffff7f71044 in splinterdb_insert (kvsb=0x555555561c80, user_key=..., value=...) at src/splinterdb.c:478
#17 0x000055555555991d in ctest_large_inserts_stress_test_issue_458_mini_destroy_unused_debug_assert_run (
data=0x555555560240 <ctest_large_inserts_stress_test_issue_458_mini_destroy_unused_debug_assert_data>)
at tests/unit/large_inserts_stress_test.c:81
The failing assertion, trunk_inc_filter(): "filter->addr != 0", is very odd. It fires from this code block:
4000 trunk_bundle *
4001 trunk_flush_into_bundle(trunk_handle *spl, // IN
[...]
4095 child_sb->end_branch = trunk_end_branch(spl, child);
4096 for (uint16 i = 0; i < filter_count; i++) {
4097 routing_filter *child_filter =
4098 trunk_subbundle_filter(spl, child, child_sb, i);
4099 routing_filter *parent_filter =
4100 trunk_subbundle_filter(spl, parent, parent_sb, i);
4101 *child_filter = *parent_filter;
4102 trunk_inc_filter(spl, child_filter);
4103 }
On L4102, we are accessing a child_filter page that has not been set up properly.
Under the debugger we find:
(gdb) p child_filter
$12 = (routing_filter *) 0x7ffff43622d2
(gdb) p *child_filter
$13 = {addr = 0, meta_head = 0, num_fingerprints = 0, num_unique = 0, value_size = 0}
It is unclear why this is an uninitialized filter handle, or whether we are getting hold of some wrong data structure in the code above.
Curiously, parent_filter also appears 'bogus':
(gdb) p parent_filter
$14 = (routing_filter *) 0x7ffff419437a
(gdb) p *parent_filter
$15 = {addr = 0, meta_head = 0, num_fingerprints = 0, num_unique = 0, value_size = 0}
This will need further debugging.
The original issue reported in this issue (bug in routing_filter_prefetch()) has been fixed and integrated.
The sub-issues reported in this thread have been peeled off into their own standalone issue #570.
In /main [ @SHA 4b6e0b1c ], while trying to re-run and validate the repro created for issue #458 (branch agurajada/458-mini_destroy_unused-Assert; new test case large_inserts_stress_test.c), we run into the following assertion using the debug binary. The extended assertion reports the input page-address addr=1034551297, which is clearly wrong.

In the relevant code in /main, on L644, the addr + index_page_no arithmetic seems incorrect: we should multiply this index by the page-size.