vmware / splinterdb

High Performance Embedded Key-Value Store
https://splinterdb.org
Apache License 2.0

driver_test splinter_test --delete runs into Assertion failed src/trunk.c:1832:trunk_get_new_bundle(): "(hdr->end_bundle != hdr->start_bundle)" #310

Open gapisback opened 2 years ago

gapisback commented 2 years ago

Reproduces on /main:

Fusion-LocalVM:[22] $ ./bin/driver_test splinter_test --delete --max-async-inflight 20 --num-insert-threads 20 --num-lookup-threads 10
./bin/driver_test: splinterdb_build_version a2768f5d
Dispatch test splinter_test
fingerprint_size: 27
Bumped up IO queue size to 416
Bumped up IO queue size to 416
Running splinter_test with 1 caches
splinter_test: splinter deletion test started with 1                tables
inserting  20% complete for table 0Assertion failed at src/trunk.c:1832:trunk_get_new_bundle(): "(hdr->end_bundle != hdr->start_bundle)". page disk_addr=393216, end_bundle=3, start_bundle=3
Aborted (core dumped)

Also ran into the same assertion when running driver_test splinter_test --seq-perf

Fusion-LocalVM:[30] $ ./bin/driver_test splinter_test --seq-perf --max-async-inflight 20 --num-insert-threads 20 --num-lookup-threads 10
./bin/driver_test: splinterdb_build_version a2768f5d-dirty
Dispatch test splinter_test
fingerprint_size: 27
Bumped up IO queue size to 416
Bumped up IO queue size to 416
Running splinter_test with 1 caches
splinter_test: splinter performance test started with 1                tables
inserting  16% complete for table 0Assertion failed at src/trunk.c:1832:trunk_get_new_bundle(): "(hdr->end_bundle != hdr->start_bundle)". page disk_addr=393216, end_bundle=10, start_bundle=10
Aborted (core dumped)

Also ran into the same issue when running splinter_test --perf with a higher disk capacity (--db-capacity-gib 60). At lower capacities, the test runs into out-of-space errors instead.

Fusion-LocalVM:[49] $ ./bin/driver_test splinter_test --perf --max-async-inflight 10 --num-insert-threads 20 --num-lookup-threads 10 --num-range-lookup-threads 5 --db-capacity-gib 60
./bin/driver_test: splinterdb_build_version a2768f5d-dirty
Dispatch test splinter_test
fingerprint_size: 27
Running splinter_test with 1 caches
splinter_test: splinter performance test started with 1                tables
inserting  66% complete for table 0Assertion failed at src/trunk.c:1832:trunk_get_new_bundle(): "(hdr->end_bundle != hdr->start_bundle)". page disk_addr=655360, end_bundle=6, start_bundle=6
Aborted (core dumped)
rosenhouse commented 2 years ago

When I run this same command, I hit a different assertion:

$ ./bin/driver_test splinter_test --delete --max-async-inflight 20 --num-insert-threads 20 --num-lookup-threads 10
./bin/driver_test: splinterdb_build_version 9f8d5bc5
Dispatch test splinter_test
fingerprint_size: 27
Bumped up IO queue size to 416
Bumped up IO queue size to 416
Running splinter_test with 1 caches
splinter_test: splinter deletion test started with 1                tables
inserting  23% complete for table 0driver_test: tests/test_data.c:139: message_type test_data_message_class(const data_config *, uint64, const void *): Assertion `sizeof(data_handle) <= raw_data_len' failed.
[1]    46435 IOT instruction (core dumped)  ./bin/driver_test splinter_test --delete --max-async-inflight 20  20  10
rosenhouse commented 2 years ago

Oh, actually I filled the disk.... huh.

carlosgarciaalvarado commented 2 years ago

@rosenhouse are all these related to a full disk?

rosenhouse commented 2 years ago

I don't know yet. Maybe.

ajhconway commented 2 years ago

I would like to understand more about what's going on here. Does this reproduce without async lookups? Is this a full-disk issue?

I don't really understand how this comes up. We have a check before flushing that this is an available bundle, so it would be helpful to have more context here.

If this reproduces without async lookups, I would consider this a critical bug.
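
For readers following the invariant behind the assertion: the failure message reports end_bundle equal to start_bundle. The sketch below shows how an availability check of this kind usually works, assuming the bundles are tracked as a circular range from start_bundle up to end_bundle over a fixed number of slots. This is an illustration under that assumption, not the code in src/trunk.c; `bundle_ring`, `MAX_BUNDLES`, and the `ring_*` helpers are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_BUNDLES 12 /* hypothetical slot count, not taken from SplinterDB */

/* Hypothetical stand-in for the bundle fields of a trunk header. */
typedef struct {
   uint16_t start_bundle; /* oldest live bundle */
   uint16_t end_bundle;   /* next slot to hand out */
} bundle_ring;

static uint16_t
ring_next(uint16_t i)
{
   return (uint16_t)((i + 1) % MAX_BUNDLES);
}

/* The "is a bundle available" check: claiming one more slot must not
 * make end_bundle catch up with start_bundle. */
static bool
ring_has_free_bundle(const bundle_ring *r)
{
   return ring_next(r->end_bundle) != r->start_bundle;
}

/* Claim the next slot, or return -1 when the ring is full. */
static int
ring_get_new_bundle(bundle_ring *r)
{
   if (!ring_has_free_bundle(r)) {
      return -1;
   }
   int slot = r->end_bundle;
   r->end_bundle = ring_next(r->end_bundle);
   return slot;
}

/* Retire the oldest bundle, if there is one. */
static void
ring_release_oldest(bundle_ring *r)
{
   if (r->start_bundle != r->end_bundle) {
      r->start_bundle = ring_next(r->start_bundle);
   }
}

/* Fill the ring, then free one slot; returns how many claims succeeded. */
static int
ring_demo(void)
{
   bundle_ring r = {0, 0};
   int claimed = 0;
   while (ring_get_new_bundle(&r) >= 0) {
      claimed++;
   }
   assert(!ring_has_free_bundle(&r)); /* full: end is one behind start */
   ring_release_oldest(&r);
   assert(ring_has_free_bundle(&r));
   return claimed;
}
```

In this model, end_bundle can only become equal to start_bundle after a claim if a slot was handed out without the availability check holding at that moment, which is why the extra context requested above matters.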

rosenhouse commented 2 years ago

When I run this test without async lookups, I cannot reproduce the above assertion.

For the --delete test without async, it all works fine until the very end, after stats are printed, when it hits a different error, some floating point exception:

$ ./bin/driver_test splinter_test --delete --max-async-inflight 0 --num-insert-threads 20 --num-lookup-threads 10 --db-capacity-gib 60 --stats
./bin/driver_test: splinterdb_build_version a8566beb
Dispatch test splinter_test
fingerprint_size: 27
Running splinter_test with 1 caches
splinter_test: splinter deletion test started with 1                tables
inserting  99% complete for table 0
per-splinter per-thread insert time per tuple 13443 ns
splinter total insertion rate: 1487689 insertions/second
...
[1]    442581 floating point exception (core dumped)  ./bin/driver_test splinter_test --delete --max-async-inflight 0  20  10  60

For the --perf test without async, everything runs fine, with no crashes:

./bin/driver_test splinter_test --perf --max-async-inflight 0 --num-insert-threads 20 --num-lookup-threads 10 --num-range-lookup-threads 5 --db-capacity-gib 60 --db-location /host-nvme0n1
./bin/driver_test: splinterdb_build_version a8566beb
Dispatch test splinter_test
fingerprint_size: 27
Running splinter_test with 1 caches
splinter_test: splinter performance test started with 1                tables
inserting  99% complete for table 0
per-splinter per-thread insert time per tuple 17980 ns
splinter total insertion rate: 1112336 insertions/second
splinter bandwidth: 0 megabytes/second
splinter max insert latency: 0 msec
Statistics are not enabled
Space used by level:
0:    35858MiB
1:     7172MiB
2:     1248MiB
3:      171MiB

lookups  99% complete for table 0
per-splinter per-thread lookup time per tuple 13552 ns
splinter total lookup rate: 737854 lookups/second
0% lookups were async
max lookup latency ns (sync=969144, async=0)
Statistics are not enabled
range lookups  99% complete for table 0
per-splinter per-thread range time per tuple 82831 ns
splinter total range rate: 60363 ops/second
Statistics are not enabled
range lookups  99% complete for table 0
per-splinter per-thread range time per tuple 227713 ns
splinter total range rate: 21957 ops/second
Statistics are not enabled
range lookups  99% complete for table 0
per-splinter per-thread range time per tuple 21525406 ns
splinter total range rate: 232 ops/second
Statistics are not enabled
After destroy:

root@2b37df4fb859:/splinterdb# echo $?
0
ajhconway commented 2 years ago

Yeah, I agree with @rosenhouse, and I'm going to drop the critical tag on this.