chrisxu333 closed this issue 3 months ago
I just tried to reproduce this bug with the large_inserts_stress_test driver. After I add the O_DIRECT flag to the splinterdb_config in large_inserts_stress_test.c, the test hangs on large_inserts_stress:test_seq_keys_random_values_threaded most of the time, and occasionally on other multi-threaded test cases as well. Any idea what might cause this?
Hi, @chrisxu333 --
When you say: "When I perform concurrent insertion by calling splinterdb_insert(), each time I increase the thread count to be larger than 8, the splinterdb_insert() call seems to hang forever." ...
Do you have a stand-alone repro that you wrote on your own, or were you relying on reproducing this issue using large_inserts_stress_test.c?
Re: "After I add the O_DIRECT flag to the splinterdb_config in large_inserts_stress_test.c, the test will hang on..."
I suggest you not use this stress test and its sub-cases to reproduce the bug you are seeing.
That stress test is somewhat in flux. Many of its test cases work reliably, but some are currently incomplete and can lead to hangs or unpredictable behaviour.
I have another revision of this large test suite undergoing review, so until that open PR is addressed and integrated, please do not rely on this test suite as an exerciser to reproduce your problem.
Hi @gapisback, to answer your first question: yes, I'm running SplinterDB on a benchmark driver that I wrote myself. The reason I tried to reproduce on that stress test was to rule out any potential mistakes I might have made in my own driver, so that I could narrow down the actual cause of this bug to some extent.
So to rephrase the bug: when I run SplinterDB insertions under high concurrency (16 threads, for instance) and use O_DIRECT when I call splinterdb_create, the program will hang forever after some time.
I can repro with large_inserts_stress_test per your instructions.
It looks like some IO completions are not doing what they are supposed to. One deadlock had all threads complete except for one, which was waiting on the CC_WRITEBACK flag to be cleared on a page. Another had all threads complete except one, which was waiting on a req->busy flag to be cleared.
Will investigate. As @gapisback mentioned, one outcome of the investigation may be that the test is buggy. In that case it will be helpful to see the code you wrote, but let me try debugging it with large_inserts_stress_test first.
Thanks for the report.
Hi @rtjohnso, thanks for the help. Let me know if you need my code :)
@chrisxu333 -- can you check whether PR #621 fixes your issue?
@rtjohnso Yes, I just ran it with my code and it works perfectly :) Thanks for your help!
Fixed by #621 .
When I perform concurrent insertion by calling splinterdb_insert(), whenever I increase the thread count above 8, the splinterdb_insert() call seems to hang forever. I suspect that it may have something to do with deadlocks.
Config setup:

```c
.cache_size = 2 Giga,
.disk_size  = 64 Giga,
.data_cfg   = &data_cfg,
.use_shmem  = FALSE,
.io_flags   = O_RDWR | O_CREAT | O_DIRECT,
```
Data config setup follows the default by calling default_data_config_init with a key size of 8:

```c
.max_key_size       = 8,
.key_compare        = key_compare,
.key_hash           = platform_hash32,
.merge_tuples       = NULL,
.merge_tuples_final = NULL,
.key_to_string      = key_to_string,
.message_to_string  = message_to_string,
```
Note that when I turn off O_DIRECT, everything works fine and it no longer hangs.