vmware / splinterdb

High Performance Embedded Key-Value Store
https://splinterdb.org
Apache License 2.0

splinterdb_insert() hang under concurrent insertion #620

Closed chrisxu333 closed 3 months ago

chrisxu333 commented 3 months ago

When I perform concurrent insertion by calling splinterdb_insert(), each time I increase the thread count to be larger than 8, the splinterdb_insert() call seems to hang forever. I suspect that it may have something to do with deadlocks.

Config setup:

.cache_size = 2 * Giga, .disk_size = 64 * Giga, .data_cfg = &data_cfg, .use_shmem = FALSE, .io_flags = O_RDWR | O_CREAT | O_DIRECT,

The data config follows the defaults set by calling default_data_config_init with a key size of 8:

.max_key_size = 8, .key_compare = key_compare, .key_hash = platform_hash32, .merge_tuples = NULL, .merge_tuples_final = NULL, .key_to_string = key_to_string, .message_to_string = message_to_string,

Note that when I turn off O_DIRECT, everything works fine and it no longer hangs.
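Since the hang only shows up with O_DIRECT, it can help to flip that one flag from the same driver binary. The helper below is a minimal sketch of that idea; `make_io_flags` is a hypothetical name, not part of SplinterDB, and the only thing assumed about SplinterDB is the `.io_flags` field shown in the config above.

```c
#define _GNU_SOURCE /* exposes O_DIRECT in <fcntl.h> on Linux/glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not a SplinterDB API): build the value for the
 * .io_flags config field, with O_DIRECT switchable at run time so the
 * hanging and non-hanging runs share one driver.
 */
static int
make_io_flags(bool use_direct_io)
{
   int flags = O_RDWR | O_CREAT;
   if (use_direct_io) {
      flags |= O_DIRECT; /* bypass the OS page cache */
   }
   /* Sanity check: adding O_DIRECT must not disturb the access mode. */
   assert((flags & O_ACCMODE) == O_RDWR);
   return flags;
}
```

A driver could then set `.io_flags = make_io_flags(use_direct_io)`, with `use_direct_io` taken from a command-line switch, and compare the two runs directly.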

chrisxu333 commented 3 months ago

I just tried to reproduce this bug with the large_inserts_stress_test driver. After I add the O_DIRECT flag to the splinterdb_config in large_inserts_stress_test.c, the test will hang on large_inserts_stress:test_seq_keys_random_values_threaded most of the time, and a few times on other multi-threaded test cases as well. Any idea what might cause this?

gapisback commented 3 months ago

Hi, @chrisxu333 --

When you say: "When I perform concurrent insertion by calling splinterdb_insert(), each time I increase the thread count to be larger than 8, the splinterdb_insert() call seems to hang forever." ...

Do you have a stand-alone repro that you wrote on your own? Or, were you relying on reproducing this issue using large_inserts_stress_test.c?

Re: "After I add the O_DIRECT flag to the splinterdb_config in large_inserts_stress_test.c, the test will hang on..."

I suggest you do not try to use this stress-test and its sub-cases to reproduce the bug you are finding.

That stress-test is somewhat in flux. Many test cases do work reliably, but some of the cases in it are currently a bit incomplete and can lead to hangs or unpredictable behaviour.

I have another revision of this large test-suite that is undergoing review, so until that open PR is addressed and integrated, I suggest you not rely on this test-suite as an exerciser to reproduce your problem.

chrisxu333 commented 3 months ago

Hi @gapisback, to answer your first question: yes, I'm running SplinterDB with a benchmark driver that I wrote myself. The reason I tried to reproduce it with that stress test was to rule out any mistakes I might have made in my driver, so that I could narrow down the actual cause of this bug to some extent.

So to rephrase the bug: when I run SplinterDB insertions under high concurrency (16 threads, for instance) and pass O_DIRECT when I call splinterdb_create, the program hangs forever after some time.
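For readers who want a stand-alone driver of the shape described here, the following is a minimal sketch of a multi-threaded insert harness. It is not the reporter's benchmark and it does not call SplinterDB: the `insert_fn` callback, `stub_insert`, and `run_concurrent_inserts` are hypothetical names, and in a real driver the callback would wrap splinterdb_insert() on a store created with O_DIRECT.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_THREADS 64

/* Hypothetical callback; a real driver would wrap splinterdb_insert(). */
typedef int (*insert_fn)(void *store, uint64_t key, uint64_t value);

typedef struct {
   insert_fn         insert;
   void             *store;
   uint64_t          first_key;   /* each thread gets its own key range */
   uint64_t          num_inserts;
   _Atomic uint64_t *completed;   /* shared count of successful inserts */
} worker_args;

static void *
insert_worker(void *arg)
{
   worker_args *w = arg;
   for (uint64_t i = 0; i < w->num_inserts; i++) {
      if (w->insert(w->store, w->first_key + i, i) != 0) {
         break; /* stop this thread on the first failed insert */
      }
      atomic_fetch_add(w->completed, 1);
   }
   return NULL;
}

/* Runs num_threads workers to completion; returns total successful inserts. */
static uint64_t
run_concurrent_inserts(insert_fn insert,
                       void     *store,
                       unsigned  num_threads,
                       uint64_t  inserts_per_thread)
{
   assert(num_threads > 0 && num_threads <= MAX_THREADS);
   pthread_t        tids[MAX_THREADS];
   worker_args      args[MAX_THREADS];
   _Atomic uint64_t completed = 0;
   unsigned         started   = 0;

   for (unsigned t = 0; t < num_threads; t++) {
      args[t] = (worker_args){.insert      = insert,
                              .store       = store,
                              .first_key   = t * inserts_per_thread,
                              .num_inserts = inserts_per_thread,
                              .completed   = &completed};
      if (pthread_create(&tids[t], NULL, insert_worker, &args[t]) != 0) {
         break;
      }
      started++;
   }
   /* If an insert never returns, this join is where the driver hangs. */
   for (unsigned t = 0; t < started; t++) {
      pthread_join(tids[t], NULL);
   }
   return atomic_load(&completed);
}

/* Stand-in for a real store so the sketch runs on its own. */
static int
stub_insert(void *store, uint64_t key, uint64_t value)
{
   (void)store;
   (void)key;
   (void)value;
   return 0;
}
```

Calling `run_concurrent_inserts(stub_insert, NULL, 16, 1000)` exercises the harness with 16 threads. When the run hangs, inspecting the shared counter in a debugger shows how far the workers got before progress stopped.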

rtjohnso commented 3 months ago

I can repro with large_inserts_stress_test per your instructions.

It looks like some I/O completions are not doing what they are supposed to. One deadlock had all threads complete except for one, which was waiting on the CC_WRITEBACK flag to be cleared on a page. Another had all threads complete except one, which was waiting on a req->busy flag to be cleared.
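Both hangs have the same shape: a thread waits for a flag that an I/O completion is expected to clear, and the completion never does. The sketch below illustrates that pattern with a deadline added, so a lost completion surfaces as a reportable failure instead of a silent hang. It is not SplinterDB's code; `wait_for_flag_clear` and `now_ns` are hypothetical helpers, and the flag merely stands in for something like CC_WRITEBACK or req->busy.

```c
#define _POSIX_C_SOURCE 200809L /* for clock_gettime() */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Monotonic clock reading in nanoseconds. */
static uint64_t
now_ns(void)
{
   struct timespec ts;
   int             rc = clock_gettime(CLOCK_MONOTONIC, &ts);
   assert(rc == 0);
   (void)rc;
   return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Hypothetical debugging aid: spin until *flag is cleared by another
 * thread (for example an I/O completion callback), or until timeout_ns
 * has elapsed. Returns true if the flag was cleared, false on timeout.
 */
static bool
wait_for_flag_clear(atomic_bool *flag, uint64_t timeout_ns)
{
   uint64_t deadline = now_ns() + timeout_ns;
   while (atomic_load_explicit(flag, memory_order_acquire)) {
      if (now_ns() >= deadline) {
         return false; /* the completion never arrived in time */
      }
   }
   return true;
}
```

On a `false` return, a debug build could log which page or request was being waited on before aborting, which narrows down which completion went missing.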

Will investigate. As @gapisback mentioned, one outcome of the investigation may be that the test is buggy. In that case it will be helpful to see the code you wrote. But let me try debugging it with large_inserts_stress_test first.

Thanks for the report.

chrisxu333 commented 3 months ago

Hi @rtjohnso thanks for the help. Let me know if you need my code :)

rtjohnso commented 3 months ago

@chrisxu333 -- can you check whether PR #621 fixes your issue?

chrisxu333 commented 3 months ago

@rtjohnso Yes I just ran it with my code and it works perfectly :) Thanks for your help

rtjohnso commented 3 months ago

Fixed by #621.