Closed avikivity closed 1 year ago
/cc @fruch @kostja @mykaul
3.26.0 is mostly changes from upstream,
but this fix (from @Lorak-mmk) sounds like a possible candidate: f5c34f0b Fix wait_for_schema_agreement deadlock
@Lorak-mmk @avelanarius can you guys take a look?
I'll try to debug this new deadlock. My first observation is that integration tests would have caught this - after rebasing https://github.com/scylladb/python-driver/pull/219 on current master only 1 test fails, instead of 3, so the other 2 failures were caused by https://github.com/scylladb/python-driver/pull/204
We are now returning to debugging and fixing this issue (last time @Lorak-mmk looked at it he had problems reproducing it reliably and there were other more important issues at the time, so there wasn't much effort done after that).
We can now reliably reproduce the issue, and it still reproduces when we modify the Python driver's code.
So far we have determined that at the moment the test hangs, there are 2 refresh_schema_and_set_result tasks stuck on the executor. Both tasks are stuck trying to acquire a lock here:
This will never happen, as all of the lock releases are also performed on the executor:
Lock releases in inner are executed on the executor, but the executor is full, stuck on refresh_schema_and_set_result. This is very similar to https://github.com/scylladb/python-driver/issues/168 - the fix causing the regression actually fixed one deadlock, but unearthed another one...
Currently we are in the middle of fixing it.
This should be fixed by https://github.com/scylladb/python-driver/pull/256 - but I think that in order for Scylla's unit tests to be fixed, the dbuild toolchain needs to be updated with the new driver.
ok, if our test suite is uncovering bugs in the driver then we're in a good position.
Although, it seems like we discovered the bug early (Apr 24) then admitted it through flakiness. But I guess that can't be helped with this sort of concurrency bug without a very aggressive test.
/cc @mykaul @kbr-scylla
FYI we have updated dtest this week with 3.26.3 that has #256 in it.
https://github.com/scylladb/scylla-dtest/pull/3608
Besides a few tests that needed to be adapted to new functionality, we didn't notice any other regressions.
@fruch - when you are comfortable with enough runs, we can close this I reckon.
Yes, it has been running in dtest and Scylla unit tests for a few days now and it looks OK so far.
works: 3.25.11 fails: 3.26.0
To test, check out scylladb/scylladb@642854f36f7686aea4a37ed62cf57025c81e61b8 and
Without the driver update, it will pass. With the driver update, it will hang.