scylladb / python-driver

ScyllaDB Python Driver, originally DataStax Python Driver for Apache Cassandra
https://python-driver.docs.scylladb.com
Apache License 2.0
74 stars 42 forks source link

regression running scylladb.git test.py topology/test_concurrent_schema #225

Closed avikivity closed 1 year ago

avikivity commented 1 year ago

works: 3.25.11 fails: 3.26.0

To test, checkout scylladb/scylladb@642854f36f7686aea4a37ed62cf57025c81e61b8 and

./tools/toolchain/dbuild
pip3 install -U  --only-binary=:all: -v scylla-driver
./test.py --mode dev topology/test_concurrent_schema --repeat 100

Without the driver update, it will pass. With the driver update, it will hang.

avikivity commented 1 year ago

/cc @fruch @kostja @mykaul

fruch commented 1 year ago

3.26.0 is mostly changes from upstream

but this fix (from @Lorak-mmk) sound like a possibly candidate f5c34f0b Fix wait_for_schema_agreement deadlock

fruch commented 1 year ago

@Lorak-mmk @avelanarius can you guys take a look ?

Lorak-mmk commented 1 year ago

I'll try to debug this new deadlock. My first observation is that integration tests would have caught this - after rebasing https://github.com/scylladb/python-driver/pull/219 on current master only 1 test fails, instead of 3, so other 2 failures were caused by https://github.com/scylladb/python-driver/pull/204

avelanarius commented 1 year ago

We are now returning to debugging and fixing this issue (last time @Lorak-mmk looked at it he had problems reproducing it reliably and there were other more important issues at the time, so there wasn't much effort done after that).

We can now reliably reproduce the issue and change the code of Python Driver (and have that also reproduce).

So far we have determined that at the moment the test hangs, there are 2 refresh_schema_and_set_result tasks stuck on the executor. Both of the tasks are stuck on trying to acquire a lock here:

https://github.com/scylladb/python-driver/blob/17ed1637f808c02d3439f8755e37b01e62d143d2/cassandra/cluster.py#L4186

This will never happen, as all of the lock releases are also performed on the executor:

https://github.com/scylladb/python-driver/blob/17ed1637f808c02d3439f8755e37b01e62d143d2/cassandra/cluster.py#L3895

Lock releases in inner are executed on the executor, but the executor is full, stuck on refresh_schema_and_set_result. This is very similar to https://github.com/scylladb/python-driver/issues/168 - the fix causing the regression actually fixed one deadlock, but unearthed another one...

Currently we are in the middle of fixing it.

Lorak-mmk commented 1 year ago

This should be fixed by https://github.com/scylladb/python-driver/pull/256 - but I think that in order to Scylla's unit tests to be fixed dbuild toolchain needs to be updated with new driver.

avikivity commented 1 year ago

ok, if our test suite is uncovering bugs in the driver then we're in a good position.

Although, it seems like we discovered the bug early (Apr 24) then admitted it through flakiness. But I guess that can't be helped with this sort of concurrency bug without a very aggressive test.

/cc @mykaul @kbr-scylla

fruch commented 1 year ago

ok, if our test suite is uncovering bugs in the driver then we're in a good position.

Although, it seems like we discovered the bug early (Apr 24) then admitted it through flakiness. But I guess that can't be helped with this sort of concurrency bug without a very aggressive test.

/cc @mykaul @kbr-scylla

FYI we have updated dtest this week with 3.26.3 that has #256 in it.

https://github.com/scylladb/scylla-dtest/pull/3608

Beside few tests that needed to be adapted to new functionally, we didn't noticed any other regression.

mykaul commented 1 year ago

@fruch - when you are comfortable with enough runs, we can close this I reckon.

fruch commented 1 year ago

@fruch - when you are comfortable with enough runs, we can close this I reckon.

Yes it's running in dtest and scylla unitests for a few days now and it's looks o.k. so far.