Closed prometheanfire closed 2 years ago
After a bit of initial investigation, it looks like this doesn't reproduce consistently or in isolation. Running the failed tests by themselves works fine. I was able to reproduce one failure by running the entire functional test suite for Nova, but that's a rather heavy reproducer, and suggests there may be race condition going on here. I'm going to keep poking at it, but since it will likely take a while (each Nova functional test run takes about 15 minutes for me) I thought I'd provide my initial findings.
With https://github.com/harlowja/fasteners/commit/467ed75ee1e9465ebff8b5edf452770befb93913 reverted I haven't been able to reproduce the Nova functional test failure. Previously it reproduced pretty consistently for me.
My guess would be that there's some sort of race in the eventlet fix that hits us when we're running highly concurrent things like functional tests. That's a WAG though.
I'm going to push a PR to revert that change. While the workaround may be a little ugly, at least it works and it only affects monkey-patched environments (which are ugly by definition ;-).
@cybertron I hope you still remember this 1.5 year old issue :)
I'm hesitating to revert the eventlet thing in PR #37, so I'd like to investigate this a little more. You write that you can hit this in the Nova functional test suite, do you have some pointers on how to run it? And what system were you using, Linux?
Okay, luckily I was able to dredge up some memories of this. :-)
It does still reproduce with the nova functional tests. I created a fresh Fedora VM and installed the necessary bits, then tweaked the package versions to use the latest fasteners, and my first test run failed on this error. Here are the highlights of what I did:
git clone https://opendev.org/openstack/nova
cd nova
sudo dnf install tox
sudo dnf install gcc-c++
tox -e bindep
# Install all of the missing packages reported by that, e.g.
sudo dnf install -y graphviz libffi-devel libxslt-devel lsscsi mysql mysql-devel mysql-server openssl pcre-devel postgresql postgresql-server sqlite-devel
tox -e functional # This should succeed, it will take a while
.tox/functional/bin/pip install -U fasteners
tox -e functional # This should eventually fail, it might take more than one run
You don't necessarily have to use Fedora, bindep is distro-aware so it will give you the right package names for your distro. You'll need to figure out the tox and gcc package names if you don't already have them, of course.
I can confirm, that this replicates, on ubuntu 20.04 in my case. I'm having some success with running a limited test
tox -e functional nova.tests.functional.test_cross_cell_migrate.TestMultiCellMigrate
Which can be repeated until failure with this nice trick:
while tox -e functional nova.tests.functional.test_cross_cell_migrate.TestMultiCellMigrate; do :; done
Repro :tada: :tada: :tada:
import eventlet
import fasteners
rw_lock = fasteners.ReaderWriterLock()
def do(read):
if read:
with rw_lock.read_lock():
eventlet.greenthread.sleep(1)
else:
with rw_lock.write_lock():
eventlet.greenthread.sleep(1)
pool = eventlet.GreenPool(10)
list(pool.imap(do, [True] * 5 + [False] * 5))
So at the moment fasteners do not support eventlet green threads. Like at all.
Question for those who know - are there any other "thread" like objects we would like fasteners to support? Is there a list of libraries of this sort?
Actually scratch that, this is not even supposed to work, as it lacks the magic
eventlet.monkey_patch()
Adding this repro example works. Back to investigation :(
OK, I got it now - nova is not properly monkey_patch
ing it's eventlets. So the question is then as follows - should fasteners add a support for unpached eventlet threads, or should all our library users properly patch their threading module?
I don't like monkey patching, so I'm leaning towards supporting it, but I would have to think what is the impact to our API. I'd love to keep it clear and simple, with no magic, and definitely as an optional dependency. Would love to hear some opinions!
Details from the investigation:
Fasteners ReaderWriterLock is used in the nova.tests.fixtures.CellDatabases
class. Lock is created in the __init__
method, line 400, and used mainly in the _wrap_target_cell
method.
I added print statements above each call to read_lock
or write_lock
as follows:
print(f'At 1. {threading.current_thread().__module__}')
with self._cell_lock.read_lock():
...
And the output is as follows:
Captured stdout:
~~~~~~~~~~~~~~~~
At 5. threading
At 5. threading
At 5. threading
At 5. threading
At 5. threading
At 1. threading
At 2. threading
At 3. threading
...
This means that the threading
module is not monkey patched at that point, and all the treads are the same thread.
Then, the error itself is triggered when read lock is aquired at line 472, method yields ccontext
at line 486 and another thread tries to acquire write lock at 491, before the previous one returns from the yield to release the read lock. Since the two different green threads are the same, it is considered as a reader trying to become a writer, which triggers the exception.
Huh, that's strange. The functional tests should be fully monkey patched. The functional module has https://opendev.org/openstack/nova/src/branch/master/nova/tests/functional/__init__.py#L23 which calls https://opendev.org/openstack/nova/src/branch/master/nova/monkey_patch.py#L63 unless the debugger is enabled, in which case the thread module would be excluded, but at least in my env I wasn't using the debugger.
There is some funny business with the threading module later in that function though. Apparently it's related to https://github.com/eventlet/eventlet/issues/592 . Maybe that's causing an issue here?
I'm closing this, as the issue here most likely was related to spawn_n
, which is solved in fasteners 0.18, and oslo.concurency fixed it independently as well.
I don't have any guidance on how nova should migrate to fixed versions, the project is very complicated :)
Let me know if you ever try the fixed version and it does not work.
https://github.com/harlowja/fasteners/commit/467ed75ee1e9465ebff8b5edf452770befb93913
Hi, I think this commit caused an issue with nova (at the very least). Had an oslo guy (bnemec) look at it and he pointed to the commit.
Here's the log while it's still up.
http://logs.openstack.org/54/660254/2/check/cross-nova-functional/a0617a9/testr_results.html.gz
Backtraces follow.
The errors look more or less the same to me. If I could get indication if I should mask 0.15 in global-requirements.txt (as a known bad release) that'd be nice, right now it's just constraints that are keeping openstack safe.
Thanks