Definitely worth trying v.4.0.1 first, I would say.
Already have, same issue. I am guessing it is a similar problem to the one they fixed in 4.0.1 regarding connection pooling... only they haven't fixed it yet.
Me too. Reading through the FAQs, it sounds like you're doing everything correctly by creating new client instances in child processes. And the way they work around some deadlock scenarios certainly seems a bit ad hoc:
```python
# If the first getaddrinfo call of this interpreter's life is on a thread,
# while the main thread holds the import lock, getaddrinfo deadlocks trying
# to import the IDNA codec. Import it here, where presumably we're on the
# main thread, to avoid the deadlock. See PYTHON-607.
'foo'.encode('idna')
```
Maybe this paragraph describes what's happening here?
> MongoClient spawns multiple threads to run background tasks such as monitoring connected servers. These threads share state that is protected by instances of Lock, which are themselves not fork-safe. The driver is therefore subject to the same limitations as any other multithreaded code that uses Lock (and mutexes in general). One of these limitations is that the locks become useless after fork(). During the fork, all locks are copied over to the child process in the same state as they were in the parent: if they were locked, the copied locks are also locked. The child created by fork() only has one thread, so any locks that were taken out by other threads in the parent will never be released in the child. The next time the child process attempts to acquire one of these locks, deadlock occurs.
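For reference, the fork-safe pattern the docs point to looks roughly like this (a minimal sketch; the URI is a placeholder):

```python
from multiprocessing import Process

from pymongo import MongoClient

MONGO_URI = "mongodb://localhost:27017"  # placeholder

def child_task():
    # Safe: the client (and its locks and monitor threads) is created
    # *after* the fork, so nothing is inherited in a locked state.
    client = MongoClient(MONGO_URI)
    client.admin.command("ping")
    client.close()

if __name__ == "__main__":
    # The unsafe alternative is constructing the MongoClient out here and
    # reusing it inside child_task: fork() would copy its Locks into the
    # child, possibly in a locked state.
    p = Process(target=child_task)
    p.start()
    p.join()
```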
I'm not sufficiently familiar with fireworks to understand why it needs to communicate with the DB through child processes, but maybe it could be refactored so that only the parent does DB communication?
Well, I don't think FireWorks itself needs to communicate with the db through multiple processes; I think it is just this test. The test uses multiple processes so that fws can submit some stuff to the workflow and, in parallel, continuously check when runs get "lost". I'll try your suggestion.
@ardunn please let me know when / if I should merge this ...
@computron don't merge yet, I am still working on this
@computron you can merge now; I wound up just disabling those troublesome detect_lostruns tests when running against pymongo 4+. The detect_lostruns tests still run for pymongo 3 versions, gated along the lines sketched below.
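A sketch of the gating (decorator placement in the actual test file may differ):

```python
import unittest

import pymongo

# detect_lostruns forks a process that shares the parent's MongoClient,
# which is not fork-safe under pymongo 4+.
PYMONGO4_PLUS = pymongo.version_tuple[0] >= 4

class LaunchPadLostRunsDetectTest(unittest.TestCase):
    @unittest.skipIf(PYMONGO4_PLUS, "shared MongoClient across fork; unsupported in pymongo 4+")
    def test_detect_lostruns(self):
        ...
```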
It is worth noting that the tests that were incompatible with pymongo 4 only failed because of how the tests were written, not because of anything in fireworks itself. For that reason, I think it's pretty safe to merge.
Fixed tests:
- Tests using the removed `.insert` syntax; updated to `insert_one` (quick sketch of the change below)
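For example, the change in a minimal sketch (db/collection names are illustrative):

```python
from pymongo import MongoClient

coll = MongoClient()["fireworks_test"]["fireworks"]  # illustrative names

# pymongo 3 style -- Collection.insert() was removed in pymongo 4:
# coll.insert({"fw_id": 1, "state": "READY"})

# pymongo 4 replacement:
coll.insert_one({"fw_id": 1, "state": "READY"})
```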
Tests still to fix:
- `test_launchpad.LaunchPadLostRunsDetectTest.test_detect_lostruns`
- `test_launchpad.LaunchPadLostRunsDetectTest.test_detect_lostruns_defuse`
- `test_state_after_run_start`
All 3 of these remaining failing tests are likely failing for the same reason. A separate process is created by an internal `RocketProcess` class in the tests, but per the pymongo 4 docs, `MongoClient` is not fork-safe. The separate process reuses the LaunchPad's `MongoClient`, whereas the pymongo 4 docs say new instances must be created in the forked process. This produces an error in the child; monitoring the mongo logs shows only a rather unhelpful "connection ended".
I have not found a way to fix this yet, as making new instances of LaunchPad (and hence new instances of MongoClient) from within the forked process still gives the same error.
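Roughly what I tried, as a minimal sketch (assuming `LaunchPad.auto_load()` picks up the test config; the expiration value is illustrative):

```python
from multiprocessing import Process

from fireworks import LaunchPad

def check_lostruns_in_child():
    # Build a fresh LaunchPad (and therefore a fresh MongoClient)
    # *inside* the forked process, as the pymongo 4 docs recommend.
    lp = LaunchPad.auto_load()
    lp.detect_lostruns(expiration_secs=1)

p = Process(target=check_lostruns_in_child)
p.start()
p.join()
```

Even with the client created post-fork like this, the child still hits the same "connection ended" failure.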
According to the pymongo changelog, v.4.0.1 fixed a related error with connection pools (pymongo 4 brought an overhaul of connection pooling)... so maybe this is just an unfixed pymongo bug?
Need help
I am calling on any mongo/fws gurus/users out there to please help debug this
@mkhorton @computron @janosh