materialsproject / fireworks

The Fireworks Workflow Management Repo.
https://materialsproject.github.io/fireworks

Pymongo 4 Fixes (WIP) #473

Closed ardunn closed 2 years ago

ardunn commented 2 years ago

Fixed tests:

Tests still to fix:

All 3 of the remaining failing tests are likely failing for the same reason. The tests create a separate process via an internal RocketProcess class, but per the pymongo 4 docs, MongoClient is not fork-safe: forked processes must create their own client instances rather than reuse the parent's. The separate process refers back to the LaunchPad's MongoClient created in the parent, which gives this error:

Process RocketProcess-1:
pymongo.errors.AutoReconnect: connection pool paused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/Users/ardunn/alex/lbl/projects/common_env/dev_codes/fireworks/fireworks/core/tests/test_launchpad.py", line 602, in run
    launch_rocket(self.lpad, self.fworker)
  File "/Users/ardunn/alex/lbl/projects/common_env/dev_codes/fireworks/fireworks/core/rocket_launcher.py", line 58, in launch_rocket
    rocket_ran = rocket.run(pdb_on_exception=pdb_on_exception)
  File "/Users/ardunn/alex/lbl/projects/common_env/dev_codes/fireworks/fireworks/core/rocket.py", line 145, in run
    m_fw, launch_id = lp.checkout_fw(self.fworker, launch_dir, self.fw_id)
  File "/Users/ardunn/alex/lbl/projects/common_env/dev_codes/fireworks/fireworks/core/launchpad.py", line 1444, in checkout_fw
    m_fw = self._get_a_fw_to_run(fworker.query, fw_id=fw_id)
  File "/Users/ardunn/alex/lbl/projects/common_env/dev_codes/fireworks/fireworks/core/launchpad.py", line 1178, in _get_a_fw_to_run
    m_query, {"$set": {"state": "RESERVED", "updated_on": datetime.datetime.utcnow()}}, sort=sortby
  File "/Users/ardunn/alex/lbl/projects/common_env/fwenv/lib/python3.7/site-packages/pymongo/collection.py", line 2565, in find_one_and_update
    session=session, **kwargs)
  File "/Users/ardunn/alex/lbl/projects/common_env/fwenv/lib/python3.7/site-packages/pymongo/collection.py", line 2289, in __find_and_modify
    write_concern.acknowledged, _find_and_modify, session)
  File "/Users/ardunn/alex/lbl/projects/common_env/fwenv/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1340, in _retryable_write
    return self._retry_with_session(retryable, func, s, None)
  File "/Users/ardunn/alex/lbl/projects/common_env/fwenv/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1229, in _retry_with_session
    return self._retry_internal(retryable, func, session, bulk)
  File "/Users/ardunn/alex/lbl/projects/common_env/fwenv/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1253, in _retry_internal
    with self._get_socket(server, session) as sock_info:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/Users/ardunn/alex/lbl/projects/common_env/fwenv/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1100, in _get_socket
    self.__all_credentials, handler=err_handler) as sock_info:
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/Users/ardunn/alex/lbl/projects/common_env/fwenv/lib/python3.7/site-packages/pymongo/pool.py", line 1371, in get_socket
    sock_info = self._get_socket(all_credentials)
  File "/Users/ardunn/alex/lbl/projects/common_env/fwenv/lib/python3.7/site-packages/pymongo/pool.py", line 1436, in _get_socket
    self._raise_if_not_ready(emit_event=True)
  File "/Users/ardunn/alex/lbl/projects/common_env/fwenv/lib/python3.7/site-packages/pymongo/pool.py", line 1408, in _raise_if_not_ready
    self.address, AutoReconnect('connection pool paused'))
  File "/Users/ardunn/alex/lbl/projects/common_env/fwenv/lib/python3.7/site-packages/pymongo/pool.py", line 250, in _raise_connection_failure
    raise AutoReconnect(msg) from error
pymongo.errors.AutoReconnect: localhost:27017: connection pool paused

fireworks/core/tests/test_launchpad.py:591 (LaunchPadLostRunsDetectTest.test_detect_lostruns)
self = <fireworks.core.tests.test_launchpad.LaunchPadLostRunsDetectTest testMethod=test_detect_lostruns>

    def test_detect_lostruns(self):
        # Launch the timed firework in a separate process
        class RocketProcess(Process):
            def __init__(self):
                super(self.__class__, self).__init__()
                self.lpad = LaunchPad(name=TESTDB_NAME, strm_lvl="ERROR")
                self.fworker = FWorker()

            def run(self):
                print("About to run rocketprocess")
                launch_rocket(self.lpad, self.fworker)
                print("Did run rocketprocess")

        rp = RocketProcess()
        rp.start()

        # Wait for fw to start
        it = 0
        while not any([f.state == "RUNNING" for f in self.lp.get_wf_by_fw_id(self.fw_id).fws]):
            time.sleep(1)  # Wait 1 sec
            it += 1
            if it == 10:
>               raise ValueError("FW never starts running")
E               ValueError: FW never starts running

test_launchpad.py:614: ValueError

Monitoring the mongo logs shows only this rather unhelpful "Connection ended" message:

{"t":{"$date":"2022-01-07T23:01:25.799-08:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn61","msg":"client metadata","attr":{"remote":"127.0.0.1:51989","client":"conn61","doc":{"driver":{"name":"PyMongo","version":"4.0.1"},"os":{"type":"Darwin","name":"Darwin","architecture":"x86_64","version":"10.16"},"platform":"CPython 3.7.3.final.0"}}}
{"t":{"$date":"2022-01-07T23:01:25.801-08:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn58","msg":"Connection ended","attr":{"remote":"127.0.0.1:51986","connectionId":58,"connectionCount":7}}
{"t":{"$date":"2022-01-07T23:01:25.808-08:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn60","msg":"Connection ended","attr":{"remote":"127.0.0.1:51988","connectionId":60,"connectionCount":6}}

I have not found a way to fix this yet; making new instances of LaunchPad (and hence new instances of MongoClient) from within the forked process still gives the same error.

According to the pymongo changelog, v4.0.1 fixed a related error with the connection pools (pymongo 4 brought an overhaul of connection pooling)... so maybe this is just an unfixed pymongo bug?
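For intuition on why reusing a parent-created client fails in the child at all: MongoClient runs background monitor threads, and OS threads do not survive fork(). A minimal stdlib sketch (no pymongo involved; Monitor is a made-up stand-in, not a real pymongo class) shows the same effect:

```python
# Stdlib-only illustration of why a client created before fork() goes stale
# in the child: background threads (like MongoClient's server monitors) are
# not forked, so the state they maintain is dead in the child process.
import multiprocessing as mp
import threading
import time

class Monitor:
    """Hypothetical stand-in for a client with a background monitoring thread."""
    def __init__(self):
        self._thread = threading.Thread(target=lambda: time.sleep(60), daemon=True)
        self._thread.start()

    def healthy(self):
        return self._thread.is_alive()

def child(monitor, q):
    # The Thread object was copied across the fork, but the OS thread
    # behind it only ever existed in the parent.
    q.put(monitor.healthy())

if __name__ == "__main__":
    mp.set_start_method("fork", force=True)
    m = Monitor()
    print("parent healthy:", m.healthy())  # True
    q = mp.Queue()
    p = mp.Process(target=child, args=(m, q))
    p.start()
    p.join()
    print("child healthy:", q.get())  # False
```

This matches the pymongo docs' guidance that every process must construct its own client after the fork, though as noted above that alone did not resolve the test failures here.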

Need help

I am calling on any mongo/fireworks gurus and users out there to please help debug this.

@mkhorton @computron @janosh

janosh commented 2 years ago

Definitely worth trying v4.0.1 first, I would say.

ardunn commented 2 years ago

> Definitely worth trying v4.0.1 first, I would say.

Already have, same issue. I am guessing it is a similar problem to the one they fixed in 4.0.1 regarding connection pooling... only this one hasn't been fixed yet.

janosh commented 2 years ago

> I am guessing it is a similar problem to the one they fixed in 4.0.1 regarding connection pooling... only this one hasn't been fixed yet.

Me too. Reading through the FAQs, it sounds like you're doing everything correctly by creating new client instances in child processes. And the way they work around some deadlock scenarios certainly seems a bit ad hoc:

https://github.com/mongodb/mongo-python-driver/blob/c760f900f2e4109a247c2ffc8ad3549362007772/pymongo/pool.py#L221-L225

# If the first getaddrinfo call of this interpreter's life is on a thread,
# while the main thread holds the import lock, getaddrinfo deadlocks trying
# to import the IDNA codec. Import it here, where presumably we're on the
# main thread, to avoid the deadlock. See PYTHON-607.
'foo'.encode('idna')

Maybe this paragraph describes what's happening here?

MongoClient spawns multiple threads to run background tasks such as monitoring connected servers. These threads share state that is protected by instances of Lock, which are themselves not fork-safe. The driver is therefore subject to the same limitations as any other multithreaded code that uses Lock (and mutexes in general). One of these limitations is that the locks become useless after fork(). During the fork, all locks are copied over to the child process in the same state as they were in the parent: if they were locked, the copied locks are also locked. The child created by fork() only has one thread, so any locks that were taken out by other threads in the parent will never be released in the child. The next time the child process attempts to acquire one of these locks, deadlock occurs.
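The quoted limitation is easy to reproduce with the stdlib alone; a minimal sketch (no pymongo involved): a Lock acquired in the parent before fork() is copied into the child still locked, and the child has no thread that will ever release it. A blocking acquire would deadlock, so a non-blocking acquire is used here just to observe the state:

```python
# Demonstrate that a lock held at fork() time is permanently stuck in the child.
import multiprocessing as mp
import threading

def child(lock, q):
    # Non-blocking: a blocking acquire here would deadlock forever,
    # since no thread in the child holds (or can release) this lock.
    q.put(lock.acquire(blocking=False))

if __name__ == "__main__":
    mp.set_start_method("fork", force=True)
    lock = threading.Lock()
    lock.acquire()  # held by the parent at the moment of fork()
    q = mp.Queue()
    p = mp.Process(target=child, args=(lock, q))
    p.start()
    p.join()
    print("child acquired lock:", q.get())  # False: the copy is still locked
```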

I'm not sufficiently familiar with fireworks to understand why it needs to communicate to the DB through child processes but maybe it could be refactored so that only the parent does DB communication?

ardunn commented 2 years ago

> I'm not sufficiently familiar with fireworks to understand why it needs to communicate to the DB through child processes but maybe it could be refactored so that only the parent does DB communication?

Well, I don't think FireWorks itself needs to communicate with the db through multiple processes; I think it is just this test. The test uses multiple processes so that fws can submit some stuff to the workflow while, in parallel, it continuously checks for when they get "lost". I'll try your suggestion.

computron commented 2 years ago

@ardunn please let me know when / if I should merge this ...

ardunn commented 2 years ago

@computron don't merge yet, I am still working on this

ardunn commented 2 years ago

@computron you can merge now; I wound up just disabling those troublesome detect_lostruns tests when a pymongo 4+ version is installed. The detect_lostruns tests still run on pymongo 3 versions.

It is worth noting that the tests that were incompatible with pymongo 4 failed only because of the way they were written, not because of anything in fireworks itself. For that reason, I think it's pretty safe to merge.
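For reference, the version-gating described above can be sketched with `unittest.skipIf` (a minimal illustration; the class and reason string are hypothetical, not the actual fireworks test code):

```python
import unittest

def pymongo_major(version_str):
    """Major version number from a pymongo version string like '4.0.1'."""
    return int(version_str.split(".")[0])

# In the real tests this would come from pymongo.version; hard-coded here
# so the sketch runs without pymongo installed.
PYMONGO4 = pymongo_major("4.0.1") >= 4

class LostRunsTest(unittest.TestCase):
    # Skip under pymongo 4+, where reusing a MongoClient across fork() breaks.
    @unittest.skipIf(PYMONGO4, "RocketProcess tests reuse a MongoClient across fork()")
    def test_detect_lostruns(self):
        pass
```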