materialsproject / fireworks

The Fireworks Workflow Management Repo.
https://materialsproject.github.io/fireworks

Fix pymongo 4.0 breaking changes #471

Closed ardunn closed 2 years ago

ardunn commented 2 years ago

As per @computron's and @mkhorton's suggestion.

A note

To be honest, I think the easiest thing to do is, instead of aliasing these SSL arguments to TLS depending on the version of pymongo, to remove the instance attributes for anything SSL and fold all of these possible kwargs into mongoclient_kwargs. That way everything SSL (pymongo<4.0) and everything TLS (pymongo>=4.0) is lumped into mongoclient_kwargs.

Then there is no checking of versions, fewer lines of code, no weird conditional import, etc. This would seem reasonable to me if we make it very clear in the docstrings/docs how to do this. @computron I think this is worth considering.
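A minimal sketch of that pass-through idea, assuming a hypothetical helper named build_client_kwargs (it is not part of FireWorks). The helper does no version checking; the caller picks the option spelling that matches the installed pymongo. The option names ssl_ca_certs (pymongo<4.0) and tlsCAFile (pymongo>=4.0) are real pymongo options; the file path is a placeholder.

```python
from typing import Any, Dict, Optional


def build_client_kwargs(
    host: str = "localhost",
    port: int = 27017,
    mongoclient_kwargs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge the connection basics with user-supplied MongoClient options.

    Hypothetical helper: the SSL/TLS options are passed through untouched,
    so no pymongo version check or aliasing is needed here.
    """
    kwargs: Dict[str, Any] = {"host": host, "port": port}
    kwargs.update(mongoclient_kwargs or {})
    return kwargs


# Spelling a user on pymongo<4.0 would supply
legacy = build_client_kwargs(
    mongoclient_kwargs={"ssl": True, "ssl_ca_certs": "/path/to/ca.pem"}
)

# Spelling a user on pymongo>=4.0 would supply
modern = build_client_kwargs(
    mongoclient_kwargs={"tls": True, "tlsCAFile": "/path/to/ca.pem"}
)

# Either dict would then be handed to pymongo.MongoClient(**kwargs).
```

The trade-off is that a config written for one pymongo major version is rejected by the other, which is why the docs would need to spell this out.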

But as of right now it has just been implemented as suggested.

Closes #469

ardunn commented 2 years ago

@computron @mkhorton There are some other problems preventing pymongo 4 from working...

The ones I can't figure out are entirely in the LaunchPadLostRunsDetectTest suite.

mkhorton commented 2 years ago

The error is:

E TypeError: 'Collection' object is not callable. If you meant to call the 'insert' method on a 'Collection' object it is failing because no such method exists.

I think @munrojm fixed this same error recently in another code.

ardunn commented 2 years ago

That is not the error I'm having trouble with; that one is fixed easily with insert --> insert_many.
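For reference, pymongo 4.0 removed Collection.insert, which accepted either a single document or a list; the replacements are insert_one and insert_many. A rough sketch of the swap, assuming a hypothetical helper insert_docs and a stand-in RecordingCollection class that exists only so the example runs without a MongoDB server:

```python
from typing import Any, Dict, List, Union

Doc = Dict[str, Any]


def insert_docs(collection: Any, docs: Union[Doc, List[Doc]]) -> Any:
    """Dispatch to insert_one or insert_many, mimicking the old insert().

    Hypothetical helper; `collection` is anything exposing the two methods.
    """
    if isinstance(docs, dict):
        return collection.insert_one(docs)
    return collection.insert_many(list(docs))


class RecordingCollection:
    """Stand-in for a pymongo Collection; it only records the calls made."""

    def __init__(self) -> None:
        self.calls: List[Any] = []

    def insert_one(self, doc: Doc) -> None:
        self.calls.append(("insert_one", doc))

    def insert_many(self, docs: List[Doc]) -> None:
        self.calls.append(("insert_many", docs))


coll = RecordingCollection()
insert_docs(coll, {"fw_id": 1})                   # single document
insert_docs(coll, [{"fw_id": 2}, {"fw_id": 3}])   # list of documents
```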

The troublesome ones are the lost-runs tests, where the firework never starts running.

____________ LaunchPadLostRunsDetectTest.test_state_after_run_start ____________

self = <fireworks.core.tests.test_launchpad.LaunchPadLostRunsDetectTest testMethod=test_state_after_run_start>

    def test_state_after_run_start(self):
        # Launch the timed firework in a separate process
        class RocketProcess(Process):
            def __init__(self, lpad, fworker):
                super(self.__class__, self).__init__()
                self.lpad = lpad
                self.fworker = fworker

            def run(self):
                launch_rocket(self.lpad, self.fworker)

        rp = RocketProcess(self.lp, self.fworker)
        rp.start()

        # Wait for running
        it = 0
        while not any([f.state == "RUNNING" for f in self.lp.get_wf_by_fw_id_lzyfw(self.fw_id).fws]):
            time.sleep(1)  # Wait 1 sec
            it += 1
            if it == 10:
>               raise ValueError("FW never starts running")
E               ValueError: FW never starts running

fireworks/core/tests/test_launchpad.py:681: ValueError
----------------------------- Captured stdout call -----------------------------
2021-12-07 21:41:41,784 INFO Added a workflow. id_map: {-459: 1}
2021-12-07 21:41:41,789 INFO Launching Rocket
2021-12-07 21:41:51,819 INFO Performing db tune-up
2021-12-07 21:41:51,850 INFO LaunchPad was RESET.
----------------------------- Captured stderr call -----------------------------
Process RocketProcess-3:
pymongo.errors.AutoReconnect: connection pool paused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/envs/test-environment/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/root/fireworks/fireworks/core/tests/test_launchpad.py", line 670, in run
    launch_rocket(self.lpad, self.fworker)
  File "/root/fireworks/fireworks/core/rocket_launcher.py", line 58, in launch_rocket
    rocket_ran = rocket.run(pdb_on_exception=pdb_on_exception)
  File "/root/fireworks/fireworks/core/rocket.py", line 145, in run
    m_fw, launch_id = lp.checkout_fw(self.fworker, launch_dir, self.fw_id)
  File "/root/fireworks/fireworks/core/launchpad.py", line 1438, in checkout_fw
    m_fw = self._get_a_fw_to_run(fworker.query, fw_id=fw_id)
  File "/root/fireworks/fireworks/core/launchpad.py", line 1173, in _get_a_fw_to_run
    m_query, {"$set": {"state": "RESERVED", "updated_on": datetime.datetime.utcnow()}}, sort=sortby
  File "/opt/conda/envs/test-environment/lib/python3.7/site-packages/pymongo/collection.py", line 2565, in find_one_and_update
    session=session, **kwargs)
  File "/opt/conda/envs/test-environment/lib/python3.7/site-packages/pymongo/collection.py", line 2289, in __find_and_modify
    write_concern.acknowledged, _find_and_modify, session)
  File "/opt/conda/envs/test-environment/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1340, in _retryable_write
    return self._retry_with_session(retryable, func, s, None)
  File "/opt/conda/envs/test-environment/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1229, in _retry_with_session
    return self._retry_internal(retryable, func, session, bulk)
  File "/opt/conda/envs/test-environment/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1253, in _retry_internal
    with self._get_socket(server, session) as sock_info:
  File "/opt/conda/envs/test-environment/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/opt/conda/envs/test-environment/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1100, in _get_socket
    self.__all_credentials, handler=err_handler) as sock_info:
  File "/opt/conda/envs/test-environment/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/opt/conda/envs/test-environment/lib/python3.7/site-packages/pymongo/pool.py", line 1371, in get_socket
    sock_info = self._get_socket(all_credentials)
  File "/opt/conda/envs/test-environment/lib/python3.7/site-packages/pymongo/pool.py", line 1436, in _get_socket
    self._raise_if_not_ready(emit_event=True)
  File "/opt/conda/envs/test-environment/lib/python3.7/site-packages/pymongo/pool.py", line 1408, in _raise_if_not_ready
    self.address, AutoReconnect('connection pool paused'))
  File "/opt/conda/envs/test-environment/lib/python3.7/site-packages/pymongo/pool.py", line 250, in _raise_connection_failure
    raise AutoReconnect(msg) from error
pymongo.errors.AutoReconnect: localhost:27017: connection pool paused
------------------------------ Captured log call -------------------------------
INFO     launchpad:launchpad.py:413 Added a workflow. id_map: {-459: 1}
DEBUG    launchpad:launchpad.py:775 Aggregation '[{'$match': {'name': 'timer'}}, {'$project': {'fw_id': True, '_id': False}}, {'$limit': 1}]'.
DEBUG    launchpad:launchpad.py:1122 RESTARTED fw_id, launch_id to (1, 1)
INFO     launchpad:launchpad.py:933 Performing db tune-up
DEBUG    launchpad:launchpad.py:935 Updating indices...
INFO     launchpad:launchpad.py:346 LaunchPad was RESET.

mkhorton commented 2 years ago

Gotcha. I'm afraid I'm out of ideas without digging in myself.

If you add a breakpoint, does this line return the correct FireWork?

m_fw = self._get_a_fw_to_run(fworker.query, fw_id=fw_id)

And if you run this $set command manually via pymongo, does that work?

{"$set": {"state": "RESERVED", "updated_on": datetime.datetime.utcnow()}}
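A sketch of what that manual check might look like. The helper name reserve_one is hypothetical; find_one_and_update with a sort keyword is the same pymongo call that appears in the traceback. The database and collection names in the commented usage are assumptions about a default local setup.

```python
import datetime
from typing import Any, Dict, List, Optional, Tuple


def reserve_one(
    collection: Any,
    query: Dict[str, Any],
    sort: Optional[List[Tuple[str, int]]] = None,
) -> Any:
    """Apply the same $set update that _get_a_fw_to_run issues in the traceback.

    Hypothetical helper; `collection` is expected to be a pymongo Collection.
    Returns whatever find_one_and_update returns (the matched document, or
    None if nothing matched the query).
    """
    update = {
        "$set": {
            "state": "RESERVED",
            "updated_on": datetime.datetime.utcnow(),
        }
    }
    return collection.find_one_and_update(query, update, sort=sort)


# Against a live server this could be run as follows (names are assumptions):
#   from pymongo import MongoClient
#   coll = MongoClient("localhost", 27017)["fireworks"]["fireworks"]
#   print(reserve_one(coll, {"state": "READY"}))
```

If this succeeds in a plain interpreter but fails inside the test, that would point at how the connection is shared with the child process rather than at the update itself.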

mkhorton commented 2 years ago

@ardunn was your issue resolved?

ardunn commented 2 years ago

Uhh, no, it was not... @computron can we unmerge until it is fixed?
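One possible angle, not confirmed anywhere in this thread: the traceback above shows the failure inside the child RocketProcess, which reuses a LaunchPad created in the parent. The pymongo documentation states that MongoClient is not fork-safe and that each child process must create its own instance. A minimal sketch of that pattern follows, assuming a hypothetical LazyConnectionProcess class that receives a connection factory instead of a live connection:

```python
from multiprocessing import Process
from typing import Any, Callable


class LazyConnectionProcess(Process):
    """Build the database connection inside run() rather than in the parent.

    Hypothetical class. `connect` is a zero-argument factory that returns a
    fresh connection object; `work` is the function that uses it.
    """

    def __init__(self, connect: Callable[[], Any], work: Callable[[Any], None]):
        super().__init__()
        self.connect = connect
        self.work = work

    def run(self) -> None:
        # After start(), this executes in the child process, so the
        # connection (and its socket pool) is never shared with the parent.
        conn = self.connect()
        self.work(conn)
```

In the failing test, connect could be something that rebuilds the LaunchPad from its serialized form in the child, for example via LaunchPad.from_dict(lp.to_dict()), and work could call launch_rocket. Both of those are assumptions about how it would be wired up, and whether this resolves the "connection pool paused" error is untested here.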