openforcefield/openff-docs

Documentation for the Open Force Field ecosystem
https://docs.openforcefield.org/

Pre-processing virtual site notebook failing #62

Status: Open · mattwthompson opened this issue 1 week ago

mattwthompson commented 1 week ago

https://github.com/openforcefield/openff-docs/actions/runs/10755472185

A snippet of the log is below. I'm not seeing these failures in my own nightly CI, so perhaps something is configured differently here? (https://github.com/openforcefield/openff-docs/issues/53 ?)

--------------------------------------------------------------------------------
openforcefield/openff-toolkit/virtual_sites/vsite_showcase.ipynb failed. Traceback:

Traceback (most recent call last):
  File "/home/runner/work/openff-docs/openff-docs/source/_ext/proc_examples.py", line 218, in execute_notebook
    executor.preprocess(nb, {"metadata": {"path": src.parent}})
  File "/home/runner/micromamba/envs/openff-docs-examples/lib/python3.10/site-packages/nbconvert/preprocessors/execute.py", line 103, in preprocess
    self.preprocess_cell(cell, resources, index)
  File "/home/runner/micromamba/envs/openff-docs-examples/lib/python3.10/site-packages/nbconvert/preprocessors/execute.py", line 124, in preprocess_cell
    cell = self.execute_cell(cell, index, store_history=True)
  File "/home/runner/micromamba/envs/openff-docs-examples/lib/python3.10/site-packages/jupyter_core/utils/__init__.py", line 165, in wrapped
    return loop.run_until_complete(inner)
  File "/home/runner/micromamba/envs/openff-docs-examples/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/home/runner/micromamba/envs/openff-docs-examples/lib/python3.10/site-packages/nbclient/client.py", line [100](https://github.com/openforcefield/openff-docs/actions/runs/10755472185/job/29827200581#step:7:101)5, in async_execute_cell
    exec_reply = await self.task_poll_for_reply
  File "/home/runner/micromamba/envs/openff-docs-examples/lib/python3.10/site-packages/nbclient/client.py", line 806, in _async_poll_for_reply
    error_on_timeout_execute_reply = await self._async_handle_timeout(timeout, cell)
  File "/home/runner/micromamba/envs/openff-docs-examples/lib/python3.10/site-packages/nbclient/client.py", line 856, in _async_handle_timeout
    raise CellTimeoutError.error_from_timeout_and_cell(
nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 1200 seconds.
The message was: Cell execution timed out.
Here is a preview of the cell contents:
-------------------
interchange = force_field.create_interchange(topology=molecule.to_topology())

assert "VirtualSites" in interchange.collections.keys()

n_virtual_sites = len(interchange.collections["VirtualSites"].key_map)

print(f"There are {n_virtual_sites} virtual particles in this topology.")
-------------------

The following 1/29 notebooks failed:
     openforcefield/openff-toolkit/virtual_sites/vsite_showcase.ipynb
For tracebacks, see above.
Writing log to /home/runner/work/openff-docs/openff-docs/notebooks_log.json
Error: Process completed with exit code 1.
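For reference, here is a self-contained sketch of roughly what the timing-out cell does. The specific force field files ("openff-2.1.0.offxml" plus the "tip4p_fb.offxml" water model) and the water molecule are illustrative assumptions; in the notebook, `force_field` and `molecule` are built in earlier cells.

```python
# Hedged, self-contained sketch of the cell that times out. The force field
# files and the water molecule are assumptions for illustration only; the
# notebook defines its own force_field and molecule in earlier cells.
from openff.toolkit import ForceField, Molecule

# A SMIRNOFF force field plus a 4-site water model that adds virtual sites
force_field = ForceField("openff-2.1.0.offxml", "tip4p_fb.offxml")

molecule = Molecule.from_smiles("O")

# This is the call that hangs in CI
interchange = force_field.create_interchange(topology=molecule.to_topology())

assert "VirtualSites" in interchange.collections.keys()

n_virtual_sites = len(interchange.collections["VirtualSites"].key_map)
print(f"There are {n_virtual_sites} virtual particles in this topology.")
```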

P.S. I'm still getting daily emails about these runs - could someone look into updating the notification flow? I'm not the best person to figure out whether these failures are genuine or whether something changed in the config, automation, etc.

Yoshanuikabundi commented 1 week ago

This seems to be an intermittent error, but I don't know where it's coming from. It probably has something to do with the fact that all the notebooks are run in parallel here, which speeds things up a lot even though the runners don't technically have enough cores. I've experimented with extending the timeout, but it's already much longer than it should need to be. What if I set it up so that cell timeout errors did not cause the CI to fail?
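For concreteness, a minimal sketch of what that could look like in the executor wrapper, assuming the same nbconvert/nbclient machinery seen in the traceback above; the function signature and return convention here are illustrative, not the current proc_examples.py implementation.

```python
# Minimal sketch, assuming the nbconvert/nbclient setup from the traceback.
# The signature and return convention are assumptions, not the current
# proc_examples.py implementation.
from pathlib import Path

import nbformat
from nbclient.exceptions import CellTimeoutError
from nbconvert.preprocessors import ExecutePreprocessor


def execute_notebook(src: Path, timeout: int = 1200) -> bool:
    """Execute a notebook; report a timeout as a warning instead of a failure."""
    nb = nbformat.read(src, as_version=4)
    executor = ExecutePreprocessor(timeout=timeout)
    try:
        executor.preprocess(nb, {"metadata": {"path": src.parent}})
    except CellTimeoutError as err:
        # Downgrade timeouts to a logged warning so one slow cell
        # doesn't fail the whole docs build
        print(f"WARNING: {src} timed out after {timeout} s: {err}")
        return False
    return True
```

The CI step could then tally the timed-out notebooks separately from genuine execution errors and only fail on the latter.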

mattwthompson commented 1 week ago

That seems like an okay band-aid; if the cause is something as mundane as the runners getting slower hardware, it would cut down on this noise. But if a regression in a new release legitimately made some common operation absurdly slow, it would go undetected.

If a few extra CPU cores would help, what about running on better hardware, either provided by GitHub (reliable but expensive) or through our new tooling that hooks into AWS?