adi611 closed this pull request 1 year ago
Patch coverage has no change and project coverage change: -0.10% :warning:

Comparison is base (29d3d1f) 82.88% compared to head (0f646d2) 82.79%.

:exclamation: Current head 0f646d2 differs from pull request most recent head 3d687b0. Consider uploading reports for the commit 3d687b0 to get more accurate results.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
@adi611 @djarecka Interesting, it looks like testing fails on macOS regardless of the Python version.
The tests succeed on Linux for Python 3.9, 3.10 and 3.11, though there are intermittent failures due to some tests running forever without triggering a timeout. I find this quite concerning, as we probably don't want Pydra jobs to run indefinitely for unknown reasons.
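One way to guard against tests hanging forever is the pytest-timeout plugin; a minimal sketch (the 300-second limit is an arbitrary choice, and the test name is illustrative, not one of ours):

```python
# Requires the pytest-timeout plugin: pip install pytest-timeout
import pytest

@pytest.mark.timeout(300)  # fail after 5 minutes instead of hanging forever
def test_wf_finishes():  # illustrative test name
    ...

# Or enforce a global limit for the whole run:
#   pytest --timeout=300 pydra/engine/tests/test_workflow.py
```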
pydra/engine/tests/test_workflow.py::test_wf_3nd_st_1[dask] RERUN [ 88%]
This test takes very long to run.
After more than an hour:
=========================== short test summary info ============================
FAILED pydra/engine/tests/test_workflow.py::test_wf_3nd_st_1[dask] - Exception: graph is not empty, but not able to get more tasks - something may have gone wrong when retrieving the results of predecessor tasks. This could be caused by a file-system error or a bug in the internal workflow logic, but is likely to be caused by the hash of an upstream node being unstable.
Hash instability can be caused by an input of the node being modified in place, or by pseudo-random ordering of `set` or `frozenset` inputs (or nested attributes of inputs) in the hash calculation. To ensure that sets are hashed consistently you can try setting the environment variable PYTHONHASHSEED=0 for all processes, but it is best to try to identify where the set objects are occurring and manually hash their sorted elements. (or use list objects instead)
Blocked tasks
-------------
mult (FunctionTask_9119c450eb4aba771bfa0b0d61c16836) is blocked by add2x (FunctionTask_32acf4c9930ee17f343a230ee86c85d3), which matches names of []; add2y (FunctionTask_ad4669811dc84a9749ce5e6c6ecc1204), which matches names of []
= 1 failed, 546 passed, 374 skipped, 3 xfailed, 84 warnings, 3 rerun in 3987.10s (1:06:27) =
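For illustration, the hash instability the error message describes comes from set iteration order; a minimal sketch of the difference (function names are illustrative, not Pydra internals):

```python
import hashlib

def unstable_digest(values: set) -> str:
    # Iteration order over a set of strings depends on hash randomization
    # (PYTHONHASHSEED), so two processes can produce different digests
    # for the same set.
    h = hashlib.md5()
    for v in values:
        h.update(str(v).encode())
    return h.hexdigest()

def stable_digest(values: set) -> str:
    # Hashing the sorted elements removes the order dependence, as the
    # error message suggests.
    h = hashlib.md5()
    for v in sorted(values):
        h.update(str(v).encode())
    return h.hexdigest()
```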
@ghisvail - yes, I had a problem on my laptop.
How is this related to #673? Should we close one of them? If you only run test_workflow (as in #673), does everything work?
Should I add a commit updating the GA workflow file to run the tests as two different jobs - one for test_workflow.py and one for the rest?
On ubuntu-latest: test_wf_3nd_st_1[dask] in test_workflow.py fails after running for a considerable time; the other tests pass.

On macos-latest: test_duplicate_input_on_split_wf in test_workflow.py fails due to a timeout; the other tests pass. A point to note is that this test runs on the cf plugin only, and it passes in other GA workflows like testpydra.yml.

Also, for ubuntu-latest, the Dask GA workflow previously failed for both test_duplicate_input_on_split_wf and test_inner_outer_wf_duplicate due to timeouts, but after the recent commits to the Pydra repo it now passes for both.
A first remark: the logs contain a lot of `UserWarning: Port 8787 is already in use.`, which might indicate that we are not cleaning up Dask resources properly during the tests.
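For reference, a minimal sketch of the kind of teardown that would avoid the port clash, using distributed's context managers (this is an assumed pattern, not the current DaskWorker code):

```python
from distributed import Client, LocalCluster

# Context managers guarantee the cluster and client are torn down even if
# a test fails, and dashboard_address=":0" picks a free port instead of
# the default 8787, so concurrent tests don't collide.
with LocalCluster(n_workers=2, dashboard_address=":0") as cluster:
    with Client(cluster) as client:
        result = client.submit(sum, [1, 2, 3]).result()
```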
I was also wondering: do we have a test workflow for Dask parallelization, not too trivial, that we could exercise outside of pytest?
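Something along these lines could work as a standalone exercise; a sketch against the pydra 0.x API, mirroring the shape of test_wf_3nd_st_1 (the task bodies and input values are illustrative):

```python
import pydra

@pydra.mark.task
def add2(x):
    return x + 2

@pydra.mark.task
def mult(x, y):
    return x * y

# A small two-branch workflow with an outer split, runnable directly
# with the dask plugin, no pytest involved.
wf = pydra.Workflow(name="wf", input_spec=["x", "y"])
wf.add(add2(name="add2x", x=wf.lzin.x))
wf.add(add2(name="add2y", x=wf.lzin.y))
wf.add(mult(name="mult", x=wf.add2x.lzout.out, y=wf.add2y.lzout.out))
wf.split(["x", "y"], x=[1, 2], y=[10, 20])
wf.set_output([("out", wf.mult.lzout.out)])

with pydra.Submitter(plugin="dask") as sub:
    sub(wf)

print(wf.result())
```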
More clues that there is probably something going on with Dask resources:
/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/site-packages/distributed/client.py:1542: RuntimeWarning: coroutine 'wait_for' was never awaited
self.close()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/site-packages/distributed/client.py:1542: RuntimeWarning: coroutine 'Client._close' was never awaited
self.close()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
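These warnings typically mean a close coroutine was created but never awaited, e.g. calling `close()` on an asynchronous client from synchronous code. A sketch of the pattern that awaits the shutdown properly (assuming distributed's async API, not the current DaskWorker code):

```python
import asyncio
from distributed import Client

async def main():
    # With asynchronous=True, `async with` awaits the client's close
    # coroutine on exit instead of leaving it pending.
    async with Client(asynchronous=True, processes=False) as client:
        total = await client.submit(sum, [1, 2, 3])
        print(total)

asyncio.run(main())
```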
I re-ran the tests for Python 3.10 on ubuntu-latest and they failed this time, sadly.
@adi611 - so is it exactly the same version that was running fine when each test file was run separately? In my case it doesn't change anything...
I've tried to debug today on my OSX but haven't got too far yet... :(
> I re-ran the tests for Python 3.10 on ubuntu-latest and they failed this time, sadly.

It is confusing to debug, since even for the same environment the results are inconsistent.
@adi611 - I think I fixed the Dask worker. You can either accept the PR that I made to your repository/branch or move to #689. You can also check how it works on your local machine or Colab:
[adi611-patch-updatedask-1](/adi611/pydra/tree/adi611-patch-updatedask-1)
I tried it on my local machine as well as on Colab and no tests fail! Thank you for the help!
Ok, great! I will merge this now. I think it runs much faster now, and we could run more tests with the Dask plugin, but we can do that in a separate pull request.
Types of changes

Summary

Fixes the DaskWorker and adds a GitHub Actions workflow file called testdask.yml for Dask tests.

Checklist