nipype / pydra

Pydra Dataflow Engine
https://nipype.github.io/pydra/

Fix DaskWorker and add GitHub Actions workflow for Dask tests #686

Closed adi611 closed 1 year ago

adi611 commented 1 year ago

Summary

Fixes the DaskWorker and adds a GitHub Actions workflow file, testdask.yml, for the Dask tests.
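
For context, a minimal sketch of the general shape of a Dask-based worker (an illustration only, not the actual code in pydra/engine/workers.py; the `task._run` and `rerun` names are placeholders): each task is submitted to a `dask.distributed.Client` running in asynchronous mode and its future is awaited.

```python
import asyncio

from dask.distributed import Client


class SketchDaskWorker:
    """Illustrative worker that runs pydra-style tasks on a Dask cluster."""

    def __init__(self, **client_kwargs):
        # Keyword arguments are forwarded to dask.distributed.Client
        # (e.g. the address of an existing scheduler).
        self.client_kwargs = client_kwargs

    async def exec_dask(self, task, rerun=False):
        # Create the client inside the running event loop and close it again
        # afterwards so schedulers and dashboards do not leak between runs.
        async with Client(asynchronous=True, **self.client_kwargs) as client:
            future = client.submit(task._run, rerun)
            return await future


# Usage (illustrative): asyncio.run(SketchDaskWorker().exec_dask(some_task))
```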


codecov[bot] commented 1 year ago

Codecov Report

Patch coverage has no change and project coverage change: -0.10% :warning:

Comparison is base (29d3d1f) 82.88% compared to head (0f646d2) 82.79%.

:exclamation: Current head 0f646d2 differs from pull request most recent head 3d687b0. Consider uploading reports for the commit 3d687b0 to get more accurate results

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #686      +/-   ##
==========================================
- Coverage   82.88%   82.79%   -0.10%
==========================================
  Files          22       22
  Lines        4845     4848       +3
  Branches     1391        0    -1391
==========================================
- Hits         4016     4014       -2
- Misses        825      834       +9
+ Partials        4        0       -4
```

| [Flag](https://app.codecov.io/gh/nipype/pydra/pull/686/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/nipype/pydra/pull/686/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype) | `82.79% <0.00%> (-0.10%)` | :arrow_down: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files Changed](https://app.codecov.io/gh/nipype/pydra/pull/686?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype) | Coverage Δ | |
|---|---|---|
| [pydra/engine/workers.py](https://app.codecov.io/gh/nipype/pydra/pull/686?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype#diff-cHlkcmEvZW5naW5lL3dvcmtlcnMucHk=) | `19.15% <0.00%> (-0.13%)` | :arrow_down: |

... and [3 files with indirect coverage changes](https://app.codecov.io/gh/nipype/pydra/pull/686/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype)


ghisvail commented 1 year ago

@adi611 @djarecka Interesting,

It looks like the tests fail on macOS regardless of the Python version.

The tests succeed on Linux for Python 3.9, 3.10, and 3.11, though there are intermittent failures due to some tests running forever without triggering a timeout. This is quite concerning, I think, as we probably don't want Pydra jobs to run indefinitely for unknown reasons.
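
One way to keep a hung task from blocking CI forever (a sketch only, not something this PR implements; the helper name and the 600-second default are made up) is to bound each Dask submission with `asyncio.wait_for`:

```python
import asyncio

from dask.distributed import Client


async def run_with_timeout(fn, *args, timeout=600):
    """Submit `fn` to a throwaway Dask client and fail if it exceeds `timeout` seconds."""
    async with Client(asynchronous=True) as client:
        future = client.submit(fn, *args)
        # client.gather returns a coroutine in asynchronous mode; wait_for cancels
        # it and raises TimeoutError on expiry, so a hung task surfaces as a test
        # failure instead of running indefinitely.
        return await asyncio.wait_for(client.gather(future), timeout=timeout)


# Usage (illustrative):
# print(asyncio.run(run_with_timeout(sum, [1, 2, 3])))
```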

ghisvail commented 1 year ago

```
pydra/engine/tests/test_workflow.py::test_wf_3nd_st_1[dask] RERUN [ 88%]
```

This test takes a very long time to run.

ghisvail commented 1 year ago

After more than an hour:

```
=========================== short test summary info ============================
FAILED pydra/engine/tests/test_workflow.py::test_wf_3nd_st_1[dask] - Exception: graph is not empty, but not able to get more tasks - something may have gone wrong when retrieving the results of predecessor tasks. This could be caused by a file-system error or a bug in the internal workflow logic, but is likely to be caused by the hash of an upstream node being unstable.

Hash instability can be caused by an input of the node being modified in place, or by psuedo-random ordering of `set` or `frozenset` inputs (or nested attributes of inputs) in the hash calculation. To ensure that sets are hashed consistently you can you can try set the environment variable PYTHONHASHSEED=0 for all processes, but it is best to try to identify where the set objects are occurring and manually hash their sorted elements. (or use list objects instead)

Blocked tasks
-------------

mult (FunctionTask_9119c450eb4aba771bfa0b0d61c16836) is blocked by add2x (FunctionTask_32acf4c9930ee17f343a230ee86c85d3), which matches names of []; add2y (FunctionTask_ad4669811dc84a9749ce5e6c6ecc1204), which matches names of []
= 1 failed, 546 passed, 374 skipped, 3 xfailed, 84 warnings, 3 rerun in 3987.10s (1:06:27) =
```
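
To make the hash-stability hint in that error message concrete, here is a small illustration (this is not pydra's actual hashing code, and the input values are made up) of why iterating a `set` directly gives a digest that depends on PYTHONHASHSEED, while hashing the sorted elements does not:

```python
import hashlib


def order_dependent_hash(values: set) -> str:
    # Iteration order of a set of strings depends on PYTHONHASHSEED, so this
    # digest can change between interpreter runs (and hence between workers).
    h = hashlib.blake2b(digest_size=16)
    for v in values:
        h.update(v.encode())
    return h.hexdigest()


def stable_hash(values: set) -> str:
    # Sorting the elements first removes the dependence on iteration order.
    h = hashlib.blake2b(digest_size=16)
    for v in sorted(values):
        h.update(v.encode())
    return h.hexdigest()


if __name__ == "__main__":
    inputs = {"sub-01", "sub-02", "sub-03"}
    print(order_dependent_hash(inputs))  # may differ from run to run
    print(stable_hash(inputs))           # always the same digest
```
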
djarecka commented 1 year ago

@ghisvail - yes, I had a problem on my laptop.

How is this related to #673 - should we close one of them? If you only run test_workflow (as in #673), does everything work?

adi611 commented 1 year ago

Should I add a commit updating the GA workflow file to run the tests as two separate jobs - one for test_workflow.py and one for the rest?

adi611 commented 1 year ago

On ubuntu-latest:

On macos-latest:

Also, for ubuntu-latest, the Dask GA workflow previously failed for both test_duplicate_input_on_split_wf and test_inner_outer_wf_duplicate due to timeouts, but after the recent commits to the Pydra repo it now passes for both.

ghisvail commented 1 year ago

A first remark: the logs contain a lot of `UserWarning: Port 8787 is already in use`, which might indicate that we are not cleaning up Dask resources properly during the tests.
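
One possible mitigation (a sketch under assumptions, not what this PR does): give each test run its own `LocalCluster` with a randomized, or disabled, dashboard port, and rely on context managers so the client and cluster are always shut down:

```python
from dask.distributed import Client, LocalCluster


def run_on_fresh_cluster(fn, *args):
    # dashboard_address=":0" asks for any free port instead of the default 8787,
    # which avoids "Port 8787 is already in use" when tests overlap; passing None
    # disables the dashboard entirely.
    with LocalCluster(n_workers=2, threads_per_worker=1, dashboard_address=":0") as cluster:
        # The nested context managers guarantee the client and cluster are closed
        # even if the submitted function raises.
        with Client(cluster) as client:
            return client.submit(fn, *args).result()


# Usage (illustrative):
# print(run_on_fresh_cluster(sum, [1, 2, 3]))
```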

ghisvail commented 1 year ago

I was also wondering: do we have a test workflow for Dask parallelization, not too trivial, which we could exercise outside of pytest?
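
Something along these lines could serve as a standalone smoke test (a sketch based on my reading of the pydra 0.x API; the task and workflow names are made up, and I have not verified it against this branch):

```python
import pydra


@pydra.mark.task
def add2(x):
    return x + 2


@pydra.mark.task
def mult(x, y):
    return x * y


def main():
    # A small fan-out workflow: split over x, add 2 to each element, then
    # multiply by y, so the Dask worker has genuinely parallel work to schedule.
    wf = pydra.Workflow(name="smoke_wf", input_spec=["x", "y"])
    wf.split("x", x=list(range(10)))
    wf.inputs.y = 10
    wf.add(add2(name="add2", x=wf.lzin.x))
    wf.add(mult(name="mult", x=wf.add2.lzout.out, y=wf.lzin.y))
    wf.set_output([("out", wf.mult.lzout.out)])

    with pydra.Submitter(plugin="dask") as sub:
        sub(wf)
    print(wf.result())


if __name__ == "__main__":
    main()
```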

ghisvail commented 1 year ago

More clues that there is probably something going on with Dask resources:

```
/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/site-packages/distributed/client.py:1542: RuntimeWarning: coroutine 'wait_for' was never awaited
  self.close()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/site-packages/distributed/client.py:1542: RuntimeWarning: coroutine 'Client._close' was never awaited
  self.close()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
```
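
Those `coroutine ... was never awaited` warnings are consistent with an asynchronous `Client` being closed synchronously: when the client is created with `asynchronous=True`, `Client.close()` returns a coroutine that has to be awaited. A minimal illustration (not the code in this PR):

```python
import asyncio

from dask.distributed import Client


async def close_properly():
    client = await Client(asynchronous=True)
    try:
        pass  # submit work and await futures here
    finally:
        # In asynchronous mode close() returns a coroutine; awaiting it lets the
        # scheduler and workers shut down cleanly.
        await client.close()


async def close_improperly():
    client = await Client(asynchronous=True)
    # Forgetting the await leaves the coroutine pending and produces
    # "RuntimeWarning: coroutine 'Client._close' was never awaited".
    client.close()


# Usage (illustrative): asyncio.run(close_properly())
```
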
ghisvail commented 1 year ago

I re-ran the tests for Python 3.10 on `ubuntu-latest` and it failed this time, sadly.

djarecka commented 1 year ago

@adi611 - so is it exactly the same version that was running fine when each test file was run separately? In my case it doesn't change anything...

I've tried to debug today on my OSX machine but haven't got too far yet... :(

adi611 commented 1 year ago

> I re-ran the tests for Python 3.10 on `ubuntu-latest` and it failed this time, sadly.

It is confusing to debug, since even for the same environment the results are inconsistent.

djarecka commented 1 year ago

@adi611 - I think I fixed the dask worker. You can either accept the PR that I made to your repository/branch or move to #689. You can also check how it works on your local machine or on Colab.

adi611 commented 1 year ago

[adi611-patch-updatedask-1](/adi611/pydra/tree/adi611-patch-updatedask-1)

I tried it on my local machine as well as on Colab, and no tests fail! Thank you for the help!

djarecka commented 1 year ago

ok, great! I will merge this now. I think it runs much faster now, and we could run more tests with the dask plugin, but we can do that in a separate pull request.