Draft: Adding new worker which uses PSI/J to run tasks

adi611 commented 10 months ago

Types of changes

New feature (non-breaking change which adds functionality)

Summary

Add a new worker called PsijWorker to workers.py which uses PSI/J to run tasks.

Checklist

[ ] I have added tests to cover my changes (if necessary)
[ ] I have updated documentation (if necessary)

codecov[bot] commented 10 months ago

Codecov Report

Patch coverage: 21.87% and project coverage change: -1.61% :warning:

Comparison is base (0245cdc) 83.42% compared to head (28ec2fd) 81.82%.

:exclamation: Current head 28ec2fd differs from pull request most recent head 2c695d5. Consider uploading reports for the commit 2c695d5 to get more accurate results

Additional details and impacted files

```diff @@ Coverage Diff @@ ## master #694 +/- ## ========================================== - Coverage 83.42% 81.82% -1.61% ========================================== Files 22 23 +1 Lines 4894 4990 +96 Branches 1410 0 -1410 ========================================== Hits 4083 4083 - Misses 807 907 +100 + Partials 4 0 -4 ``` | [Flag](https://app.codecov.io/gh/nipype/pydra/pull/694/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype) | Coverage Δ | | |---|---|---| | [unittests](https://app.codecov.io/gh/nipype/pydra/pull/694/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype) | `81.82% <21.87%> (-1.61%)` | :arrow_down: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype#carryforward-flags-in-the-pull-request-comment) to find out more. | [Files Changed](https://app.codecov.io/gh/nipype/pydra/pull/694?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype) | Coverage Δ | | |---|---|---| | [pydra/engine/workers.py](https://app.codecov.io/gh/nipype/pydra/pull/694?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype#diff-cHlkcmEvZW5naW5lL3dvcmtlcnMucHk=) | `18.31% <11.66%> (-4.42%)` | :arrow_down: | | [pydra/engine/run\_pickled.py](https://app.codecov.io/gh/nipype/pydra/pull/694?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype#diff-cHlkcmEvZW5naW5lL3J1bl9waWNrbGVkLnB5) | `23.80% <23.80%> (ø)` | | | [pydra/conftest.py](https://app.codecov.io/gh/nipype/pydra/pull/694?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype#diff-cHlkcmEvY29uZnRlc3QucHk=) | `68.62% <60.00%> (-9.16%)` | :arrow_down: | ... and [3 files with indirect coverage changes](https://app.codecov.io/gh/nipype/pydra/pull/694/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype)

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

djarecka commented 10 months ago

i'm trying to understand why the slurm test fails now (last run I cancelled, but it took 5h...), since you didn't any new test here yet. any idea?

djarecka commented 10 months ago

@adi611 - perhaps you can merge current master and try new actions/checkout that was introduced. I've run test_shelltask.py with the docker image on my laptop and it was fine

adi611 commented 10 months ago

@adi611 - perhaps you can merge current master and try new actions/checkout that was introduced. I've run test_shelltask.py with the docker image on my laptop and it was fine

Ok I will try this out

adi611 commented 10 months ago

i'm trying to understand why the slurm test fails now (last run I cancelled, but it took 5h...), since you didn't any new test here yet. any idea?

I couldn't find any reason for the workflow to not work here, when it works elsewhere. I will try the new actions/checkout and check if the problem persists.

adi611 commented 10 months ago

I have updated the worker. Currently I am using the slurm instance for psij.JobExecutor to run Slurm tests.

Prerequisite:

The local pydra repository must be mounted as /pydra as done in the Slurm GA workflow
Install psij-python in the docker container
Change the plugin for the test to psij or make changes in conftest file as discussed in #691

Features:

Can run the Slurm tests individually

Known issues:

Tests fail when running many simultaneously. For e.g., pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_4 pass but test_shell_cmd_4 fails when testing all tests in test_shelltask.py.
Not clean code

I am working on fixing these issues.

djarecka commented 9 months ago

@adi611 - I've had problems with my MIT account during the weekend, so I was not able to test it, but hopefully will be able to solve my issues tomorrow

adi611 commented 9 months ago

@djarecka - I think the issue arises when running the tests with the psij plugin and -n auto mode of pytest. I tried running all the tests in test_shelltask.py:

Using the local instance of the psij job executor, with the command pytest --color=yes -v pydra/engine/tests/test_shelltask.py. Here is the result: (2 tests fail, others run and pass without any issue)

========================================== short test summary info ==========================================
FAILED pydra/engine/tests/test_shelltask.py::test_shell_cmd_outputspec_1b_exception[psij] - Failed: DID NOT RAISE <class 'Exception'>
FAILED pydra/engine/tests/test_shelltask.py::test_shell_cmd_outputspec_2a_exception[psij] - Failed: DID NOT RAISE <class 'Exception'>
====================== 2 failed, 147 passed, 1 skipped, 3 xfailed in 388.45s (0:06:28) ======================

Using the Slurm container and the slurm instance of psij job executor (without -n auto): the tests run and pass, but at a very slow pace
Using the Slurm container and the slurm instance of psij job executor (with -n auto): many tests fail

adi611 commented 9 months ago

I have fixed the issue with pytest's -n auto mode; it can now run multiple tests at once.

djarecka commented 9 months ago

@adi611 - I've tried to run the tests with new worker on our cluster, but tests are failing, I'm not able to get results when running with Submitter. Will try to check debug it in the next few days

adi611 commented 9 months ago

@adi611 - I've tried to run the tests with new worker on our cluster, but tests are failing, I'm not able to get results when running with Submitter. Will try to check debug it in the next few days

I think the issue might be with the paths to files and functions (like /pydra/pydra/engine/run_pickled_function.py), and/or the python3.9 command. I used the previous Slurm container as reference for writing this and I should make it more generalized now. Could you please check if the issue still persists after correcting the paths and python command? Meanwhile I will update the PR.

djarecka commented 9 months ago

yes, it should be more generalized, you can always use the task output_dir

adi611 commented 9 months ago

I think it should work now

adi611 commented 9 months ago

Added the option to switch between the different schedulers provided by psij. For e.g., plugin='psij-local' and plugin='psij-slurm' will use local and slurm instances for psij's job executor respectively. For now, the list contains only local and slurm options.

djarecka commented 9 months ago

could you please modify conftest so we can start testing the new workers in GA.

I've started testing psij-local on my osx and 5 tests are failing. Could you try to debug them?

I also run a subset of tests on slurm on the MIT cluster and it looks like they are fixed! :) but I will test more tomorrow!

adi611 commented 9 months ago

could you please modify conftest so we can start testing the new workers in GA.

I'm currently working on it.

I've started testing psij-local on my osx and 5 tests are failing. Could you try to debug them?

Can you please tell me which tests are failing?

I also run a subset of tests on slurm on the MIT cluster and it looks like they are fixed! :) but I will test more tomorrow!

Great!

adi611 commented 9 months ago

The psij-local and psij-slurm pass all the required tests, though psij-slurm takes much longer than the actual slurm plugin. I will update the conftest.py file to resolve the conflicts with other tests. Please let me know if I need to make other changes.

adi611 commented 9 months ago

I will also try to improve the performance of psij-slurm

djarecka commented 9 months ago

with new version the psij-local works on my osx! will check the new version of slurm tomorrow morning.

for the conftest, you could just expand the list of worker and add psij-local to cf and psij-slurm to slurm and see how long it takes on GA

adi611 commented 9 months ago

for the conftest, you could just expand the list of worker and add psij-local to cf and psij-slurm to slurm and see how long it takes on GA

I have updated the conftest.py to add the psij plugin options. The fixtures --psij=local and --psij=slurm correspond to the psij-local and psij-slurm plugins respectively. Also, I have created specific GA workflows for the psij plugins: PSI/J-Local and PSI/J-SLURM.

adi611 commented 9 months ago

For PSI/J-SLURM it took 1h 56m 35s to complete, as can be seen here.

adi611 commented 9 months ago

@djarecka - I checked the test_dockertask tests in PSI/J-Local / test and all the tests which pass for cf pass for psij-local, for e.g., test_docker_1 pass while test_docker_3 gets skipped, for both cf and psij-local.

nipype / pydra