nipype / pydra

Pydra Dataflow Engine
https://nipype.github.io/pydra/

Update DOCKER_IMAGE tag in testslurm.yml #697

Closed · adi611 closed this 9 months ago

adi611 commented 10 months ago

Summary

Update the `DOCKER_IMAGE` tag in `testslurm.yml` from `latest` to `21.08.6`.

codecov[bot] commented 10 months ago

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.32% :warning:

Comparison is base (31aea01) 83.31% compared to head (e797007) 82.99%. Report is 15 commits behind head on master.

:exclamation: Current head e797007 differs from pull request most recent head afeb705. Consider uploading reports for the commit afeb705 to get more accurate results

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #697      +/-   ##
==========================================
- Coverage   83.31%   82.99%   -0.32%
==========================================
  Files          22       22
  Lines        4873     4894      +21
  Branches     1401        0    -1401
==========================================
+ Hits         4060     4062       +2
- Misses        809      832      +23
+ Partials        4        0       -4
```

| [Flag](https://app.codecov.io/gh/nipype/pydra/pull/697/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/nipype/pydra/pull/697/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype) | `82.99% <100.00%> (-0.32%)` | :arrow_down: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype#carryforward-flags-in-the-pull-request-comment) to find out more.

| [Files Changed](https://app.codecov.io/gh/nipype/pydra/pull/697?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype) | Coverage Δ | |
|---|---|---|
| [pydra/utils/hash.py](https://app.codecov.io/gh/nipype/pydra/pull/697?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype#diff-cHlkcmEvdXRpbHMvaGFzaC5weQ==) | `95.03% <100.00%> (+2.17%)` | :arrow_up: |

... and [5 files with indirect coverage changes](https://app.codecov.io/gh/nipype/pydra/pull/697/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype)

satra commented 10 months ago

btw, you may want to see if you can build this dockerfile: https://github.com/tazend/docker-centos7-slurm/blob/1cdc401df445ecf00e2db431a99c583eda950300/Dockerfile as it contains the latest slurm

you could drop the python versions lower than 3.8.

adi611 commented 10 months ago

> btw, you may want to see if you can build this dockerfile: https://github.com/tazend/docker-centos7-slurm/blob/1cdc401df445ecf00e2db431a99c583eda950300/Dockerfile as it contains the latest slurm
>
> you could drop the python versions lower than 3.8.

Sure, I'll start working on it.

adi611 commented 9 months ago

I built the docker image; it can be found here.

```
= 16 failed, 910 passed, 88 skipped, 7 xfailed, 7 warnings, 1 rerun in 935.96s (0:15:35) =
```


- The GA workflow file I used can be found [here](https://github.com/adi611/pydra/actions/runs/6138748027/workflow)

satra commented 9 months ago

nice work @adi611 - it may be good to track down those specific tests to see what's going on, perhaps by executing and debugging why the slurm return doesn't have a job id.

at least we have an up-to-date slurm container now! for this PR, perhaps switch to your newly built slurm container.

djarecka commented 9 months ago

@adi611 - Have you tried to repeat the run? Do you have the errors every time you run? I've run the test with your new image on my laptop and can't reproduce the errors...

But I've run the GA in this PR twice and it seems to work fine.

fyi, I still don't have access to the MIT slurm computers, so I can't compare.

adi611 commented 9 months ago

Should I create a PR to run the tests using all the available python versions (3.8.16, 3.9.16, 3.10.9, 3.11.1) for the container or using just the default version (3.11.1)?

satra commented 9 months ago

let's just get the default working. it may be overkill to try all at the moment. they are already being tested normally outside of slurm.

adi611 commented 9 months ago

> @adi611 - Have you tried to repeat the run? Do you have the errors every time you run? I've run the test with your new image on my laptop and can't reproduce the errors...
>
> But I've run the GA in this PR twice and it seems to work fine.
>
> fyi, I still don't have access to the MIT slurm computers, so I can't compare.

Yes, I did re-run the workflow but I still got the same errors.

adi611 commented 9 months ago

> let's just get the default working. it may be overkill to try all at the moment. they are already being tested normally outside of slurm.

Ok sure.

satra commented 9 months ago

@adi611 - it looks like this now returns the same error as your list. perhaps you can check if you can reproduce one of those errors by limiting pytest to just that test. also i think @djarecka tested the original slurm container, not the new one.

djarecka commented 9 months ago

@satra - I also tested the new one

@adi611 - you can also try to remove `-n auto` from the pytest command

djarecka commented 9 months ago

just want to confirm that with `-n auto` I also see errors running in the container on my laptop (earlier I missed the fact that GA runs with `-n`).

I don't understand why `-n` leads to the error in this case, but I would say that if running all the tests serially doesn't lead to the issue, we could go with that for now.

adi611 commented 9 months ago

I think there may be some confusion.

adi611 commented 9 months ago

> @adi611 - it looks like this now returns the same error as your list. perhaps you can check if you can reproduce one of those errors by limiting pytest to just that test. also i think @djarecka tested the original slurm container, not the new one.

Yes, the errors exist even when limiting pytest to a single test from the list of failed tests, as can be seen here.

djarecka commented 9 months ago

I was suggesting removing the `-n` option just based on what I see when running the tests with the container on my laptop.


djarecka commented 9 months ago

@adi611 - I've just checked the GA reports and I see that there is `-n auto` in the pytest command.

adi611 commented 9 months ago

I ran it separately here. But I should update the pytest command in the current PR to check if it runs fine without the `-n` option.

djarecka commented 9 months ago

ok, it looks like for GA, the option doesn't really make any difference...

Do you also see the same error when running on your laptop?

adi611 commented 9 months ago

Yes, I tried it on my laptop and I get the same `RuntimeError: Could not extract job ID` error for the failed tests.

satra commented 9 months ago

@adi611 - have you tried debugging that specific error on your laptop by running that single test by hand inside the instance?

adi611 commented 9 months ago

> @adi611 - have you tried debugging that specific error on your laptop by running that single test by hand inside the instance?

When I try to run the failed tests individually on my laptop, most of them pass, except for a few which still give the `RuntimeError: Could not extract job ID` error. From the error logs it seemed to be something related to asyncio in python 3.11, so I tried the same with python 3.9.16 and no tests fail.

satra commented 9 months ago

@adi611 - for those tests, could you enable more debugging to check what it is in the output that prevents getting the job id? the job id is usually read from the stdout. what does that look like?

adi611 commented 9 months ago

I will do that now and let you know.

adi611 commented 9 months ago

> @adi611 - for those tests, could you enable more debugging to check what it is in the output that prevents getting the job id? the job id is usually read from the stdout. what does that look like?

The stdout looks something like `Submitted batch job 1745`.
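
For context, the job ID is normally parsed out of exactly that kind of sbatch message; a minimal sketch of the idea (a hypothetical helper, not pydra's actual `workers.py` code, which wraps this in async submission logic):

```python
import re

def extract_job_id(stdout: str) -> int:
    """Parse the job ID out of sbatch's submission message.

    sbatch normally prints something like ``Submitted batch job 1745``;
    if stdout comes back empty (e.g. the subprocess was interrupted before
    its output was read), there is nothing to match and we fail in the same
    spirit as the worker's "Could not extract job ID" error.
    """
    match = re.search(r"Submitted batch job (\d+)", stdout)
    if match is None:
        raise RuntimeError(f"Could not extract job ID from {stdout!r}")
    return int(match.group(1))

print(extract_job_id("Submitted batch job 1745"))  # 1745
```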

satra commented 9 months ago

was that the stdout for a task that failed? if so, you could perhaps look at where the exception is being raised and check if there is an intervening check, like asking sacct or scontrol whether the job has been queued.

adi611 commented 9 months ago

The issue seems to be more random, and involves the two stdouts at lines 292 and 341 of `workers.py`. Sometimes the first stdout doesn't return anything, in which case the exception is `Could not extract job ID`; otherwise, when the second stdout doesn't return anything, the exception is `Job information not found`; or they both might return nothing. When they both return something, the tests which previously failed actually pass.

adi611 commented 9 months ago

> was that the stdout for a task that failed? if so, you could perhaps look at where the exception is being raised and check if there is an intervening check, like asking sacct or scontrol whether the job has been queued.

I am unable to find such a check.

adi611 commented 9 months ago

Logs:

```
=========================== short test summary info ============================
FAILED pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_7[slurm] - RuntimeError: Could not extract job ID
========================= 1 failed, 1 warning in 9.04s =========================
Exception ignored in: <coroutine object SlurmWorker._submit_job at 0x7f9bc66d8ea0>
Traceback (most recent call last):
  File "/pydra/pydra/engine/workers.py", line 314, in _submit_job
  File "/pydra/pydra/engine/workers.py", line 336, in _poll_job
  File "/pydra/pydra/engine/helpers.py", line 331, in read_and_display_async
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/subprocess.py", line 218, in create_subprocess_exec
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_events.py", line 1688, in subprocess_exec
RuntimeError: coroutine ignored GeneratorExit
Exception ignored in: <coroutine object SlurmWorker._submit_job at 0x7f9bc66d8bc0>
Traceback (most recent call last):
  File "/pydra/pydra/engine/workers.py", line 293, in _submit_job
RuntimeError: coroutine ignored GeneratorExit
Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x7f9bdc8958a0>
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_subprocess.py", line 126, in __del__
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_subprocess.py", line 104, in close
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/unix_events.py", line 558, in close
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/unix_events.py", line 582, in _close
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_events.py", line 761, in call_soon
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_events.py", line 519, in _check_closed
RuntimeError: Event loop is closed
```

(The line numbers for `workers.py` may not exactly match the remote due to my local changes.)

adi611 commented 9 months ago

The same test:

```
============================= test session starts ==============================
platform linux -- Python 3.11.1, pytest-7.4.2, pluggy-1.3.0 -- /root/.pyenv/versions/3.11.1/bin/python3.11
cachedir: .pytest_cache
rootdir: /pydra
plugins: rerunfailures-12.0, cov-4.1.0, forked-1.6.0, timeout-2.1.0, env-1.0.1, xdist-1.34.0
collecting ... collected 1 item

pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_7[slurm] PASSED

=============================== warnings summary ===============================
pydra/engine/tests/test_shelltask.py::test_shell_cmd_7[slurm]
  /pydra/pydra/engine/helpers.py:469: DeprecationWarning: There is no current event loop
    loop = asyncio.get_event_loop()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html

---------- coverage: platform linux, python 3.11.1-final-0 -----------
Coverage XML written to file /pydra/cov.xml

======================== 1 passed, 1 warning in 10.45s =========================
```

satra commented 9 months ago

instead of just running it, you can run the test with the pdb option; you can also enable debug logging around the relevant parts of the code. that may give you more insight into the state of the system. i suspect this is a resource contention issue with the slurm database, and a couple of retries could help, or looking at stderr and stdout.
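
As an illustration of the retry idea, here is a minimal sketch of retrying a polling command such as `sacct` when it returns no output; the helper name and flags are hypothetical, and this is not pydra's actual worker code:

```python
import asyncio

async def run_with_retries(cmd, attempts=3, delay=1.0):
    """Run a command and retry while it produces no stdout.

    A sketch of the retry idea only - the real submission/polling logic
    lives in pydra/engine/workers.py and uses pydra's own async helpers.
    """
    stderr = b""
    for attempt in range(1, attempts + 1):
        proc = await asyncio.create_subprocess_exec(
            *cmd,
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        stdout, stderr = await proc.communicate()
        if stdout.strip():
            return stdout.decode()
        # slurm's accounting database can lag just after submission, so
        # back off briefly and ask again instead of failing right away
        await asyncio.sleep(delay * attempt)
    raise RuntimeError(
        f"No output from {cmd!r} after {attempts} attempts; stderr: {stderr.decode()!r}"
    )

# e.g. asyncio.run(run_with_retries(["sacct", "-j", "1745", "-o", "JobID,State", "-n"]))
```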

djarecka commented 9 months ago

@adi611 - I've noticed that some tests do not use pytest's `tmp_path` fixture as `cache_dir` (see `cache_dir=tmp_path` in most of the tests); this can sometimes lead to issues. Could you fix the tests and see if that helps?
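
For reference, this is roughly the pattern being suggested; a hypothetical test sketch (not one of the failing tests) using pydra's function-task API with `cache_dir=tmp_path`:

```python
import pydra

@pydra.mark.task
def add_two(x):
    return x + 2

def test_add_two_cache_dir(tmp_path):
    # cache_dir=tmp_path keeps each test's cache isolated and cleaned up by
    # pytest, instead of sharing a default cache location between tests
    task = add_two(name="add_two", x=3, cache_dir=tmp_path)
    with pydra.Submitter(plugin="cf") as sub:
        sub(task)
    assert task.result().output.out == 5
```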

adi611 commented 9 months ago

> @adi611 - I've noticed that some tests do not use pytest's `tmp_path` fixture as `cache_dir` (see `cache_dir=tmp_path` in most of the tests); this can sometimes lead to issues. Could you fix the tests and see if that helps?

I checked, and many of the failed tests, like `test_node_task.py::test_task_state_2` and `test_shelltask.py::test_shell_cmd_7`, already have `cache_dir=tmp_path`.

adi611 commented 9 months ago

Currently I am unable to reproduce the issue for single tests.

adi611 commented 9 months ago

This seems to be a python 3.11.1-specific issue; should I try a newer 3.11 version like 3.11.5 and see if it works? I have seen discussions on cpython with similar issues, and maybe they have rolled out a fix for it.

ghisvail commented 9 months ago

> This seems to be a python 3.11.1-specific issue; should I try a newer 3.11 version like 3.11.5 and see if it works? I have seen discussions on cpython with similar issues, and maybe they have rolled out a fix for it.

If you're testing on the current Python branch, it is best to try out the latest published version first and then bisect with previous versions if you notice a regression.

Could you reference the exact issues you think may be of interest on cpython?

adi611 commented 9 months ago

> > This seems to be a python 3.11.1-specific issue; should I try a newer 3.11 version like 3.11.5 and see if it works? I have seen discussions on cpython with similar issues, and maybe they have rolled out a fix for it.
>
> If you're testing on the current Python branch, it is best to try out the latest published version first and then bisect with previous versions if you notice a regression.
>
> Could you reference the exact issues you think may be of interest on cpython?

This is one such issue.

djarecka commented 9 months ago

thanks for tracking this! yes, please check for 3.11.5!

djarecka commented 9 months ago

it looks like it works for 3.11.5! :) just remove 3.11.1, and we will merge it. great job!

satra commented 9 months ago

is there a way to exclude a series of python versions in pydra's python config? we should add that to the PR so we know why we did this.

adi611 commented 9 months ago

> it looks like it works for 3.11.5! :) just remove 3.11.1, and we will merge it. great job!

Thanks!

djarecka commented 9 months ago

@adi611 - could you please exclude the python version in `pyproject.toml` as @satra suggested?

adi611 commented 9 months ago

I added `!=3.11.1` to `requires-python` in `pyproject.toml`, which specifies that 3.11.1 should be excluded from the list of acceptable python versions.
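
For reference, a quick check of how installers interpret that specifier, using the `packaging` library; the `>=3.8` lower bound below is an assumed placeholder, not necessarily the one in pydra's `pyproject.toml`:

```python
from packaging.specifiers import SpecifierSet

# ">=3.8" is an assumed placeholder; the relevant part is "!=3.11.1"
requires_python = SpecifierSet(">=3.8,!=3.11.1")

print("3.11.1" in requires_python)  # False - 3.11.1 is excluded
print("3.11.5" in requires_python)  # True - 3.11.5 is acceptable
```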

adi611 commented 9 months ago

The Slurm workflow for 3.11.5 is failing at the `Display previous jobs with sacct` step:

```
Run echo "Allowing ports/daemons time to start" && sleep 10
Allowing ports/daemons time to start
 This cluster 'linux' doesn't exist.
        Contact your admin to add it to accounting.
Error: Process completed with exit code 1.
```

Is it possible this is due to the sleep time of 10 seconds not being enough?

djarecka commented 9 months ago

I've just restarted it and it seems to work; if we have this issue again, we can increase the time.

djarecka commented 9 months ago

@adi611 - I've realized that your name is not in the zenodo file; please open a PR if you want your name to be included!

adi611 commented 9 months ago

> @adi611 - I've realized that your name is not in the zenodo file; please open a PR if you want your name to be included!

Thanks, I'll do that, but I need some help since I've never done it before. Is there a preferred order for including my name? Also, what should I specify as my affiliation?

djarecka commented 9 months ago

we should think about the order; for now we don't use any rule except that Satra is at the end, so you can put your name before him. It's up to you what you want to use as your affiliation; you could use your university or you can leave it empty.

satra commented 7 months ago

@adi611 - just a quick thing. can you post the slurm dockerfile somewhere?

adi611 commented 7 months ago

@satra - Sorry for the delay. I have added the Dockerfile as a public GitHub gist here.