Closed adi611 closed 9 months ago
Patch coverage: 100.00% and project coverage change: -0.32% :warning:

Comparison is base (31aea01) 83.31% compared to head (e797007) 82.99%. Report is 15 commits behind head on master.

:exclamation: Current head e797007 differs from pull request most recent head afeb705. Consider uploading reports for the commit afeb705 to get more accurate results.
btw, you may want to see if you can build this dockerfile: https://github.com/tazend/docker-centos7-slurm/blob/1cdc401df445ecf00e2db431a99c583eda950300/Dockerfile as it contains the latest slurm
you could drop the python versions lower than 3.8.
> btw, you may want to see if you can build this dockerfile: https://github.com/tazend/docker-centos7-slurm/blob/1cdc401df445ecf00e2db431a99c583eda950300/Dockerfile as it contains the latest slurm
> you could drop the python versions lower than 3.8.
Sure, I'll start working on it.
I built the docker image; it can be found here.
The workflow fails with the error `This cluster linux already exists. Not adding.` at the step `Display previous jobs with sacct` (logs here). So I removed the following:

```
docker exec slurm bash -c "sacctmgr -i add cluster name=linux \
  && supervisorctl restart slurmdbd \
  && supervisorctl restart slurmctld"
```
```
rc, stdout, stderr = await read_and_display_async(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: coroutine ignored GeneratorExit
RuntimeError: Could not extract job ID
```
```
=========================== short test summary info ============================
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_2[slurm-list-splitter1-state_splitter1-state_rpn1-expected1-expected_ind1] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_2[slurm-array-splitter0-state_splitter0-state_rpn0-expected0-expected_ind0] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_inputspec_state_1a[slurm-result_submitter] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_2[slurm-array-splitter1-state_splitter1-state_rpn1-expected1-expected_ind1] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_2[slurm-mixed-splitter1-state_splitter1-state_rpn1-expected1-expected_ind1] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_6[slurm] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_6a[slurm] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_comb_2[slurm-splitter2-a-state_splitter2-state_rpn2-state_combiner2-state_combiner_all2-NA.b-state_rpn_final2-expected2-expected_val2] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_comb_2[slurm-splitter3-b-state_splitter3-state_rpn3-state_combiner3-state_combiner_all3-NA.a-state_rpn_final3-expected3-expected_val3] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_node_task.py::test_task_state_comb_2[slurm-splitter4-combiner4-state_splitter4-state_rpn4-state_combiner4-state_combiner_all4-None-state_rpn_final4-expected4-expected_val4] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_outputspec_8c[slurm-result_submitter] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_state_outputspec_1[slurm-result_submitter] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_6[slurm] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_7[slurm] - RuntimeError: Could not extract job ID
FAILED pydra/pydra/engine/tests/test_workflow.py::test_wf_3nd_st_2[slurm] - RuntimeError: Could not find results of 'mult' node in a sub-directory name...
FAILED pydra/pydra/engine/tests/test_workflow.py::test_wfasnd_st_2[slurm] - ValueError: Tasks ['wfnd'] raised an error
= 16 failed, 910 passed, 88 skipped, 7 xfailed, 7 warnings, 1 rerun in 935.96s (0:15:35) =
```
- The GA workflow file I used can be found [here](https://github.com/adi611/pydra/actions/runs/6138748027/workflow)
nice work @adi611 - it may be good to track down those specific tests to see what's going on, perhaps by executing them and debugging why the slurm return doesn't have a job id.
at least we have an up-to-date slurm container now! for this PR, perhaps change to your newly built slurm container.
@adi611 - Have you tried to repeat the run? Do you have the errors every time you run? I've run the test with your new image on my laptop and can't reproduce the errors...
But I ran the GA twice in this PR and it seems to work fine.
fyi. still don't have access to the MIT slurm computers, so can't compare.
Should I create a PR to run the tests using all the available python versions (3.8.16, 3.9.16, 3.10.9, 3.11.1) for the container or using just the default version (3.11.1)?
let's just get the default working. it may be overkill to try all at the moment. they are already being tested normally outside of slurm.
> @adi611 - Have you tried to repeat the run? Do you have the errors every time you run? I've run the test with your new image on my laptop and can't reproduce the errors...
> But I ran the GA twice in this PR and it seems to work fine.
> fyi. still don't have access to the MIT slurm computers, so can't compare.

Yes, I did re-run the workflow but I still got the same errors.
> let's just get the default working. it may be overkill to try all at the moment. they are already being tested normally outside of slurm.

Ok, sure.
@adi611 - it looks like this now returns the same error as your list. perhaps you can check if you can reproduce one of those errors by limiting pytest to just check that test. also i think @djarecka tested the original slurm container, not the new one.
@satra - I also tested the new one
@adi611 - you can also try to remove `-n auto` from the pytest command
just want to confirm that with `-n auto` I also see errors running in the container on my laptop (earlier missed the fact that GA runs with `-n`).
I don't understand why `-n` leads to the error in this case, but I would say that if running serially all the tests don't lead to the issue, we could go with it for now.
I think there may be some confusion.

- I mentioned having issues with `-n auto` in #694
- I tried running the current GA workflow without `-n auto` and I still get the same errors

> @adi611 - it looks like this now returns the same error as your list. perhaps you can check if you can reproduce one of those errors by limiting pytest to just check that test. also i think @djarecka tested the original slurm container not the new one.
Yes the errors exist even while limiting pytest to a single test from the list of failed tests, as can be seen here.
I was suggesting removing the `-n` option just based on what i see when running the test with the container on my laptop.
@adi611 - I've just checked the GA reports and I see that there is `-n auto` in the pytest command
I ran it separately here. But I should update the pytest command in the current PR to check if it runs fine without the `-n` option.
ok, it looks like for GA, the option doesn't really make any difference...
Do you also see the same error when running on your laptop?
Yes, I tried it on my laptop and I get the same error `RuntimeError: Could not extract job ID` for failed tests.
@adi611 - have you tried debugging that specific error on your laptop by running that single test by hand inside the instance?
> @adi611 - have you tried debugging that specific error on your laptop by running that single test by hand inside the instance?
When I try to run the failed tests individually on my laptop, most of them pass except for a few which still give the error `RuntimeError: Could not extract job ID`. From the error logs it seemed to be something related to `asyncio` in Python 3.11, so I tried the same with Python 3.9.16 and no test fails.
@adi611 - for those tests, could you enable more debugging to check what it is in the output that prevents getting the job id? the job id is usually read from the stdout. what does that look like?
I will do that now and let you know.
> @adi611 - for those tests, could you enable more debugging to check what it is in the output that prevents getting the job id? the job id is usually read from the stdout. what does that look like?
The stdout looks something like `Submitted batch job 1745`
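For context, the job ID in that kind of sbatch output is typically pulled out with a regular expression. This is only an illustrative sketch of such parsing (the function name and exact pattern are assumptions, not pydra's actual code in `workers.py`):

```python
import re


def extract_job_id(stdout: str) -> int:
    """Parse a Slurm job ID from sbatch-style stdout.

    Expects output such as 'Submitted batch job 1745'; raises the same
    kind of RuntimeError seen in the failing tests when stdout is empty
    or does not match.
    """
    match = re.search(r"Submitted batch job (\d+)", stdout)
    if match is None:
        raise RuntimeError("Could not extract job ID")
    return int(match.group(1))
```

An empty stdout from the submission step would therefore surface exactly as `RuntimeError: Could not extract job ID`.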
was that the stdout for a task that failed? if so, you could perhaps look at where the exception is being raised and check if there is an intervening check like asking sacct or scontrol that the job has been queued.
The issue seems to be more random, with the two `stdout`s in lines 292 and 341 of `workers.py`. Sometimes the first stdout doesn't return anything, and the exception is then `Could not extract job ID`; otherwise, when the second stdout doesn't return anything, the exception is `Job information not found`; or they both might not return anything. When they both return something, the tests which previously failed actually pass.
> was that the stdout for a task that failed? if so, you could perhaps look at where the exception is being raised and check if there is an intervening check like asking sacct or scontrol that the job has been queued.
I am unable to find such a check
Logs:
```
=========================== short test summary info ============================
FAILED pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_7[slurm] - RuntimeError: Could not extract job ID
========================= 1 failed, 1 warning in 9.04s =========================
```
```
Exception ignored in: <coroutine object SlurmWorker._submit_job at 0x7f9bc66d8ea0>
Traceback (most recent call last):
  File "/pydra/pydra/engine/workers.py", line 314, in _submit_job
  File "/pydra/pydra/engine/workers.py", line 336, in _poll_job
  File "/pydra/pydra/engine/helpers.py", line 331, in read_and_display_async
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/subprocess.py", line 218, in create_subprocess_exec
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_events.py", line 1688, in subprocess_exec
RuntimeError: coroutine ignored GeneratorExit
Exception ignored in: <coroutine object SlurmWorker._submit_job at 0x7f9bc66d8bc0>
Traceback (most recent call last):
  File "/pydra/pydra/engine/workers.py", line 293, in _submit_job
RuntimeError: coroutine ignored GeneratorExit
Exception ignored in: <function BaseSubprocessTransport.__del__ at 0x7f9bdc8958a0>
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_subprocess.py", line 126, in __del__
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_subprocess.py", line 104, in close
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/unix_events.py", line 558, in close
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/unix_events.py", line 582, in _close
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_events.py", line 761, in call_soon
  File "/root/.pyenv/versions/3.11.1/lib/python3.11/asyncio/base_events.py", line 519, in _check_closed
RuntimeError: Event loop is closed
```
(The line numbers for `workers.py` may not exactly match the remote due to my local changes.)
The same test:
```
============================= test session starts ==============================
platform linux -- Python 3.11.1, pytest-7.4.2, pluggy-1.3.0 -- /root/.pyenv/versions/3.11.1/bin/python3.11
cachedir: .pytest_cache
rootdir: /pydra
plugins: rerunfailures-12.0, cov-4.1.0, forked-1.6.0, timeout-2.1.0, env-1.0.1, xdist-1.34.0
collecting ... collected 1 item

pydra/pydra/engine/tests/test_shelltask.py::test_shell_cmd_7[slurm] PASSED

=============================== warnings summary ===============================
pydra/engine/tests/test_shelltask.py::test_shell_cmd_7[slurm]
  /pydra/pydra/engine/helpers.py:469: DeprecationWarning: There is no current event loop
    loop = asyncio.get_event_loop()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
---------- coverage: platform linux, python 3.11.1-final-0 -----------
Coverage XML written to file /pydra/cov.xml
======================== 1 passed, 1 warning in 10.45s =========================
```
instead of just running it, you can run the test with the pdb option; you can also enable debug logging around the relevant parts of the code. that may give you more insight into the state of the system. i suspect this is a resource contention issue with the slurm database, and a couple of retries could help, or looking at stderr and stdout.
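The retry idea could be sketched as a bounded loop around the submission step. This is hypothetical illustrative code, not pydra's: `submit_fn` stands in for the real sbatch call, and the retry count and delay are arbitrary:

```python
import re
import time


def submit_with_retries(submit_fn, max_retries=3, delay=1.0):
    """Retry a Slurm submission whose stdout intermittently comes back empty.

    submit_fn is assumed to return sbatch-style stdout such as
    'Submitted batch job 1745'. This helper is only a sketch of the
    suggested mitigation, not pydra code.
    """
    last_stdout = ""
    for _ in range(max_retries):
        last_stdout = submit_fn()
        match = re.search(r"Submitted batch job (\d+)", last_stdout)
        if match:
            return int(match.group(1))
        time.sleep(delay)  # give slurmdbd/slurmctld time to settle
    raise RuntimeError(f"Could not extract job ID from: {last_stdout!r}")
```

Under a resource-contention hypothesis, a second attempt after a short delay would succeed whenever the database was only momentarily unresponsive.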
@adi611 - I've noticed that some tests do not use pytest's fixture `tmp_path` as `cache_dir` (see for `cache_dir=tmp_path` in most of the tests), this can sometimes lead to issues. Could you fix the tests and see if that helps?
> @adi611 - I've noticed that some tests do not use pytest's fixture `tmp_path` as `cache_dir` (see for `cache_dir=tmp_path` in most of the tests), this can sometimes lead to issues. Could you fix the tests and see if that helps?
I checked, and many of the failed tests, like `test_node_task.py::test_task_state_2` and `test_shelltask.py::test_shell_cmd_7`, already have `cache_dir=tmp_path`.
Currently I am unable to reproduce the issue for single tests.
This seems to be a Python 3.11.1-specific issue; should I try a newer 3.11 version like 3.11.5 and see if it works? I have seen discussions on cpython about similar issues and maybe they have rolled out a fix for it.
> This seems to be a python 3.11.1 specific issue, should I try the newer version for 3.11 like 3.11.5 and see if it works? I have seen discussions on cpython with similar issues and maybe they have rolled out a fix for it.

If you're testing on the current Python branch, it is best to try out the latest published version first and then bisect with previous versions if you notice a regression.
Could you reference the exact issues you think may be of interest on cpython?
> This seems to be a python 3.11.1 specific issue, should I try the newer version for 3.11 like 3.11.5 and see if it works? I have seen discussions on cpython with similar issues and maybe they have rolled out a fix for it.
>
> If you're testing on the current Python branch, it is best to try out the latest published version first and then bisect with previous versions if you notice a regression.
> Could you reference the exact issues you think may be of interest on cpython?
This is one such issue
thanks for tracking this! yes, please check for 3.11.5!
it looks like it works for 3.11.5! :) just remove 3.11.1, and we will merge it. great job!
is there a way to exclude a series of python dependencies in pydra's python config? we should add that to the PR so we know why we did this.
> it looks like it works for 3.11.5! :) just remove 3.11.1, and we will merge it. great job!
Thanks!
@adi611 - could you please exclude the python version in `pyproject.toml` as @satra suggested
I added `!=3.11.1` to `requires-python` in `pyproject.toml`, which specifies that 3.11.1 should be excluded from the list of acceptable python versions.
The Slurm workflow for 3.11.5 is failing at the `Display previous jobs with sacct` step:

```
Run echo "Allowing ports/daemons time to start" && sleep 10
Allowing ports/daemons time to start
This cluster 'linux' doesn't exist.
Contact your admin to add it to accounting.
Error: Process completed with exit code 1.
```

Is it possible this is due to the `sleep` time of 10 seconds not being enough?
I've just restarted and it seems to work; if we have this issue again, we can increase the time.
@adi611 - I've realized that your name is not in the zenodo file, please open a PR if you want your name to be included!
> @adi611 - I've realized that your name is not in the zenodo file, please open a PR if you want your name to be included!
Thanks, I'll do that, but I need some help since I've never done it before. Is there a preferred order for including my name? Also, what should I specify as my affiliation?
we should think about the order; for now we don't use any rule except that Satra is at the end, so you can put your name before him. It's up to you what you want to use as your affiliation: you could use your university or you can leave it empty.
@adi611 - just a quick thing. can you post the slurm dockerfile somewhere?
Types of changes

Summary

Update `DOCKER_IMAGE` tag in `testslurm.yml` from `latest` to `21.08.6`

Checklist