I am checking this with a VM... Too bad that I cannot find a single-machine Slurm VM.
Oh, I'm not sure how reproducible it is. There are indeed many failed jobs in this batch of submissions, but of course not all jobs fail. I suspect some of the jobs become "zombies" that hold their slots; that is why I was asking: if you want some status check reports from my end, just say so and I'll post them here.
OK. I am running task-spooler on an Ubuntu VM, with the number of concurrent jobs set to 20. Everything seems to be OK, so I am now working on creating a few failed jobs.
Great, while you are at it, maybe change `trunk_size` to greater than 1 to reproduce #1147. In my case I have 34K jobs, with trunk size 80. Each trunk has a few failed jobs (some edge-case patterns my code did not capture). Eventually it jammed up, and now I see only one task in my queue, but it is still running and submitting ...
Without `trunk_size`, everything seems to be OK. There are failed tasks, and the number of running tasks stays almost constant at capacity.
Everything seems to be OK with the following script:

```
[1]
input: for_each=dict(i=range(5000))
task: walltime='10m', trunk_size=100
print(f'this is task {i}')
import time
import random
time.sleep(random.random()*5)
fail_if(random.random() < 0.1)
```
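For the record, this is roughly how I invoke it, assuming the script is saved as `test.sos` (the filename is my placeholder) and `vm` is the queue name from the host configuration below:

```bash
# Send tasks from the workflow to the 'vm' queue defined in the SoS host configuration
sos run test.sos -q vm
```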
On a VM with
```yaml
vm:
  address: 192.168.47.129
  paths:
    home: /home/bpeng1
  description: task spooler on a single machine
  queue_type: pbs
  status_check_interval: 5
  job_template: |
    #!/bin/bash
    cd {cur_dir}
    sos execute {task} -v {verbosity} -s {sig_mode} {'--dryrun' if run_mode == 'dryrun' else ''}
  max_running_jobs: 20
  submit_cmd: tsp -L {task} sh {job_file}
  status_cmd: tsp -s {job_id}
  kill_cmd: tsp -r {job_id}
```
and `TS_SLOTS` set to 20.
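In case it helps with reproduction, a minimal sketch of how I understand the task-spooler slot count can be set (assuming the `tsp` binary from the Ubuntu `task-spooler` package):

```bash
# Set the slot count for a ts server started from this shell
export TS_SLOTS=20
# Or adjust it on an already-running server
tsp -S 20
# List queued and running jobs to verify
tsp
```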
Perhaps the latest master works on your cluster as well.
Do you see the issue in #1147? At least that should show up ...
How about making the task an external script (`python:`) and using `raise ValueError`, rather than the current SoS statements?
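Something like the following sketch, assuming a `python:` action with `expand=True` so that `{i}` is interpolated (the exact parameters and failure rate are my guess, not a tested reproduction):

```
[1]
input: for_each=dict(i=range(5000))
task: walltime='10m', trunk_size=100
python: expand=True
    # simulate variable runtimes and a ~10% failure rate inside an external Python script
    import random, time
    time.sleep(random.random() * 5)
    if random.random() < 0.1:
        raise ValueError('simulated failure in task {i}')
```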
Let me kill my current job, upgrade and try running it again.
Just for completeness, my `task` configuration is:

```
task: trunk_workers = 1, trunk_size = 5, walltime = '10m', mem = '5G', cores = 1, tags = f'{step_name}_{_output:bn}'
```
Okay, I am re-running using `-s build`, but got:

```
ERROR: [susie_bhat_1]: not enough values to unpack (expected 3, got 0)
```
Now using `-v3` to check it out. This can take a while (#1146), but I will report back when I have it.
Here we go:
```
INFO: Step susie_bhat_1 (index=146) is ignored with signature constructed
DEBUG: Kill a substep worker. 4 remains.
DEBUG: stop substep worker 11328
INFO: Step susie_bhat_1 (index=147) is ignored with signature constructed
DEBUG: Kill a substep worker. 3 remains.
DEBUG: stop substep worker 11306
INFO: Step susie_bhat_1 (index=145) is ignored with signature constructed
DEBUG: Kill a substep worker. 2 remains.
DEBUG: stop substep worker 9945
INFO: Step susie_bhat_1 (index=143) is ignored with signature constructed
DEBUG: Kill a substep worker. 1 remains.
DEBUG: stop substep worker 11223
INFO: Step susie_bhat_1 (index=144) is ignored with signature constructed
Traceback (most recent call last):
  File "/scratch/midway2/gaow/miniconda3/lib/python3.6/site-packages/sos/__main__.py", line 402, in cmd_run
    executor.run(args.__targets__, mode='dryrun' if args.dryrun else 'run')
  File "/scratch/midway2/gaow/miniconda3/lib/python3.6/site-packages/sos/workflow_executor.py", line 266, in run
    return self.run_as_master(targets=targets, mode=mode)
  File "/scratch/midway2/gaow/miniconda3/lib/python3.6/site-packages/sos/workflow_executor.py", line 1187, in run_as_master
    raise exec_error
sos.workflow_executor.ExecuteError: [susie_bhat_1]: not enough values to unpack (expected 3, got 0)
[susie_bhat]: 1 pending step: susie_bhat_2
ERROR: [susie_bhat_1]: not enough values to unpack (expected 3, got 0)
[susie_bhat]: 1 pending step: susie_bhat_2
```
BTW, it took 4 minutes to "analyze" the 34K jobs.
The error was passed from the worker to the master process, so I cannot tell where it came from... still checking.
This observation was related to several other smaller issues, particularly those related to the `aborted` status. I am not sure the problem can still be reproduced after all these fixes. So far so good, and I've finished my 34K-job analysis. There will surely be more analyses like this on my desk in the future, so I'll close the ticket for now and reopen it if the problem persists.
Previously we seemed to have the issue that there were more jobs in the queue than the allowed maximum. Today the problem seems to be the other way around: SoS is still submitting jobs, a few at a time, but there are constantly only 2 to 3 jobs in my queue. It makes the whole undertaking a lot slower.
This is not what happened at the start of the run. Back then, SoS still used close to the maximum allowed number of slots in the queue; only after a while did it become like this. Now, to get my work done, I'd have to kill the current submission, remove all signatures, and resubmit with `-s build`. I am wondering how SoS checks and decides when to submit more tasks? Maybe this is because there are many failed jobs (or jobs with some undetermined status) filling up the slots? I can imagine SoS will eventually hang when these 3 remaining slots are somehow occupied. So before I kill and purge my jobs, I'm happy to use the current state to help with diagnostics. @BoPeng please let me know what information you'd like me to extract and report.
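For instance, here is the kind of check I can run from my end; a sketch assuming `sos status` accepts a queue via `-q` as in the documentation, with `midway` as my queue name and `<task_id>` a placeholder:

```bash
# Summarize all tasks known to the 'midway' queue
sos status -q midway
# Verbose details for one suspect task (placeholder ID)
sos status <task_id> -q midway -v3
```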