vatlab / sos

SoS workflow system for daily data analysis
http://vatlab.github.io/sos-docs
BSD 3-Clause "New" or "Revised" License

Maximum running jobs not honored on cluster #1149

Closed: gaow closed this issue 5 years ago

gaow commented 5 years ago

Previously we seemed to have the issue that the number of jobs in the queue exceeded the allowed maximum. Today the problem seems to be the other way around: SoS keeps submitting jobs like this:

INFO: M80_ad136f1872ea7cd1 submitted to midway2 with job id 55933675
INFO: M80_68f63a036f9ccfbd submitted to midway2 with job id 55933719
INFO: M80_f8309250aece0c9d submitted to midway2 with job id 55933727
INFO: M80_bb1d60e6337803f4 submitted to midway2 with job id 55933729
INFO: M80_534d82e405ce6605 submitted to midway2 with job id 55933735
INFO: M80_c3581ff31e05cf5f submitted to midway2 with job id 55933738
...

a few at a time. But my queue looks like:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          55933727   broadwl M80_f830     gaow  R       3:19      1 midway2-0025
          55933738   broadwl M80_c358     gaow  R       0:39      1 midway2-0089
          55933735   broadwl M80_534d     gaow  R       1:31      1 midway2-0015

There are constantly only 2 to 3 jobs in the queue, which makes the whole undertaking a lot slower.

This is not what happened at the start of the run. Back then, SoS used close to the maximum allowed number of slots in the queue; only after a while did it degrade to this. Now, to get my work done, I'd have to kill the current submission, remove all signatures, and resubmit with -s build.

I am wondering how SoS checks the queue and decides when to submit more? Maybe this is because there are many failed jobs (or jobs in some undetermined status) that have filled up the slots? I can imagine SoS will eventually hang once these 3 remaining slots are also occupied. So before I kill and purge my jobs, I'm happy to use the current state to help with some diagnostics. @BoPeng please let me know what information you'd like me to extract and report.
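
To make the hypothesis concrete, here is a rough sketch of the kind of throttling logic I have in mind (purely an illustrative guess in Python, not actual SoS code; get_status, submit, and the status names are made up): if tasks stuck in an unresolved status still count as active, they eat into the max_running_jobs cap forever.

# Purely illustrative guess at the throttling logic -- not actual SoS code.
# get_status() and submit() are hypothetical helpers.

TERMINAL = {'completed', 'failed', 'aborted'}

def submit_more(tasks, max_running_jobs, get_status, submit):
    """Submit pending tasks only while the number of active tasks is below the cap."""
    status = {t: get_status(t) for t in tasks}
    # Anything that is neither finished nor still waiting counts against the cap,
    # so a "zombie" task whose status never resolves holds a slot forever.
    active = sum(1 for s in status.values() if s not in TERMINAL and s != 'pending')
    free = max_running_jobs - active
    for t, s in status.items():
        if free <= 0:
            break
        if s == 'pending':
            submit(t)
            free -= 1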

BoPeng commented 5 years ago

I am checking this with a VM... Too bad I cannot find a single-machine Slurm VM.

gaow commented 5 years ago

Oh, I'm not sure how reproducible it is. There are indeed many failed jobs in this batch of submissions, but of course not all jobs fail. I suspect some of the jobs become "zombies" that hold their slots; that is why I was asking whether you want me to run some status checks on my end, and I'll post the results here.

BoPeng commented 5 years ago

OK. I am running task spooler on an Ubuntu VM with the number of concurrent jobs set to 20. Everything seems to be OK, so now I am working on creating a few failed jobs.

gaow commented 5 years ago

Great. While you are at it, maybe set trunk_size to greater than 1 to reproduce #1147. In my case I have 34K jobs with trunk size 80. Each trunk has a few failed jobs (edge-case patterns my code did not handle). Eventually it jammed up, and now I see only one task in my queue, yet SoS is still running and submitting ...

BoPeng commented 5 years ago

Without trunk_size everything seems to be OK. There are failed tasks, and the number of running tasks stays almost constant at capacity.

BoPeng commented 5 years ago

Everything seems to be OK with the following script:

[1]
input: for_each=dict(i=range(5000))

task:  walltime='10m', trunk_size=100

print(f'this is task {i}')

import time
import random
time.sleep(random.random()*5)
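# randomly fail about 10% of the substeps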
fail_if(random.random() < 0.1)

On a VM with

    vm:
        address: 192.168.47.129
        paths:
            home: /home/bpeng1
        description: task spooler on a single machine
        queue_type: pbs
        status_check_interval: 5
        job_template: |
            #!/bin/bash
            cd {cur_dir}
            sos execute {task} -v {verbosity} -s {sig_mode} {'--dryrun' if run_mode == 'dryrun' else ''}
        max_running_jobs: 20
        submit_cmd: tsp -L {task} sh {job_file}
        status_cmd: tsp -s {job_id}
        kill_cmd: tsp -r {job_id}

with TS_SLOTS set to 20.

Perhaps the latest master works on your cluster as well.

gaow commented 5 years ago

Do you see the issue in #1147? At least that should show up ...

How about making the task body an external python: script and using raise ValueError, rather than the current SoS statements?
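
For example, something along these lines (just a sketch of what I mean, keeping your parameters; the 10% failure rate is arbitrary):

[1]
input: for_each=dict(i=range(5000))

task: walltime='10m', trunk_size=100

python: expand=True
    # external Python script instead of SoS statements;
    # raise an ordinary exception in roughly 10% of substeps
    import random, time
    time.sleep(random.random() * 5)
    if random.random() < 0.1:
        raise ValueError('simulated failure in substep {i}')
    print('this is substep {i}')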

Let me kill my current job, upgrade and try running it again.

gaow commented 5 years ago

Just for completeness, my task configuration is:

task: trunk_workers = 1, trunk_size = 5, walltime = '10m', mem = '5G', cores = 1, tags = f'{step_name}_{_output:bn}'

gaow commented 5 years ago

Okay, I am re-running using -s build, but got

ERROR: [susie_bhat_1]: not enough values to unpack (expected 3, got 0)

Now using -v3 to check it out. It can take a while (#1146), but I will report back when I have it.

gaow commented 5 years ago

Here we go:

INFO: Step susie_bhat_1 (index=146) is ignored with signature constructed
DEBUG: Kill a substep worker. 4 remains.
DEBUG: stop substep worker 11328
INFO: Step susie_bhat_1 (index=147) is ignored with signature constructed
DEBUG: Kill a substep worker. 3 remains.
DEBUG: stop substep worker 11306
INFO: Step susie_bhat_1 (index=145) is ignored with signature constructed
DEBUG: Kill a substep worker. 2 remains.
DEBUG: stop substep worker 9945
INFO: Step susie_bhat_1 (index=143) is ignored with signature constructed
DEBUG: Kill a substep worker. 1 remains.
DEBUG: stop substep worker 11223
INFO: Step susie_bhat_1 (index=144) is ignored with signature constructed
  File "/scratch/midway2/gaow/miniconda3/lib/python3.6/site-packages/sos/__main__.py", line 402, in cmd_run
    executor.run(args.__targets__, mode='dryrun' if args.dryrun else 'run')
Traceback (most recent call last):
  File "/scratch/midway2/gaow/miniconda3/lib/python3.6/site-packages/sos/__main__.py", line 402, in cmd_run
    executor.run(args.__targets__, mode='dryrun' if args.dryrun else 'run')
  File "/scratch/midway2/gaow/miniconda3/lib/python3.6/site-packages/sos/workflow_executor.py", line 266, in run
    return self.run_as_master(targets=targets, mode=mode)
  File "/scratch/midway2/gaow/miniconda3/lib/python3.6/site-packages/sos/workflow_executor.py", line 1187, in run_as_master
    raise exec_error
sos.workflow_executor.ExecuteError: [susie_bhat_1]: not enough values to unpack (expected 3, got 0)
[susie_bhat]: 1 pending step: susie_bhat_2
ERROR: [susie_bhat_1]: not enough values to unpack (expected 3, got 0)
[susie_bhat]: 1 pending step: susie_bhat_2
[MW] 

BTW, it took 4 minutes to "analyze" the 34K jobs.

BoPeng commented 5 years ago

The error was passed from the worker to the master process, so I cannot tell where it came from... still checking.

gaow commented 5 years ago

This observation was related to several other smaller issues, particularly those involving the aborted status. I am not sure the problem can still be reproduced after all these fixes. So far so good, and I have finished my 34K-task analysis. There will surely be more analyses like this on my desk in the future, so I'll close the ticket for now and reopen it if the problem comes back.