ihmeuw-scicomp / jobmon


Handle CLI resume if a downstream node is marked D #65

Open davidshaw-uw opened 10 months ago

davidshaw-uw commented 10 months ago

A slight edge case: the self-service CLI command update_task_status lets a user set an arbitrary task to D or G state. If a task is set to D (i.e. marked "don't run") while it still has an upstream task that is not in D, and the user then invokes a CLI resume, the resume fails with a trace like:

  File "/mnt/share/homes/dhs2018/repos/OneMod/src/onemod/main.py", line 164, in resume_pipeline
    resume_workflow_from_id(
  File "/ihme/homes/dhs2018/miniconda3/envs/onemod/lib/python3.11/site-packages/jobmon/client/status_commands.py", line 638, in resume_workflow_from_id
    swarm.from_workflow_id(workflow_id)
  File "/ihme/homes/dhs2018/miniconda3/envs/onemod/lib/python3.11/site-packages/jobmon/client/swarm/workflow_run.py", line 242, in from_workflow_id
    self.set_downstreams_from_db(chunk_size=edge_chunk_size)
  File "/ihme/homes/dhs2018/miniconda3/envs/onemod/lib/python3.11/site-packages/jobmon/client/swarm/workflow_run.py", line 431, in set_downstreams_from_db
    downstream_task_id = task_node_id_map[downstream_node_id]
                         ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
KeyError: 77687100

The reason is that, for efficiency, the CLI resume command reconstructs the DAG only from nodes that are not in D state, on the assumption that a task in D state cannot have upstreams that still need to run. Under normal operation this holds, since a task can only be marked D once its upstreams are D as well, but the self-service CLI breaks this invariant. The upstream task is then rebuilt while its edge still points at the excluded D node, so the lookup in task_node_id_map raises the KeyError above.
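To make the failure mode concrete, here is a minimal toy reproduction. It is not Jobmon code: the dictionaries, the IDs, and the helper link_downstreams are hypothetical stand-ins for the partial DAG rebuild and the edge lookup described above.

```python
# Toy model of a partial DAG rebuild. All names and IDs here are
# hypothetical; this is not Jobmon's schema or implementation.

# Task status per node id. Node 2 was force-set to D through the CLI
# while its upstream (node 1) has not finished.
node_status = {1: "G", 2: "D", 3: "G"}

# Edges as stored for the full DAG: node id -> downstream node ids.
downstream_edges = {1: [2], 2: [3], 3: []}

# Node id -> task id for every node in the workflow.
node_to_task = {1: 101, 2: 102, 3: 103}

# Partial rebuild: only nodes that are not in D are loaded.
task_node_id_map = {
    node_id: node_to_task[node_id]
    for node_id, status in node_status.items()
    if status != "D"
}


def link_downstreams(task_node_id_map, downstream_edges):
    """Resolve each loaded node's downstream node ids to task ids."""
    links = {}
    for node_id, task_id in task_node_id_map.items():
        # Raises KeyError when a downstream node was excluded from the map.
        links[task_id] = [task_node_id_map[d] for d in downstream_edges[node_id]]
    return links


missing_node_id = None
try:
    link_downstreams(task_node_id_map, downstream_edges)
except KeyError as exc:
    # The key is the node id of the D task that was left out of the rebuild.
    missing_node_id = exc.args[0]
```

Here node 1 is loaded, its edge points at node 2, and node 2 is absent from the map, which mirrors the KeyError in the trace.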

This is not a problem for workflow_args resume (i.e. build the same DAG with the same workflow args), since the entire DAG is built instead of a partial one.

Handle this edge case in the swarm from_workflow_id method.
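One possible shape for the handling, sketched under the assumption that an edge to a node excluded from the partial DAG can simply be ignored because a D task imposes no further work. The function link_downstreams_skipping_done and the sample data are hypothetical and do not reflect the actual from_workflow_id or set_downstreams_from_db code.

```python
# Hypothetical sketch of a tolerant edge lookup; not Jobmon's implementation.


def link_downstreams_skipping_done(task_node_id_map, downstream_edges):
    """Resolve downstream node ids to task ids, skipping excluded nodes.

    task_node_id_map holds only the nodes loaded for the partial DAG
    (those not in D). Edges that point at a node missing from the map
    are dropped instead of raising KeyError.
    """
    links = {}
    for node_id, task_id in task_node_id_map.items():
        links[task_id] = [
            task_node_id_map[downstream_id]
            for downstream_id in downstream_edges.get(node_id, [])
            if downstream_id in task_node_id_map
        ]
    return links


# Node 2 is in D and therefore absent from the partial map.
task_node_id_map = {1: 101, 3: 103}
downstream_edges = {1: [2], 2: [3], 3: []}

links = link_downstreams_skipping_done(task_node_id_map, downstream_edges)
```

In this sketch task 103 no longer waits on anything, because its only upstream is the D task. Whether that is the desired semantics for a manually set D task is a design decision for the real fix.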