radical-cybertools / radical.entk

The RADICAL Ensemble Toolkit
https://radical-cybertools.github.io/entk/index.html

EnTK not waiting for stage tasks to finish #599

Closed: AymenFJA closed this issue 2 years ago

AymenFJA commented 2 years ago

I am reporting unusual behavior in EnTK that I observed while running a workflow with a synthetic workload (sleep 5) for a new use case. The program exits cleanly and reports that the tasks in each stage of the pipeline are DONE. The profiles show otherwise: EnTK marks the tasks as done instantly, without waiting for them to finish. The task sandboxes are attached below.
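One way to make the symptom concrete is to compare each task's measured runtime with the runtime the workload should take (5 seconds for `sleep 5`). The following is only a minimal sketch: `TaskTiming`, `find_short_tasks`, and `stages_overlap` are hypothetical helpers, not part of EnTK or RADICAL-Pilot, and the timestamps are invented for illustration rather than taken from the attached profiles.

```python
from dataclasses import dataclass


@dataclass
class TaskTiming:
    """Start/stop times in seconds for one task (hypothetical container)."""
    name: str
    start: float
    stop: float

    @property
    def runtime(self) -> float:
        return self.stop - self.start


def find_short_tasks(timings, expected_runtime, tolerance=0.5):
    """Return names of tasks that ran for less than expected_runtime - tolerance."""
    threshold = expected_runtime - tolerance
    return [t.name for t in timings if t.runtime < threshold]


def stages_overlap(first, second):
    """True if `second` started before `first` stopped (stage order not enforced)."""
    return second.start < first.stop


# Invented timestamps for illustration only.
timings = [
    TaskTiming('my.first.task', start=0.0, stop=0.1),  # finishes "instantly"
    TaskTiming('my.sec.task',   start=0.2, stop=5.3),  # runs for about 5 s
]

short = find_short_tasks(timings, expected_runtime=5.0)
print(short)
```

With real data, the start and stop values would have to be extracted from the task profiles or sandboxes; how to do that depends on the profile format and is left out here.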

  python               : /home/afa64/miniconda3/envs/conda_rnaseq/bin/python3
  pythonpath           :
  version              : 3.8.8
  virtualenv           : conda_rnaseq

  radical.entk         : 1.11.0-v1.11.0-7-g3b668d4@devel
  radical.gtod         : 1.6.7
  radical.pilot        : 1.10.0-scalems-stable-114-gf66a3d4@devel
  radical.saga         : 1.8.0
  radical.utils        : 1.9.0
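
import os

from radical.entk import Pipeline, Stage, Task, AppManager

# RabbitMQ connection settings. These were not defined in the snippet as posted;
# reading them from the environment is an assumption.
hostname = os.environ.get('RMQ_HOSTNAME', 'localhost')
port     = int(os.environ.get('RMQ_PORT', 5672))
username = os.environ.get('RMQ_USERNAME')
password = os.environ.get('RMQ_PASSWORD')

if __name__ == '__main__':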

    # Create a Pipeline object
    p = Pipeline()

    # Create a Stage object
    s = Stage()

    # Create a Task object
    t = Task()
    t.name = 'my.first.task'        # Assign a name to the task (optional, do not use ',' or '_')
    t.executable = 'sleep 5'   # Assign executable to the task

    # Add Task to the Stage
    s.add_tasks(t)

    s1 = Stage()

    # Create a Task object
    t1 = Task()
    t1.name = 'my.sec.task'        # Assign a name to the task (optional, do not use ',' or '_')
    t1.executable = 'sleep 5'   # Assign executable to the task

    # Add Task to the Stage
    s1.add_tasks(t1)

    # Add Stage to the Pipeline
    p.add_stages(s)
    p.add_stages(s1)

    # Create the Application Manager
    appman = AppManager(hostname=hostname, port=port, username=username,
             password=password)

    # Create a dictionary describing the resource request. Three keys are
    # mandatory: resource, walltime, and cpus.
    # Set resource to 'local.localhost' to execute locally.
    res_dict = {'resource'      : "rutgers.amarel",
                'exit_on_error' : True,
                'access_schema' : "ssh",
                'walltime'      : 30,
                'queue'         : "XXXXX",
                'cpus'          : 16} # 1 node on Amarel with 256 GB of memory
                                      # should be enough to run the pipeline
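    # Assign resource request description to the Application Manager
    appman.resource_desc = res_dict

    # Assign the workflow and run it. These two steps are missing from the
    # snippet as posted and are assumed from standard EnTK usage.
    appman.workflow = set([p])
    appman.run()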

task.zip

AymenFJA commented 2 years ago

This issue is related to RADICAL-Pilot and is fixed with the latest PR mentioned above. Closing for now.