vivek-bala / radical.entk

MIT License
1 stars 1 forks source link

tmgr (task manager?) in app manager incorrectly terminating? #30

Closed SrinivasMushnoori closed 6 years ago

SrinivasMushnoori commented 6 years ago

Hi,

My input script is as follows:

#!/usr/bin/env python

from radical.entk import Pipeline, Stage, Task, AppManager, ResourceManager
import os

## Uses the Pipeline of Ensembles to implement Synchronous Replica Exchange.
## There are 4 GROMACS replicas that run and exchange configurations as follows: 1 and 4, 2 and 3.
## Exchange scheme is currently hard-coded. To implement replica exchange, an Exchange method must be instantiated as a stage between two MD stages.
## This Exchange Method may be pulled from the original RepEx implementation as-is or with little modification....if we're lucky. 
## But of course, Murphy's Law exists.  

# ------------------------------------------------------------------------------
# Set default verbosity

if os.environ.get('RADICAL_ENTK_VERBOSE') == None:
    os.environ['RADICAL_ENTK_VERBOSE'] = 'INFO'

#  Hard code the old defines/state names

if os.environ.get('RP_ENABLE_OLD_DEFINES') == None:
    os.environ['RP_ENABLE_OLD_DEFINES'] = 'True'

if os.environ.get('RADICAL_PILOT_PROFILE') == None:
    os.environ['RADICAL_PILOT_PROFILE'] = 'True'

if os.environ.get('export RADICAL_PILOT_DBURL') == None:
    os.environ['export RADICAL_PILOT_DBURL'] = "mongodb://138.201.86.166:27017/ee_exp_4c"

if __name__ == '__main__':

    # Create a Pipeline object
    p = Pipeline()

    ##########----------###########

    ###Stage1=Simulation. Stage 2=Hardcoded copy followed by simulation.

    # Create stage.

    s1 = Stage()
    s1_task_uids = []
    s2_task_uids = []
    for cnt in range(4):

        # Create a Task object
        t1 = Task() ##GROMPP
        t1.executable = ['/usr/local/packages/gromacs/5.1.4/INTEL-140-MVAPICH2-2.0/bin/gmx_mpi_d']  #MD Engine  
        t1.upload_input_data = ['in.gro', 'in.top', 'FNF.itp', 'martini_v2.2.itp', 'in.mdp'] 
        t1.pre_exec = ['module load gromacs', '/usr/local/packages/gromacs/5.1.4/INTEL-140-MVAPICH2-2.0/bin/gmx_mpi_d grompp -f in.mdp -c in.gro -o in.tpr -p in.top'] 
        t1.arguments = ['mdrun', '-s', 'in.tpr', '-deffnm', 'out']
        t1.cores = 5

        # Add the Task to the Stage
        s1.add_tasks(t1)
        s1_task_uids.append(t1.uid)

    # Add Stage to the Pipeline
    p.add_stages(s1)

        # Create another Stage object to hold checksum tasks
    s2 = Stage() #HARD-CODED EXCHANGE FOLLOWED BY MD

    # Create a Task object
    t2 = Task()
    t2.executable = ['/usr/local/packages/gromacs/5.1.4/INTEL-140-MVAPICH2-2.0/bin/gmx_mpi_d']  #MD Engine 

    # exchange happens here

    for n0 in range(1, 4):
        t2.copy_input_data += ['$Pipline_%s_Stage_%s_Task_%s/out.gro'%(p.uid, s1.uid, s1_task_uids[n0])>'in.gro', '$Pipline_%s_Stage_%s_Task_%s/in.top'%(p.uid, s1.uid, s1_task_uids[n0]),  '$Pipline_%s_Stage_%s_Task_%s/FF.itp'%(p.uid, s1.uid, s1_task_uids[n0]),  '$Pipline_%s_Stage_%s_Task_%s/martini_v2.2.itp'%(p.uid, s1.uid, s1_task_uids[n0]),  '$Pipline_%s_Stage_%s_Task_%s/in.mdp'%(p.uid, s1.uid, s1_task_uids[n0])]
        t2.pre_exec = ['module load gromacs', '/usr/local/packages/gromacs/5.1.4/INTEL-140-MVAPICH2-2.0/bin/gmx_mpi_d grompp -f in.mdp -c in.gro -o in.tpr -p in.top']
        t2.arguments = ['mdrun', '-s', 'in.tpr', '-deffnm', 'out']
        t2.cores = 5

        s2.add_tasks(t2)
        s2_task_uids.append(t2.uid)

    # Add Stage to the Pipeline
    p.add_stages(s2)

    # Create a dictionary describe four mandatory keys:
    # resource, walltime, cores and project
    # resource is 'local.localhost' to execute locally
    res_dict = {

            #'resource': 'local.localhost',
            'resource': 'xsede.supermic',
            'walltime': 10,
            'cores': 20,
            'access_schema': 'gsissh',
            'queue': 'workq',
            'project': 'TG-MCB090174',
    }

    # Create Resource Manager object with the above resource description
    rman = ResourceManager(res_dict)

    # Create Application Manager
    appman = AppManager()

    # Assign resource manager to the Application Manager
    appman.resource_manager = rman

    # Assign the workflow as a set of Pipelines to the Application Manager
    appman.assign_workflow(set([p]))

    # Run the Application Manager
    appman.run()

The last several lines of the terminal output are as follows:

2017-10-09 02:27:19,121: radical.entk.wfprocessor: wfprocessor                     : enqueue-thread : INFO    : Transition of radical.entk.stage.0001 to new state SCHEDULING successful
2017-10-09 02:27:19,123: radical.entk.appmanager: MainProcess                     : synchronizer-thread: INFO    : Received radical.entk.task.0004 with state SCHEDULING
Syncing task radical.entk.task.0004 with state SCHEDULING
Synced task radical.entk.task.0004 with state SCHEDULING
2017-10-09 02:27:19,124: radical.entk.wfprocessor: wfprocessor                     : enqueue-thread : INFO    : Transition of radical.entk.task.0004 to new state SCHEDULING successful
2017-10-09 02:27:19,126: radical.entk.appmanager: MainProcess                     : synchronizer-thread: INFO    : Received radical.entk.task.0004 with state SCHEDULED
Syncing task radical.entk.task.0004 with state SCHEDULED
Synced task radical.entk.task.0004 with state SCHEDULED
2017-10-09 02:27:19,127: radical.entk.wfprocessor: wfprocessor                     : enqueue-thread : INFO    : Transition of radical.entk.task.0004 to new state SCHEDULED successful
2017-10-09 02:27:19,127: radical.entk.appmanager: MainProcess                     : synchronizer-thread: INFO    : Received radical.entk.task.0004 with state SUBMITTING
Syncing task radical.entk.task.0004 with state SUBMITTING
Synced task radical.entk.task.0004 with state SUBMITTING
2017-10-09 02:27:19,128: radical.entk.task_manager: task-manager                    : MainThread     : INFO    : Transition of radical.entk.task.0004 to new state SUBMITTING successful
2017-10-09 02:27:19,128: radical.entk.appmanager: MainProcess                     : synchronizer-thread: INFO    : Received radical.entk.stage.0001 with state SCHEDULED
2017-10-09 02:27:19,128: radical.entk.appmanager: MainProcess                     : synchronizer-thread: INFO    : Found parent pipeline: radical.entk.pipeline.0000
2017-10-09 02:27:19,129: radical.entk.task_processor: task-manager                    : MainThread     : ERROR   : Failed to resolve placeholder False, error: Expected (base) type(s) <type 'str'>, but got <type 'bool'>.
2017-10-09 02:27:19,129: radical.entk.task_processor: task-manager                    : MainThread     : ERROR   : Failed to get input list of files from task, error: Expected (base) type(s) <type 'str'>, but got <type 'bool'>.
2017-10-09 02:27:19,129: radical.entk.task_processor: task-manager                    : MainThread     : ERROR   : CU creation failed, error: Expected (base) type(s) <type 'str'>, but got <type 'bool'>.
2017-10-09 02:27:19,129: radical.entk.task_manager: task-manager                    : MainThread     : ERROR   : Task radical.entk.task.0004 submission failed, error: Expected (base) type(s) <type 'str'>, but got <type 'bool'>.
2017-10-09 02:27:19,129: radical.entk.appmanager: MainProcess                     : synchronizer-thread: INFO    : Received radical.entk.task.0004 with state SUBMITTING
2017-10-09 02:27:19,129: radical.entk.wfprocessor: wfprocessor                     : enqueue-thread : INFO    : Transition of radical.entk.stage.0001 to new state SCHEDULED successful
State transition done
2017-10-09 02:27:27,449: radical.entk.task_manager: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2017-10-09 02:27:27,450: radical.entk.task_manager: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2017-10-09 02:27:37,456: radical.entk.appmanager: MainProcess                     : MainThread     : INFO    : Terminating tmgr process from AppManager

Beyond this, it simply gets stuck. Unsure of how to proceed.

Thanks!

SrinivasMushnoori commented 6 years ago

Reproduced for the following script:

#!/usr/bin/env python

from radical.entk import Pipeline, Stage, Task, AppManager, ResourceManager
import os

# ------------------------------------------------------------------------------
# Set default verbosity

if os.environ.get('RADICAL_ENTK_VERBOSE') == None:
    os.environ['RADICAL_ENTK_VERBOSE'] = 'INFO'

#  Hard code the old defines/state names

if os.environ.get('RP_ENABLE_OLD_DEFINES') == None:
    os.environ['RP_ENABLE_OLD_DEFINES'] = 'True'

if os.environ.get('RADICAL_PILOT_PROFILE') == None:
    os.environ['RADICAL_PILOT_PROFILE'] = 'True'

if os.environ.get('export RADICAL_PILOT_DBURL') == None:
    os.environ['export RADICAL_PILOT_DBURL'] = "mongodb://138.201.86.166:27017/ee_exp_4c"

if __name__ == '__main__':

    # Create a Pipeline object
    p = Pipeline()
    # Bookkeeping
    stage_uids = []
    task_uids = []

    for N_Stg in range(10):
        s =  Stage() ## initialization
        if N_Stg == 0:
            for N_Tsk in range(4):
                t = Task()
                t.executable = ['/usr/local/packages/gromacs/5.1.4/INTEL-140-MVAPICH2-2.0/bin/gmx_mpi_d']  #MD Engine  
                t.upload_input_data = ['in.gro', 'in.top', 'FNF.itp', 'martini_v2.2.itp', 'in.mdp'] 
                t.pre_exec = ['module load gromacs', '/usr/local/packages/gromacs/5.1.4/INTEL-140-MVAPICH2-2.0/bin/gmx_mpi_d grompp -f in.mdp -c in.gro -o in.tpr -p in.top'] 
                t.arguments = ['mdrun', '-s', 'in.tpr', '-deffnm', 'out']
                t.cores = 5
                s.add_tasks(t)
                task_uids.append(t.uid)
            p.add_stages(s)
            stage_uids.append(s.uid) 

        else:

            for n0 in range(4):
                t = Task()
                t.executable = ['/usr/local/packages/gromacs/5.1.4/INTEL-140-MVAPICH2-2.0/bin/gmx_mpi_d']  #MD Engine  
                t.copy_input_data += ['$Pipline_%s_Stage_%s_Task_%s/out.gro'%(p.uid, stage_uids[N_Stg-1], task_uids[n0])>'in.gro', '$Pipline_%s_Stage_%s_Task_%s/in.top'%(p.uid, stage_uids[N_Stg-1], task_uids[n0]),  '$Pipline_%s_Stage_%s_Task_%s/FF.itp'%(p.uid, stage_uids[N_Stg-1], task_uids[n0]),  '$Pipline_%s_Stage_%s_Task_%s/martini_v2.2.itp'%(p.uid, stage_uids[N_Stg-1], task_uids[n0]),  '$Pipline_%s_Stage_%s_Task_%s/in.mdp'%(p.uid, stage_uids[N_Stg-1], task_uids[n0])]
                t.pre_exec = ['module load gromacs', '/usr/local/packages/gromacs/5.1.4/INTEL-140-MVAPICH2-2.0/bin/gmx_mpi_d grompp -f in.mdp -c in.gro -o in.tpr -p in.top'] 
                t.arguments = ['mdrun', '-s', 'in.tpr', '-deffnm', 'out']
                t.cores = 5
                s.add_tasks(t)
                task_uids.append(t.uid)
            p.add_stages(s)
            stage_uids.append(s)          

    # Create a dictionary describe four mandatory keys:
    # resource, walltime, cores and project
    # resource is 'local.localhost' to execute locally
    res_dict = {

            #'resource': 'local.localhost',
            'resource': 'xsede.supermic',
            'walltime': 30,
            'cores': 20,
            'access_schema': 'gsissh',
            'queue': 'workq',
            'project': 'TG-MCB090174',
    }

    # Create Resource Manager object with the above resource description
    rman = ResourceManager(res_dict)

    # Create Application Manager
    appman = AppManager()

    # Assign resource manager to the Application Manager
    appman.resource_manager = rman

    # Assign the workflow as a set of Pipelines to the Application Manager
    appman.assign_workflow(set([p]))

    # Run the Application Manager
    appman.run()

Output:

2017-10-09 02:53:20,454: radical.entk.appmanager: MainProcess                     : synchronizer-thread: INFO    : Received radical.entk.task.0007 with state SCHEDULED
Syncing task radical.entk.task.0007 with state SCHEDULED
Synced task radical.entk.task.0007 with state SCHEDULED
2017-10-09 02:53:20,455: radical.entk.wfprocessor: wfprocessor                     : enqueue-thread : INFO    : Transition of radical.entk.task.0007 to new state SCHEDULED successful
2017-10-09 02:53:20,457: radical.entk.appmanager: MainProcess                     : synchronizer-thread: INFO    : Received radical.entk.task.0004 with state SCHEDULING
Syncing task radical.entk.task.0004 with state SCHEDULING
Synced task radical.entk.task.0004 with state SCHEDULING
2017-10-09 02:53:20,457: radical.entk.wfprocessor: wfprocessor                     : enqueue-thread : INFO    : Transition of radical.entk.task.0004 to new state SCHEDULING successful
2017-10-09 02:53:20,459: radical.entk.appmanager: MainProcess                     : synchronizer-thread: INFO    : Received radical.entk.task.0004 with state SCHEDULED
Syncing task radical.entk.task.0004 with state SCHEDULED
Synced task radical.entk.task.0004 with state SCHEDULED
2017-10-09 02:53:20,460: radical.entk.wfprocessor: wfprocessor                     : enqueue-thread : INFO    : Transition of radical.entk.task.0004 to new state SCHEDULED successful
2017-10-09 02:53:20,461: radical.entk.appmanager: MainProcess                     : synchronizer-thread: INFO    : Received radical.entk.task.0005 with state SCHEDULING
Syncing task radical.entk.task.0005 with state SCHEDULING
Synced task radical.entk.task.0005 with state SCHEDULING
2017-10-09 02:53:20,462: radical.entk.wfprocessor: wfprocessor                     : enqueue-thread : INFO    : Transition of radical.entk.task.0005 to new state SCHEDULING successful
2017-10-09 02:53:20,464: radical.entk.appmanager: MainProcess                     : synchronizer-thread: INFO    : Received radical.entk.task.0005 with state SCHEDULED
Syncing task radical.entk.task.0005 with state SCHEDULED
Synced task radical.entk.task.0005 with state SCHEDULED
2017-10-09 02:53:20,465: radical.entk.wfprocessor: wfprocessor                     : enqueue-thread : INFO    : Transition of radical.entk.task.0005 to new state SCHEDULED successful
2017-10-09 02:53:20,467: radical.entk.appmanager: MainProcess                     : synchronizer-thread: INFO    : Received radical.entk.stage.0001 with state SCHEDULED
2017-10-09 02:53:20,467: radical.entk.appmanager: MainProcess                     : synchronizer-thread: INFO    : Found parent pipeline: radical.entk.pipeline.0000
2017-10-09 02:53:20,467: radical.entk.wfprocessor: wfprocessor                     : enqueue-thread : INFO    : Transition of radical.entk.stage.0001 to new state SCHEDULED successful
State transition done
2017-10-09 02:53:20,757: radical.entk.task_manager: MainProcess                     : heartbeat      : INFO    : Received heartbeat response
2017-10-09 02:53:20,758: radical.entk.task_manager: MainProcess                     : heartbeat      : INFO    : Sent heartbeat request
2017-10-09 02:53:30,769: radical.entk.appmanager: MainProcess                     : MainThread     : INFO    : Terminating tmgr process from AppManager

After which it simply stops moving.

SrinivasMushnoori commented 6 years ago

This issue is resolved. The cause was incorrect referencing of UID's in the data staging step.