saga-project / BigJob

SAGA-based Pilot-Job Implementation for Compute and Data
http://saga-project.github.com/BigJob/

KeyError job_dict["state"] + strange DB Entry #176

Open oleweidner opened 10 years ago

oleweidner commented 10 years ago

This error was reported on the bigjob-users mailing list by Scott Michael:

KeyError in agent:

Traceback (most recent call last):
  File "/N/soft/rhel6/python/2.7.3/lib/python2.7/site-packages/threadpool.py", line 156, in run
    result = request.callable(*request.args, **request.kwds)
  File "/N/soft/rhel6/python/2.7.3/lib/python2.7/site-packages/bigjob/bigjob_agent.py", line 720, in start_new_job_in_thread
    if(job_dict["state"]==str(bigjob.state.Unknown)):
KeyError: 'state'
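If the KeyError is purely a consequence of the corrupted hash shown below, a defensive lookup in the agent would at least keep the worker thread alive. A minimal sketch follows; the helper name is made up for illustration, and whether falling back to Unknown is the right recovery is a maintainer call:

# Sketch only: tolerate CU hashes that arrive without a 'state' field,
# instead of raising KeyError inside the agent thread.
def get_cu_state(job_dict, unknown_state="Unknown"):
    # dict.get() returns the fallback value when the key is missing
    return job_dict.get("state", unknown_state)

# The corrupted entry reported below has no 'state' key and would
# previously have crashed start_new_job_in_thread:
broken = {'a': 'd', 'c': '-'}
print get_cu_state(broken)   # -> 'Unknown'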

Strange CU description in Redis:

Instead of the expected entry:

{'Executable': '/N/dc2/projects/BDBS/cijohnson//dorunner.sh',
 'WorkingDirectory': '/N/dc2/projects/BDBS/cijohnson//./lp4.0058bm6.8292',
 'NumberOfProcesses': '1',
 'start_time': '1390836474.31',
 'Environment': "['TASK_NO=4687']",
 'state': 'Unknown',
 'Arguments': "['/N/dc2/projects/BDBS/cijohnson/./lp4.0058bm6.8292 tu1783717_58.in\\n']",
 'Error': 'tu1783717_58.err',
 'Output': 'tu1783717_58.out',
 'job-id': 'sj-976b4976-8767-11e3-adde-001fc6d94bec',
 'SPMDVariation': 'single'}

this was stored:

{'a': 'd', 'c': '-', 'b': 'e', 'd': 'e', 'f': 'c', '-': '0', '3': '-',
 '1': 'e', '0': '1', 's': 'j', '7': 'f', '6': 'd', '9': '4', '8': '7'}

Note that every key and value in the broken entry is a single character, and the character set ('s', 'j', '-', hex digits) matches that of a BigJob id such as 'sj-976b4976-...'. This suggests that at some point a plain id string, rather than a dictionary, was written into the Redis hash.
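To illustrate the suspected failure mode (this is only a hypothesis; the exact string that produced the broken entry is unknown), pairing up the characters of a job id two at a time yields a hash of exactly this shape:

# Illustration only: pairing the characters of a job-id-like string
# produces single-character key/value pairs of the same shape as the
# broken entry above.
job_id = 'sj-976b4976-8767-11e3-adde-001fc6d94bec'  # id taken from the good entry
broken_shape = dict(zip(job_id[0::2], job_id[1::2]))
print broken_shape
# -> {'s': 'j', '-': '0', '3': '-', 'a': 'd', '0': '1', ...}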
oleweidner commented 10 years ago

The complete job script:

import os
import commands
import sys
import pilot
import traceback

""" DESCRIPTION: Tutorial 1: A Simple Workload
Note: the user must edit the USER VARIABLES section;
this example will not run if these values are not set.
"""

# ---------------- BEGIN REQUIRED PILOT SETUP -----------------

# Distributed Coordination Service - Redis server and password
REDIS_PWD   = "ILikeBigJob_wITH-REdIS"  # Fill in the password to your redis server
REDIS_URL   = "redis://%s@gw68.quarry.iu.teragrid.org:6379" % REDIS_PWD

# Resource Information
HOSTNAME     = "localhost"  # Remote Resource URL
USER_NAME    = "scamicha"   # Username on the remote resource
SAGA_ADAPTOR = "pbs"        # Name of the SAGA adaptor, e.g. fork, sge, pbs, slurm, etc.
# NOTE: See the complete list of BigJob-supported SAGA adaptors at:
# http://saga-project.github.io/BigJob/sphinxdoc/tutorial/table.html

# Fill in queue and allocation for the given resource
# Note: Set fields to "None" if not applicable
QUEUE        = "batch"  # Queue to submit the Pilot-Job to
PROJECT      = "None"   # Project / allocation / account to charge

WALLTIME     = 1440  # Maximum runtime (minutes) for the Pilot-Job

WORKDIR      = os.getenv("HOME") + "/agent"  # Path of the resource working directory
# This is the directory where BigJob will store its output and error files

SPMD_VARIATION = "None"  # Specify the WAYNESS of SGE clusters ONLY, e.g. '12way'

PROCESSES_PER_NODE = 8  # Valid on PBS clusters ONLY - the number of processors
                        # per node; one core is treated as one processor on PBS,
                        # e.g. a node with 8 cores has a maximum ppn=8

PILOT_SIZE = 128  # Number of cores required for the Pilot-Job

# Job Information

datadir = "/N/dc2/projects/BDBS/cijohnson/"
files = []
os.chdir(datadir)
input_file = open('files.todo.01', 'r')
for line in input_file:
    files.append(line.strip())  # strip the trailing newline; without this it
                                # leaks into the task arguments (cf. the literal
                                # '\n' in the Redis entry above)
input_file.close()
NUMBER_JOBS = len(files)

# Continue to USER DEFINED TASK DESCRIPTION to add
# the required information about the individual tasks.

# ---------------- END REQUIRED PILOT SETUP -----------------


def main():
    pilotjob = None
    pilot_compute_service = None
    try:
        # this describes the parameters and requirements for our pilot job
        pilot_description = pilot.PilotComputeDescription()
        pilot_description.service_url = "%s://%s@%s" % (SAGA_ADAPTOR, USER_NAME, HOSTNAME)
        pilot_description.queue = QUEUE
        pilot_description.number_of_processes = PILOT_SIZE
        pilot_description.working_directory = WORKDIR
        pilot_description.walltime = WALLTIME
        pilot_description.processes_per_node = PROCESSES_PER_NODE
        pilot_description.spmd_variation = SPMD_VARIATION

        # create a new pilot job
        pilot_compute_service = pilot.PilotComputeService(REDIS_URL)
        pilotjob = pilot_compute_service.create_pilot(pilot_description)

        # submit tasks to pilot job
        tasks = list()
        for i in range(0, NUMBER_JOBS):  # iterate over all input files
            directory = files[i].rsplit('/', 1)[0]
            file = files[i].rsplit('/', 1)[1]
            # -------- BEGIN USER DEFINED TASK DESCRIPTION --------- #
            task_desc = pilot.ComputeUnitDescription()
            task_desc.executable = datadir + '/dorunner.sh'
            task_desc.arguments = [datadir + directory + ' ' + file]
            task_desc.environment = {'TASK_NO': i}
            task_desc.number_of_processes = 1
            task_desc.spmd_variation = 'single'  # Valid values are single or mpi
            task_desc.working_directory = datadir + '/' + directory
            task_desc.output = file.rsplit('.', 1)[0] + ".out"
            task_desc.error = file.rsplit('.', 1)[0] + ".err"
            # -------- END USER DEFINED TASK DESCRIPTION --------- #

            task = pilotjob.submit_compute_unit(task_desc)
            print "* Submitted task '%s' with id '%s' to %s" % (i, task.get_id(), HOSTNAME)
            tasks.append(task)

        print "Waiting for tasks to finish..."
        pilotjob.wait()

        return(0)

    except Exception, ex:
        print "AN ERROR OCCURRED: %s" % str(ex)
        # print a stack trace in case of an exception -
        # this can be helpful for debugging the problem
        traceback.print_exc()
        return(-1)

    finally:
        # always try to shut down pilots, otherwise jobs might end up
        # lingering in the queue; guard against failures that occur
        # before the pilot was created
        print "Terminating BigJob..."
        if pilotjob is not None:
            pilotjob.cancel()
        if pilot_compute_service is not None:
            pilot_compute_service.cancel()

if __name__ == "__main__":
    sys.exit(main())
drelu commented 10 years ago

Sorry, I cannot reproduce this from the script alone: I have neither the executable nor the input files. This needs to be narrowed down.
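One way to narrow it down could be a stripped-down submission script that keeps the same Pilot API calls as above but swaps the 4 TB workload for /bin/echo tasks running locally. This is only a sketch: the Redis URL is a placeholder, and the fork adaptor is assumed so that no batch system is needed.

# Hypothetical minimal reproducer: same submission pattern as the
# user's script, but with /bin/echo and a handful of tasks instead
# of the real workload.
import sys
import pilot

REDIS_URL = "redis://password@localhost:6379"  # placeholder, edit as needed

pilot_description = pilot.PilotComputeDescription()
pilot_description.service_url = "fork://localhost"  # run locally, no queue needed
pilot_description.number_of_processes = 2
pilot_description.working_directory = "/tmp/agent"
pilot_description.walltime = 10

pcs = pilot.PilotComputeService(REDIS_URL)
pilotjob = pcs.create_pilot(pilot_description)

for i in range(16):
    task_desc = pilot.ComputeUnitDescription()
    task_desc.executable = "/bin/echo"          # trivial stand-in executable
    task_desc.arguments = ["task %d" % i]
    task_desc.environment = {'TASK_NO': i}
    task_desc.number_of_processes = 1
    task_desc.spmd_variation = 'single'
    task_desc.output = "task-%d.out" % i
    task_desc.error = "task-%d.err" % i
    pilotjob.submit_compute_unit(task_desc)

pilotjob.wait()
pilotjob.cancel()
pcs.cancel()

If the malformed hash shows up in Redis even with this workload, the problem is in the framework; if not, scaling the task count up may be the trigger.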

scamicha commented 10 years ago

Hi there,

I'm the user who originally wrote to the mailing list with this issue. I don't think you'll be able to replicate the problem exactly, as I was attempting to run ~120K subjobs and the input data set is a little over 4 TB in size. I'd be happy to get it to you, but that's probably technically infeasible. I was able to run the pilot job with the debug level at 5. The log file is located at https://iu.box.com/s/3611nik4aoop686vbrn9