oleweidner opened this issue 10 years ago
The complete job script:
import os
import commands
import sys
import pilot
import traceback

""" DESCRIPTION: Tutorial 1: A Simple Workload
Note: User must edit USER VARIABLES section
This example will not run if these values are not set.
"""

# ---------------- BEGIN REQUIRED PILOT SETUP -----------------

# Distributed Coordination Service - Redis server and password
REDIS_PWD = "ILikeBigJob_wITH-REdIS"  # Fill in the password to your redis server
REDIS_URL = "redis://%s@gw68.quarry.iu.teragrid.org:6379" % REDIS_PWD

# Resource Information
HOSTNAME = "localhost"      # Remote Resource URL
USER_NAME = "scamicha"      # Username on the remote resource
SAGA_ADAPTOR = "pbs"        # Name of the SAGA adaptor, e.g. fork, sge, pbs, slurm, etc.
# NOTE: See complete list of BigJob supported SAGA adaptors at:
# http://saga-project.github.io/BigJob/sphinxdoc/tutorial/table.html

# Fill in queue and allocation for the given resource
# Note: Set fields to "None" if not applicable
QUEUE = "batch"             # Add queue you want to use
PROJECT = "None"            # Add project / allocation / account to charge
WALLTIME = 1440             # Maximum Runtime (minutes) for the Pilot Job
WORKDIR = os.getenv("HOME") + "/agent"  # Path of Resource Working Directory
# This is the directory where BigJob will store its output and error files
SPMD_VARIATION = "None"     # Specify the WAYNESS of SGE clusters ONLY, e.g. '12way'
PROCESSES_PER_NODE = 8      # Valid on PBS clusters ONLY - the number of processors
                            # per node. One processor core is treated as one processor
                            # on PBS; e.g. a node with 8 cores has a maximum ppn=8
PILOT_SIZE = 128            # Number of cores required for the Pilot-Job

# Job Information
datadir = "/N/dc2/projects/BDBS/cijohnson/"
files = []
os.chdir(datadir)
input = open('files.todo.01', 'r')
for line in input:
    files.append(line.strip())  # strip the trailing newline so the paths split cleanly
NUMBER_JOBS = len(files)

# Continue to USER DEFINED TASK DESCRIPTION to add
# the required information about the individual tasks.

# ---------------- END REQUIRED PILOT SETUP -----------------
#

def main():
    try:
        # this describes the parameters and requirements for our pilot job
        pilot_description = pilot.PilotComputeDescription()
        pilot_description.service_url = "%s://%s@%s" % (SAGA_ADAPTOR, USER_NAME, HOSTNAME)
        pilot_description.queue = QUEUE
        pilot_description.number_of_processes = PILOT_SIZE
        pilot_description.working_directory = WORKDIR
        pilot_description.walltime = WALLTIME
        pilot_description.processes_per_node = PROCESSES_PER_NODE
        pilot_description.spmd_variation = SPMD_VARIATION

        # create a new pilot job
        pilot_compute_service = pilot.PilotComputeService(REDIS_URL)
        pilotjob = pilot_compute_service.create_pilot(pilot_description)

        # submit tasks to pilot job
        tasks = list()
        for i in range(NUMBER_JOBS):  # note: range(0, NUMBER_JOBS-1) would skip the last file
            directory = files[i].rsplit('/', 1)[0]
            file = files[i].rsplit('/', 1)[1]
            # -------- BEGIN USER DEFINED TASK DESCRIPTION --------- #
            task_desc = pilot.ComputeUnitDescription()
            task_desc.executable = datadir + '/dorunner.sh'
            task_desc.arguments = [datadir + directory, file]  # pass as separate arguments
            task_desc.environment = {'TASK_NO': i}
            task_desc.number_of_processes = 1
            task_desc.spmd_variation = 'single'  # Valid values are single or mpi
            task_desc.working_directory = datadir + '/' + directory
            task_desc.output = file.rsplit('.', 1)[0] + ".out"
            task_desc.error = file.rsplit('.', 1)[0] + ".err"
            # -------- END USER DEFINED TASK DESCRIPTION --------- #
            task = pilotjob.submit_compute_unit(task_desc)
            print "* Submitted task '%s' with id '%s' to %s" % (i, task.get_id(), HOSTNAME)
            tasks.append(task)

        print "Waiting for tasks to finish..."
        pilotjob.wait()
        return 0
    except Exception, ex:
        print "AN ERROR OCCURRED: %s" % str(ex)
        # print a stack trace in case of an exception -
        # this can be helpful for debugging the problem
        traceback.print_exc()
        return -1
    finally:
        # always try to shut down pilots, otherwise jobs might end up
        # lingering in the queue
        print "Terminating BigJob..."
        pilotjob.cancel()
        pilot_compute_service.cancel()

if __name__ == "__main__":
    sys.exit(main())
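As a side note, the path handling the script depends on can be exercised without BigJob or the 4 TB data set. The sketch below isolates just that logic; `split_entry` is a hypothetical helper introduced here for illustration and is not part of the original script. It also shows why stripping the trailing newline matters: without it, the filename carries a `'\n'` into the ComputeUnitDescription fields.

```python
def split_entry(line):
    """Split one line of the todo file into (directory, filename).

    line.strip() removes the trailing newline that file iteration
    leaves in place; rsplit('/', 1) then separates the last path
    component from its parent directory.
    """
    path = line.strip()
    directory, filename = path.rsplit('/', 1)
    return directory, filename

# Example: one line as it would come out of files.todo.01
entry = "subdir01/image_0001.fits\n"
print(split_entry(entry))  # -> ('subdir01', 'image_0001.fits')

# The output/error names are derived by dropping the extension:
directory, filename = split_entry(entry)
print(filename.rsplit('.', 1)[0] + ".out")  # -> image_0001.out
```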
Sorry, I cannot replicate this based on this script. I have neither the executable nor the input files. This needs to be narrowed down.
Hi there,
I'm the user who originally wrote in to the mailing list with this issue. I don't think you'll be able to replicate this problem exactly, as I was attempting to run ~120K subjobs and the input data set is a little over 4 TB in size. I'd be happy to get it to you, but it's probably technically infeasible at best. I was able to run the pilot job with the debug level at 5. The log file is located at https://iu.box.com/s/3611nik4aoop686vbrn9
This error was reported in bigjob-users by Scott Michael:
KeyError in agent:
Strange CU description in Redis:
instead of:
this: