Closed · RichardScottOZ closed this 3 months ago
There is a RUNTIME_TIMEOUT_MAX of two hours in the code for the batch backend.
AWS Batch has no actual limit, so if I want to run longer jobs, should I change or bypass this?
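For reference, the kind of cap being asked about can be sketched like this (an illustration only; the constant value and the clamping function are assumptions, not the actual aws_batch backend code):

```python
# Hypothetical illustration of a backend-side cap on the runtime timeout.
# RUNTIME_TIMEOUT_MAX mirrors the 2-hour constant mentioned above.
RUNTIME_TIMEOUT_MAX = 7200  # seconds (2 hours); assumed value

def effective_runtime_timeout(requested: int) -> int:
    """Clamp a requested runtime timeout to the backend maximum."""
    return min(requested, RUNTIME_TIMEOUT_MAX)

print(effective_runtime_timeout(3600))   # within the cap
print(effective_runtime_timeout(14400))  # capped at 7200
```

If the backend clamps like this, any configured timeout above two hours would be silently reduced, which is why bypassing the constant (rather than raising the config value) would be needed for longer runs.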
__LITHOPS_ACTION | run_job
-- | --
__LITHOPS_PAYLOAD | {"config": {"lithops": {"backend": "aws_batch", "storage": "aws_s3", "mode": "serverless", "monitoring": "storage", "execution_timeout": 1800, "chunksize": 1},
Not sure where 1800 comes from, as it is neither 180 nor 3600 - but it is 180 × 10 (which is a CPU default).
The only other thing I can think of is a timeout set somewhere when the container was built - is there one?
The runtime .json file in S3 says 3600 anyway:
"python_version": "3.10", "lithops_version": "3.3.0", "runtime_timeout": 3600}
I chose a new memory limit (double the previous one), as I have a subset still to do that needs more. I kept the config timeout at 3600, and it is still timing out at 1800.
I built a container with a different name and it still shows the 1800 timeout - so I must have something else wrong.
environ({'AWS_BATCH_JOB_ARRAY_SIZE': '10', 'HOSTNAME':
etc.
'AWS_BATCH_JOB_ARRAY_INDEX': '9', 'LANG': 'C.UTF-8', 'AWS_BATCH_JOB_ATTEMPT': '1', '__LITHOPS_PAYLOAD': '{"config": {"lithops": {"backend": "aws_batch", "storage": "aws_s3", "mode": "serverless", "monitoring": "storage", "execution_timeout": 1800, "chunksize": 14}
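To double-check which `execution_timeout` a worker actually received, the injected payload can be read back from the environment (a sketch; the environment variable is stubbed here with a minimal payload so it runs outside a Batch job):

```python
import json
import os

# Stub __LITHOPS_PAYLOAD so this runs outside an actual Batch container;
# inside the job the variable is already set by Lithops.
os.environ.setdefault('__LITHOPS_PAYLOAD', json.dumps(
    {"config": {"lithops": {"backend": "aws_batch",
                            "execution_timeout": 1800}}}))

payload = json.loads(os.environ['__LITHOPS_PAYLOAD'])
timeout = payload['config']['lithops']['execution_timeout']
print(f"worker execution_timeout: {timeout}")  # 1800 in the logs above
```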
I have not fully got my head around this yet, but looking at invokers.py - should this payload include a job.runtime_timeout?
def _create_payload(self, job):
    """
    Creates the default payload dictionary
    """
    payload = {
        'config': self.config,
        'chunksize': job.chunksize,
        'log_level': self.log_level,
        'func_name': job.function_name,
        'func_key': job.func_key,
        'data_key': job.data_key,
        'extra_env': job.extra_env,
        'total_calls': job.total_calls,
        'execution_timeout': job.execution_timeout,
        'data_byte_ranges': job.data_byte_ranges,
        'executor_id': job.executor_id,
        'job_id': job.job_id,
        'job_key': job.job_key,
        'max_workers': self.max_workers,
        'call_ids': None,
        'host_submit_tstamp': time.time(),
        'lithops_version': __version__,
        'runtime_name': job.runtime_name,
        'runtime_memory': job.runtime_memory,
        'worker_processes': job.worker_processes
    }
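If the payload were meant to also carry the backend-level timeout, the change being asked about would amount to one extra key. A standalone sketch (this is a hypothetical modification, not existing Lithops code, and `runtime_timeout` may not be an attribute of the real job object):

```python
import time

def create_payload_sketch(job: dict, config: dict) -> dict:
    """Trimmed-down stand-in for _create_payload with the extra field."""
    return {
        'config': config,
        'chunksize': job['chunksize'],
        'execution_timeout': job['execution_timeout'],
        'runtime_timeout': job.get('runtime_timeout'),  # hypothetical addition
        'host_submit_tstamp': time.time(),
    }

payload = create_payload_sketch(
    {'chunksize': 1, 'execution_timeout': 1800, 'runtime_timeout': 3600},
    {'lithops': {'backend': 'aws_batch'}})
print(payload['runtime_timeout'])
```

With a key like this, the worker could log both timeouts and make it obvious which one actually fired.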
This is whatever goes to the task - which I think is a child of the main job?
def run_task(task):
    """
    Runs a single job within a separate process
    """
    setup_lithops_logger(task.log_level)

    backend = os.environ.get('__LITHOPS_BACKEND', '')
    logger.info(f"Lithops v{__version__} - Starting {backend} execution")
    logger.info(f"Execution ID: {task.job_key}/{task.call_id}")

    env = task.extra_env
    env['LITHOPS_CONFIG'] = json.dumps(task.config)
    env['__LITHOPS_SESSION_ID'] = '-'.join([task.job_key, task.call_id])
    os.environ.update(env)

    storage_config = extract_storage_config(task.config)
    internal_storage = InternalStorage(storage_config)
    call_status = create_call_status(task, internal_storage)

    runtime_name = task.runtime_name
    memory = task.runtime_memory
    timeout = task.execution_timeout

    if task.runtime_memory:
        logger.debug(f'Runtime: {runtime_name} - Memory: {memory}MB - Timeout: {timeout} seconds')
    else:
        logger.debug(f'Runtime: {runtime_name} - Timeout: {timeout} seconds')

    job_interruped = False
Here timeout is the execution timeout - which is the 1800 in the JSON - and I am not sure where it comes from yet.
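A quick way to see how an unexplained 1800 can appear is a default-resolution pattern: if the user config does not set a value, a built-in default is used. This is a sketch under that assumption; the constant name is illustrative, not taken from the Lithops source:

```python
# If the config omits execution_timeout, fall back to a built-in default.
# 1800 here matches the value seen in the payloads above; the constant
# name is an assumption for illustration.
EXECUTION_TIMEOUT_DEFAULT = 1800

def resolve_execution_timeout(user_config: dict) -> int:
    """Return the configured execution_timeout, or the default."""
    return user_config.get('lithops', {}).get(
        'execution_timeout', EXECUTION_TIMEOUT_DEFAULT)

print(resolve_execution_timeout({}))                                        # 1800
print(resolve_execution_timeout({'lithops': {'execution_timeout': 3600}}))  # 3600
```

Under this pattern, a 3600 set in the wrong config section would be ignored and the 1800 default would win, which matches the behaviour described above.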
@RichardScottOZ Note that there are 2 timeouts in Lithops. One is the runtime_timeout, specified at the backend level in the config. This timeout is set during the deployment of the runtime, and it is applied by the cloud provider. A different timeout is the execution_timeout, which is 1800s by default. You have to set this timeout in the lithops section of the config. This timeout, in contrast to the runtime_timeout, is applied by Lithops.
So, based on your logs, you have to put in your config:
lithops:
    execution_timeout: 3600
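The same fix can be expressed as a config dict passed to the executor (the commented-out `lithops.FunctionExecutor` call needs AWS credentials to actually run; the dict keys mirror the YAML above and the logged payload):

```python
# Programmatic equivalent of the YAML fix: set execution_timeout in the
# 'lithops' section of the config passed to FunctionExecutor.
config = {
    'lithops': {
        'backend': 'aws_batch',
        'storage': 'aws_s3',
        'execution_timeout': 3600,  # Lithops-side timeout (default is 1800)
    },
}

# import lithops
# fexec = lithops.FunctionExecutor(config=config)  # requires AWS credentials
print(config['lithops']['execution_timeout'])
```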
thank you!
I ran a batch and it seems to time out at 1800 seconds, not at the 3600 I set?