lithops-cloud / lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
http://lithops.cloud
Apache License 2.0

AWS batch runtime timeout smaller than set in config #1362

Closed RichardScottOZ closed 3 months ago

RichardScottOZ commented 3 months ago

I ran a batch and it seems to time out at 1800 seconds, not the 3600 set in the config?

```
2024-05-29 06:59:36,140 [INFO] config.py:139 -- Lithops v3.3.0 - Python3.10
2024-05-29 06:59:36,543 [INFO] aws_s3.py:59 -- S3 client created - Region: us-west-2
2024-05-29 06:59:38,905 [INFO] aws_batch.py:89 -- AWS Batch client created - Region: us-west-2 - Env: FARGATE_SPOT
2024-05-29 06:59:38,906 [INFO] invokers.py:107 -- ExecutorID 4003aa-1 | JobID M000 - Selected Runtime: book-mentat-runtime-batch:01 - 2048MB
2024-05-29 06:59:39,129 [INFO] invokers.py:115 -- Runtime book-mentat-runtime-batch:01 with 2048MB is not yet deployed
2024-05-29 06:59:39,129 [INFO] aws_batch.py:431 -- Deploying runtime: book-mentat-runtime-batch:01 - Memory: 2048 Timeout: 3600
2024-05-29 06:59:48,200 [INFO] aws_batch.py:334 -- Extracting metadata from: book-mentat-runtime-batch:01
2024-05-29 07:00:53,098 [INFO] invokers.py:174 -- ExecutorID 4003aa-1 | JobID M000 - Starting function invocation: process_pdf() - Total: 140 activations
2024-05-29 07:00:53,362 [INFO] invokers.py:213 -- ExecutorID 4003aa-1 | JobID M000 - View execution logs at /tmp/lithops-richard/logs/4003aa-1-M000.log
2024-05-29 07:00:53,384 [INFO] executors.py:491 -- ExecutorID 4003aa-1 - Getting results from 140 function activations
2024-05-29 07:00:53,384 [INFO] wait.py:101 -- ExecutorID 4003aa-1 - Waiting for 140 function activations to complete
```

```
2024-05-28 21:31:41,119 [INFO] entry_point.py:52 -- Lithops v3.3.0 - Starting AWS Batch execution
2024-05-28 21:31:41,119 [INFO] entry_point.py:56 -- Job index 0
2024-05-28 21:31:41,295 [INFO] aws_s3.py:59 -- S3 client created - Region: us-west-2
2024-05-28 21:31:41,330 [INFO] utils.py:50 -- Getting function and modules
2024-05-28 21:31:41,330 [INFO] utils.py:65 -- Loading lithops.jobs/4003aa-1/673af84e09230a54d27f7a04b3381bff.func.pickle from storage
2024-05-28 21:31:41,349 [INFO] utils.py:113 -- Loading function data parameters from storage
2024-05-28 21:31:41,382 [INFO] handler.py:75 -- Tasks received: 14 - Worker processes: 1
2024-05-28 21:31:41,382 [INFO] handler.py:117 -- Worker process 0 started
2024-05-28T21:31:41.383Z | environ({'AWS_BATCH_JOB_ARRAY_SIZE': '10', 'HOSTNAME': 'ip-172-31-45-86.us-west-2.compute.internal', 'PYTHON_PIP_VERSION': '23.0.1', 'HOME': '/root', 'GPG_KEY': 'A035C8C19219BA821ECEA86B64E628F8D684696D', 'AWS_CONTAINER_CREDENTIALS_RELATIVE_URI': '/v2/credentials/35e6cbc6-55cf-4602-b944-c2826d5d524c', 'AWS_EXECUTION_ENV': 'AWS_ECS_FARGATE', 'AWS_BATCH_JOB_ID': 'ef105b14-1e8b-4bf8-822f-57550083660a:0', 'ECS_AGENT_URI': 'http://169.254.170.2/api/70a276d5257c477eb96bf1cea050d60c-2470140894', 'AWS_DEFAULT_REGION': 'us-west-2', 'PYTHON_GET_PIP_URL': 'https://github.com/pypa/get-pip/raw/0d8570dc44796f4369b652222cf176b3db6ac70e/public/get-pip.py', 'AWS_BATCH_JQ_NAME': 'lithops_v330_vztq_FARGATE-SPOT_queue', 'ECS_CONTAINER_METADATA_URI_V4': 'http://169.254.170.2/v4/70a276d5257c477eb96bf1cea050d60c-2470140894', 'APP_HOME': '/lithops', 'ECS_CONTAINER_METADATA_URI': 'http://169.254.170.2/v3/70a276d5257c477eb96bf1cea050d60c-2470140894', 'PATH': '/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'AWS_BATCH_JOB_ARRAY_INDEX': '0', 'LANG': 'C.UTF-8', 'AWS_BATCH_JOB_ATTEMPT': '1', '__LITHOPS_PAYLOAD': '{"config": {"lithops": {"backend": "aws_batch", "storage": "aws_s3", "mode": "serverless", "monitoring": "storage", "execution_timeout": 1800, "chunksize": 1}, "aws_batch": {"region": "us-west-2", "assign_public_ip": true, "execution_role": "arn:aws:iam::958539196701:role/ecsTaskExecutionRoleLithops", "instance_role": "arn:aws:iam::958539196701:role/ecsInstanceRolelithops", "job_role": "arn:aws:iam::958539196701:role/ecsTaskJobRolelithops", "subnets": ["subnet-70515309", "subnet-4ad6fa01", "subnet-4a342910", "subnet-bd069396"], "security_groups": ["sg-0eabf99e212627078"], "runtime": "book-mentat-runtime-batch:01", "runtime_memory": 2048, "runtime_timeout": 3600, "worker_processes": 1, "container_vcpus": 1, "service_role": "None", "env_total_cpus": 140, "env_type": "FARGATE_SPOT", "env_max_cpus": 10, "max_workers": 10, 
"user_agent": "lithops/3.3.0"}, "aws": {"region": "us-west-2"}, "aws_s3": {"storage_bucket": "lithops-data-books", "region": "us-west-2", "user_agent": "lithops/3.3.0"}}, "chunksize": 14, "log_level": 20, "func_name": "process_pdf", "func_key": "lithops.jobs/4003aa-1/673af84e09230a54d27f7a04b3381bff.func.pickle", "data_key": "lithops.jobs/4003aa-1-M000/aggdata.pickle", "extra_env": {}, "total_calls": 140, "execution_timeout": 1800, "data_byte_ranges": [[0, 158], [159, 319], [320, 474], [475, 620], [621, 779], [780, 929], [930, 1086], [1087, 1224], [1225, 1373], [1374, 1522], [1523, 1681], [1682, 1836], [1837, 1992], [1993, 2136], [2137, 2284], [2285, 2435], [2436, 2577], [2578, 2725], [2726, 2881], [2882, 3017], [3018, 3165], [3166, 3316], [3317, 3470], [3471, 3625], [3626, 3780], [3781, 3933], [3934, 4086], [4087, 4224], [4225, 4381], [4382, 4537], [4538, 4683], [4684, 4840], [4841, 4994], [4995, 5155], [5156, 5322], [5323, 5474], [5475, 5626], [5627, 5778], [5779, 5930], [5931, 6060], [6061, 6229], [6230, 6398], [6399, 6524], [6525, 6675], [6676, 6825], [6826, 6973], [6974, 7121], [7122, 7245], [7246, 7383], [7384, 7493], [7494, 7643], [7644, 7793], [7794, 7943], [7944, 8093], [8094, 8248], [8249, 8402], [8403, 8555], [8556, 8701], [8702, 8868], [8869, 9035], [9036, 9183], [9184, 9329], [9330, 9478], [9479, 9626], [9627, 9772], [9773, 9896], [9897, 10046], [10047, 10168], [10169, 10298], [10299, 10440], [10441, 10596], [10597, 10746], [10747, 10906], [10907, 11067], [11068, 11226], [11227, 11372], [11373, 11520], [11521, 11677], [11678, 11795], [11796, 11931], [11932, 12067], [12068, 12221], [12222, 12388], [12389, 12545], [12546, 12691], [12692, 12843], [12844, 12983], [12984, 13130], [13131, 13277], [13278, 13428], [13429, 13577], [13578, 13724], [13725, 13872], [13873, 14010], [14011, 14152], [14153, 14310], [14311, 14468], [14469, 14626], [14627, 14776], [14777, 14922], [14923, 15080], [15081, 15238], [15239, 15396], [15397, 15534], [15535, 15670], [15671, 
15810], [15811, 15954], [15955, 16086], [16087, 16234], [16235, 16383], [16384, 16540], [16541, 16696], [16697, 16852], [16853, 17008], [17009, 17167], [17168, 17324], [17325, 17488], [17489, 17662], [17663, 17836], [17837, 18010], [18011, 18170], [18171, 18344], [18345, 18519], [18520, 18679], [18680, 18832], [18833, 18956], [18957, 19110], [19111, 19258], [19259, 19413], [19414, 19569], [19570, 19703], [19704, 19847], [19848, 19998], [19999, 20124], [20125, 20274], [20275, 20429], [20430, 20573], [20574, 20719], [20720, 20859], [20860, 21014]], "executor_id": "4003aa-1", "job_id": "M000", "job_key": "4003aa-1-M000", "max_workers": 10, "call_ids": ["00000", "00001", "00002", "00003", "00004", "00005", "00006", "00007", "00008", "00009", "00010", "00011", "00012", "00013", "00014", "00015", "00016", "00017", "00018", "00019", "00020", "00021", "00022", "00023", "00024", "00025", "00026", "00027", "00028", "00029", "00030", "00031", "00032", "00033", "00034", "00035", "00036", "00037", "00038", "00039", "00040", "00041", "00042", "00043", "00044", "00045", "00046", "00047", "00048", "00049", "00050", "00051", "00052", "00053", "00054", "00055", "00056", "00057", "00058", "00059", "00060", "00061", "00062", "00063", "00064", "00065", "00066", "00067", "00068", "00069", "00070", "00071", "00072", "00073", "00074", "00075", "00076", "00077", "00078", "00079", "00080", "00081", "00082", "00083", "00084", "00085", "00086", "00087", "00088", "00089", "00090", "00091", "00092", "00093", "00094", "00095", "00096", "00097", "00098", "00099", "00100", "00101", "00102", "00103", "00104", "00105", "00106", "00107", "00108", "00109", "00110", "00111", "00112", "00113", "00114", "00115", "00116", "00117", "00118", "00119", "00120", "00121", "00122", "00123", "00124", "00125", "00126", "00127", "00128", "00129", "00130", "00131", "00132", "00133", "00134", "00135", "00136", "00137", "00138", "00139"], "host_submit_tstamp": 1716931853.098693, "lithops_version": "3.3.0", 
"runtime_name": "book-mentat-runtime-batch:01", "runtime_memory": 2048, "worker_processes": 1}', 'PYTHON_VERSION': '3.10.12', 'PYTHON_SETUPTOOLS_VERSION': '65.5.1', 'AWS_REGION': 'us-west-2', '__LITHOPS_ACTION': 'run_job', 'PWD': '/lithops', 'PYTHON_GET_PIP_SHA256': '96461deced5c2a487ddc65207ec5a9cffeca0d34e7af7ea1afc470ff0d746207', 'AWS_BATCH_CE_NAME': 'lithops_v330_vztq_FARGATE-SPOT_env'})
2024-05-28 21:31:41,383 [INFO] handler.py:177 -- Lithops v3.3.0 - Starting AWS Batch execution
2024-05-28 21:31:41,383 [INFO] handler.py:178 -- Execution ID: 4003aa-1-M000/00000
2024-05-28 21:31:41,453 [INFO] aws_s3.py:59 -- S3 client created - Region: us-west-2
2024-05-28 21:31:41,594 [INFO] aws_s3.py:59 -- S3 client created - Region: us-west-2
2024-05-28 21:31:41,595 [INFO] jobrunner.py:233 -- Going to execute 'process_pdf()'
2024-05-28T21:31:41.595Z | ---------------------- FUNCTION LOG ----------------------
2024-05-28T21:31:41.595Z | FILE BEING PROCESSED: books/Calibre Library/Aaron Rosenberg/Non-Fiction - GM Mastery Gamemaste (20989)/Non-Fiction - GM Mastery Gamema - Aaron Rosenberg.pdf
2024-05-28T21:31:41.954Z | FILE PROCESSING: books/Calibre Library/Aaron Rosenberg/Non-Fiction - GM Mastery Gamemaste (20989)/Non-Fiction - GM Mastery Gamema - Aaron Rosenberg.pdf 20989
2024-05-28 21:31:42,128 [INFO] aws_s3.py:59 -- S3 client created - Region: us-west-2
2024-05-28T21:31:42.541Z | File has been written to books/Calibre Library/Aaron Rosenberg/Non-Fiction - GM Mastery Gamemaste (20989)/Non-Fiction - GM Mastery Gamema - Aaron Rosenberg.pdf
2024-05-28T22:01:41.615Z | ----------------------- EXCEPTION !-----------------------
2024-05-28T22:01:41.615Z | Traceback (most recent call last):
2024-05-28T22:01:41.615Z |   File "/lithops/lithops/worker/handler.py", line 242, in run_task
2024-05-28T22:01:41.615Z |     raise TimeoutError('HANDLER', msg)
2024-05-28T22:01:41.615Z | TimeoutError: [Errno HANDLER] Function exceeded maximum time of 1800 seconds and was killed
```
RichardScottOZ commented 3 months ago

There is a RUNTIME_TIMEOUT_MAX of two hours in the code for the Batch backend.

AWS Batch itself has no such hard limit, so if you want to run longer jobs, should this constant be changed or bypassed?
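For what it's worth, a cap like that is normally applied when the runtime is deployed. A minimal sketch of the pattern (the `clamp_runtime_timeout` function is hypothetical, not Lithops code; only the two-hour value reflects the RUNTIME_TIMEOUT_MAX mentioned above):

```python
# Hypothetical illustration of a backend-level cap on the deployed runtime
# timeout. Only the two-hour value comes from the comment above; the
# function itself is illustrative, not Lithops source.
RUNTIME_TIMEOUT_MAX = 2 * 60 * 60  # 7200 seconds

def clamp_runtime_timeout(requested_seconds: int) -> int:
    """Return the timeout a capped backend would actually deploy with."""
    return min(requested_seconds, RUNTIME_TIMEOUT_MAX)

print(clamp_runtime_timeout(3600))   # 3600 - under the cap, deployed as requested
print(clamp_runtime_timeout(10800))  # 7200 - three hours would be silently capped
```

Note that 3600 is below the cap, so this limit cannot be what killed the job at 1800 seconds.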

RichardScottOZ commented 3 months ago
__LITHOPS_ACTION | run_job
-- | --
__LITHOPS_PAYLOAD | {"config": {"lithops": {"backend": "aws_batch", "storage": "aws_s3", "mode": "serverless", "monitoring": "storage", "execution_timeout": 1800, "chunksize": 1},
RichardScottOZ commented 3 months ago

Not sure where 1800 comes from, as it is neither 180 nor 3600 - though it is 180 x 10 (and 10 is a CPU default).
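One guess, as a hypothetical sketch (names here are illustrative, not Lithops internals): 1800 could simply be a fallback used whenever `execution_timeout` is absent from the `lithops` section of the config, in which case a `runtime_timeout` set only in the backend section would never change it:

```python
# Hypothetical sketch: resolve a worker-side timeout from the config with a
# 1800-second fallback. The function name and lookup path are illustrative.
EXECUTION_TIMEOUT_DEFAULT = 1800  # seconds

def resolve_execution_timeout(config: dict) -> int:
    return config.get('lithops', {}).get('execution_timeout', EXECUTION_TIMEOUT_DEFAULT)

cfg = {
    'lithops': {'backend': 'aws_batch'},       # no execution_timeout set here
    'aws_batch': {'runtime_timeout': 3600},    # backend-level timeout only
}
print(resolve_execution_timeout(cfg))  # 1800 - the backend timeout is never consulted
```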

RichardScottOZ commented 3 months ago

The only other thing I can think of is a timeout set somewhere when the container was built?

The runtime .json file in S3 says 3600 anyway:

"python_version": "3.10", "lithops_version": "3.3.0", "runtime_timeout": 3600}
RichardScottOZ commented 3 months ago

I chose a new memory limit (double the previous one), as I have a subset still to do that needs more - kept the config timeout at 3600 - and it is still timing out at 1800 seconds.

RichardScottOZ commented 3 months ago

built a container with a different name and still got the 1800 timeout - so I must have something else wrong

RichardScottOZ commented 3 months ago
environ({'AWS_BATCH_JOB_ARRAY_SIZE': '10', 'HOSTNAME':
etc.
'AWS_BATCH_JOB_ARRAY_INDEX': '9', 'LANG': 'C.UTF-8', 'AWS_BATCH_JOB_ATTEMPT': '1', '__LITHOPS_PAYLOAD': '{"config": {"lithops": {"backend": "aws_batch", "storage": "aws_s3", "mode": "serverless", "monitoring": "storage", "execution_timeout": 1800, "chunksize": 14}
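Since the payload arrives on the worker via the `__LITHOPS_PAYLOAD` environment variable shown in these dumps, the effective timeout can be checked directly on a worker. An illustrative snippet (the payload contents here are a stand-in for the real dump):

```python
import json
import os

# Stand-in payload mimicking the relevant part of the environ dump above;
# on a real worker, __LITHOPS_PAYLOAD is set by Lithops itself.
os.environ['__LITHOPS_PAYLOAD'] = json.dumps({
    'config': {'lithops': {'backend': 'aws_batch', 'execution_timeout': 1800}},
    'execution_timeout': 1800,
})

payload = json.loads(os.environ['__LITHOPS_PAYLOAD'])
print(payload['execution_timeout'])  # 1800 - the limit actually in force
```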
RichardScottOZ commented 3 months ago

I have certainly not got my head around this yet, but looking at invokers.py - should this payload include a job.runtime_timeout?

def _create_payload(self, job):
        """
        Creates the default payload dictionary
        """
        payload = {
            'config': self.config,
            'chunksize': job.chunksize,
            'log_level': self.log_level,
            'func_name': job.function_name,
            'func_key': job.func_key,
            'data_key': job.data_key,
            'extra_env': job.extra_env,
            'total_calls': job.total_calls,
            'execution_timeout': job.execution_timeout,
            'data_byte_ranges': job.data_byte_ranges,
            'executor_id': job.executor_id,
            'job_id': job.job_id,
            'job_key': job.job_key,
            'max_workers': self.max_workers,
            'call_ids': None,
            'host_submit_tstamp': time.time(),
            'lithops_version': __version__,
            'runtime_name': job.runtime_name,
            'runtime_memory': job.runtime_memory,
            'worker_processes': job.worker_processes
        }

This is whatever gets passed to the task - which I think is a child of the main job?

def run_task(task):
    """
    Runs a single job within a separate process
    """
    setup_lithops_logger(task.log_level)

    backend = os.environ.get('__LITHOPS_BACKEND', '')
    logger.info(f"Lithops v{__version__} - Starting {backend} execution")
    logger.info(f"Execution ID: {task.job_key}/{task.call_id}")

    env = task.extra_env
    env['LITHOPS_CONFIG'] = json.dumps(task.config)
    env['__LITHOPS_SESSION_ID'] = '-'.join([task.job_key, task.call_id])
    os.environ.update(env)

    storage_config = extract_storage_config(task.config)
    internal_storage = InternalStorage(storage_config)
    call_status = create_call_status(task, internal_storage)

    runtime_name = task.runtime_name
    memory = task.runtime_memory
    timeout = task.execution_timeout

    if task.runtime_memory:
        logger.debug(f'Runtime: {runtime_name} - Memory: {memory}MB - Timeout: {timeout} seconds')
    else:
        logger.debug(f'Runtime: {runtime_name} - Timeout: {timeout} seconds')

    job_interruped = False

Here timeout is the execution timeout - which is 1800 in the JSON - and I am not sure yet where that value comes from.
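For contrast with the provider-side runtime_timeout, a worker-side execution timeout can be enforced by the handler process itself. A simplified, hypothetical sketch of that pattern (not the actual handler.py logic, which is more involved):

```python
import multiprocessing as mp
import time

def enforce_execution_timeout(fn, args, execution_timeout):
    """Run fn in a child process and kill it past execution_timeout seconds.

    Hypothetical sketch of a worker-side timeout: the function is killed
    by the worker itself, regardless of the container-level timeout.
    """
    proc = mp.Process(target=fn, args=args)
    proc.start()
    proc.join(execution_timeout)
    if proc.is_alive():
        proc.terminate()
        proc.join()
        raise TimeoutError(
            f'Function exceeded maximum time of {execution_timeout} seconds and was killed')

if __name__ == '__main__':
    try:
        enforce_execution_timeout(time.sleep, (5,), execution_timeout=1)
    except TimeoutError as err:
        print(err)
```

This matches what the traceback above shows: the kill comes from handler.py inside the worker, not from AWS Batch.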

JosepSampe commented 3 months ago

@RichardScottOZ Note that there are 2 timeouts in Lithops. One is the runtime_timeout, specified at the backend level in the config; it is set during the deployment of the runtime and is applied by the cloud provider. A different timeout is the execution_timeout, which is 1800s by default. You have to set this timeout in the lithops section of the config. This timeout, in contrast to the runtime_timeout, is applied by Lithops itself.

So, based on your logs, you have to put in your config:

lithops:
    execution_timeout: 3600
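In other words, with both limits set to an hour, the relevant config sections would look roughly like this (the aws_batch entry just restates the runtime_timeout already visible in the logs above):

```yaml
lithops:
    execution_timeout: 3600    # applied by Lithops inside the worker

aws_batch:
    runtime_timeout: 3600      # applied by AWS when the runtime is deployed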
RichardScottOZ commented 3 months ago

thank you!