lithops-cloud / lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
http://lithops.cloud
Apache License 2.0

aws_batch extract-metadata timeout #1359

Closed RichardScottOZ closed 3 months ago

RichardScottOZ commented 3 months ago

Hi - an AWS configuration problem on my end, I would think.

However, I was wondering if anyone has any insight here?

2024-05-27 23:38:18,585 [INFO] aws_s3.py:59 -- S3 client created - Region: us-west-2
2024-05-27 23:38:20,691 [INFO] aws_batch.py:89 -- AWS Batch client created - Region: us-west-2 - Env: FARGATE_SPOT
READING: books/Calibre Library/
2024-05-27 23:38:21,387 [INFO] aws_s3.py:59 -- S3 client created - Region: us-west-2
['books/Calibre Library/A. P. Klosky/Cold Steel Wardens (22904)/Cold Steel Wardens - A. P. Klosky.pdf', 'books/Calibre Library/A. P. Klosky/Cold Steel Wardens CSW Cover (12-2 (23034)/Cold Steel Wardens CSW Cover (1 - A. P. Klosky.pdf', "books/Calibre Library/A. R. Holmes/T&T - Vaults of K'Horror gm solo a (25119)/T&T - Vaults of K'Horror gm sol - A. R. Holmes.pdf"]
2024-05-27 23:38:30,333 [INFO] config.py:139 -- Lithops v3.3.0 - Python3.10
2024-05-27 23:38:31,088 [INFO] aws_s3.py:59 -- S3 client created - Region: us-west-2
2024-05-27 23:38:33,095 [INFO] aws_batch.py:89 -- AWS Batch client created - Region: us-west-2 - Env: FARGATE_SPOT
2024-05-27 23:38:33,097 [INFO] invokers.py:107 -- ExecutorID a3cd50-1 | JobID M000 - Selected Runtime: book-mentat-runtime-batch:01 - 2048MB
2024-05-27 23:38:33,313 [INFO] invokers.py:115 -- Runtime book-mentat-runtime-batch:01 with 2048MB is not yet deployed
2024-05-27 23:38:33,313 [INFO] aws_batch.py:428 -- Deploying runtime: book-mentat-runtime-batch:01 - Memory: 2048 Timeout: 180
2024-05-27 23:38:34,502 [INFO] aws_batch.py:333 -- Extracting metadata from: book-mentat-runtime-batch:01
Traceback (most recent call last):
  File "/home/richard/book-mentat/src/games/lithops_test_multimap.py", line 105, in <module>
    fexec2.map(process_pdf, plist)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/lithops/executors.py", line 254, in map
    runtime_meta = self.invoker.select_runtime(job_id, runtime_memory)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/lithops/invokers.py", line 116, in select_runtime
    runtime_meta = self.compute_handler.deploy_runtime(self.runtime_name, runtime_memory, runtime_timeout)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/lithops/serverless/serverless.py", line 84, in deploy_runtime
    return self.backend.deploy_runtime(runtime_name, memory, timeout=timeout)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/lithops/serverless/backends/aws_batch/aws_batch.py", line 433, in deploy_runtime
    runtime_meta = self._generate_runtime_meta(runtime_name, memory)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/lithops/serverless/backends/aws_batch/aws_batch.py", line 373, in _generate_runtime_meta
    raise Exception('Could not get metadata')
Exception: Could not get metadata

First time using Batch with Lithops - Lambda is working fine for the same thing.

Thanks very much.

RichardScottOZ commented 3 months ago

Config file like this:

lithops:
    backend: aws_batch
    storage: aws_s3

aws_batch:
    region : us-west-2
    assign_public_ip: true
    execution_role: arn:aws:iam::ecsTaskExecutionRoleLithops
    instance_role: arn:aws:iam::ecsInstanceRolelithops
    subnets:
        - subnet-
        - subnet-
        - subnet-
        - subnet-
    security_groups:
        - sg-0
    runtime: book-mentat-runtime-batch:01
    runtime_memory: 2048
    worker_processes: 1
    container_vcpus: 1
    service_role: None
    env_total_cpus: 10
    env_type: FARGATE_SPOT

aws:
    region: us-west-2

aws_s3:
    storage_bucket: lithopsdata

with appropriate boring account numbers etc. removed

RichardScottOZ commented 3 months ago

This error, which I will have to look into more tomorrow:

lithops.storage.utils.StorageNoSuchKeyError: No such key /lithopsdata/book-mentat-runtime-batch:01.meta found in storage.
JosepSampe commented 3 months ago

I suspect the error is a missing IAM role in the job definition. If you put your access and secret keys in the Lithops config file, it will probably work, for example:

aws:
    access_key_id : AJUYHAGAUD5541654AHAL5JI3WU5O
    secret_access_key : YHhL4ffgfgl94UvOvkZasdfasdaVGAlgdasfdsfdsfm08+ujj
    region: us-west-2
JosepSampe commented 3 months ago

As suspected, there was a missing role in the job definition. I added a fix for this. Follow the instructions to create the required ecsTaskJobRole role and provide it in the config, so that you don't need to put the AWS credentials in the Lithops config as I suggested before. For example:

aws_batch:
    execution_role: arn:aws:iam::691987788901:role/ecsTaskExecutionRole 
    instance_role: arn:aws:iam::691987788901:role/ecsInstanceRole
    job_role: arn:aws:iam::691987788901:role/ecsTaskJobRole
    security_groups:
        - sg-6510771e 
    subnets:
        - subnet-7b7eb40d
        - subnet-5935797d
        - subnet-d2b7348a
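
For reference, a minimal boto3 sketch of creating such a job role (the role name and the broad AmazonS3FullAccess managed policy are illustrative assumptions here; follow the Lithops AWS Batch docs for the exact permissions your functions need):

import json
import boto3

iam = boto3.client('iam')

# Trust policy so that the ECS (Fargate) tasks launched by Batch can assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ecs-tasks.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

role = iam.create_role(
    RoleName='ecsTaskJobRole',
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Give the containers access to the Lithops storage bucket (broad policy used for brevity)
iam.attach_role_policy(
    RoleName='ecsTaskJobRole',
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',
)

print(role['Role']['Arn'])  # value to put under aws_batch.job_role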
RichardScottOZ commented 3 months ago

Thanks! Will give that a shot this morning.

RichardScottOZ commented 3 months ago

OK, added that role and the line in aws_batch.py - looks like it is still happening.

Each time I run it, it is still trying to deploy the runtime I have as a test - so I must have something wrong there. No runtime is deploying, so there's no metadata to pull from the configured S3 bucket?

RichardScottOZ commented 3 months ago

Nothing in the bucket looks like one of those Batch job runtime strings, as per the code.
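
For reference, a quick boto3 sketch for double-checking that (assuming the lithopsdata bucket from the config above and whatever local credentials boto3 picks up):

import boto3

# List objects in the Lithops storage bucket and look for runtime metadata files
s3 = boto3.client('s3', region_name='us-west-2')
resp = s3.list_objects_v2(Bucket='lithopsdata')
meta_keys = [o['Key'] for o in resp.get('Contents', []) if o['Key'].endswith('.meta')]
print(meta_keys or 'no runtime .meta objects found')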

JosepSampe commented 3 months ago

Can you access the AWS dashboard? If so, you will see the job deployed in the batch service, along with a link to the CloudWatch logs where you will likely see the error.

RichardScottOZ commented 3 months ago

Yes - no Job, but the Job definition and Job Queue look OK.

RichardScottOZ commented 3 months ago

I added the keys to test as you suggested above, and got a different error:

2024-05-28 08:14:29,505 [INFO] invokers.py:174 -- ExecutorID c33c2c-1 | JobID M000 - Starting function invocation: process_pdf() - Total: 7328 activations
Traceback (most recent call last):
  File "/home/richard/book-mentat/src/games/lithops_test_multimap.py", line 105, in <module>
    fexec2.map(process_pdf, plist)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/lithops/executors.py", line 276, in map
    futures = self.invoker.run_job(job)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/lithops/invokers.py", line 268, in run_job
    futures = self._run_job(job)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/lithops/invokers.py", line 210, in _run_job
    raise e
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/lithops/invokers.py", line 207, in _run_job
    self._invoke_job(job)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/lithops/invokers.py", line 255, in _invoke_job
    activation_id = self.compute_handler.invoke(payload)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/lithops/serverless/serverless.py", line 70, in invoke
    return self.backend.invoke(runtime_name, runtime_memory, job_payload)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/lithops/serverless/backends/aws_batch/aws_batch.py", line 564, in invoke
    self.batch_client.submit_job(
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/botocore/client.py", line 565, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/botocore/client.py", line 1021, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.ClientException: An error occurred (ClientException) when calling the SubmitJob operation: Error executing request, Exception : Job size must be less than 30 KiB, got 209 KiB, RequestId: 960e7b34-5c19-43dd-a566-9ff4137bc995
JosepSampe commented 3 months ago

The last error is probably because of the size of the payload it is trying to pass for the execution. Can you try running a simple hello world function with: lithops hello -b aws_batch -s aws_s3 -d

EDIT: The iterdata passed to the lithops.map() call looks quite large in your experiment. It tries to execute 7328 functions. Is this ok?
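
If all those activations are intended, one option is to batch the iterdata so each activation processes several items; a rough sketch (process_pdf and plist are the names from your script above, and the batch size is an arbitrary choice):

import lithops

def process_pdf_batch(keys):
    # hypothetical wrapper: run the existing process_pdf over a batch of keys
    return [process_pdf(k) for k in keys]

chunk = 50  # arbitrary batch size
batches = [plist[i:i + chunk] for i in range(0, len(plist), chunk)]

fexec = lithops.FunctionExecutor(backend='aws_batch', storage='aws_s3')
fexec.map(process_pdf_batch, batches)
results = fexec.get_result()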

RichardScottOZ commented 3 months ago

Yes, with config keys in .lithops_config that works

RichardScottOZ commented 3 months ago

Yes, basically running a map across a list of books - and checking in those jobs for those that failed to process because of Lambda timeouts [around 300 or so] - would you handle it differently and submit many separate things?

RichardScottOZ commented 3 months ago

I was used to doing Batch array jobs of 10K each, so didn't consider that.

RichardScottOZ commented 3 months ago

Looks like a no-keys hello world is not completing - so I have something wrong - it's been going for several minutes.

RichardScottOZ commented 3 months ago

OK, now I see Batch jobs - that test failed as per CloudWatch:


2024-05-27T22:56:08.374Z | environ({'HOSTNAME': 'ip-172-31-29-181.us-west-2.compute.internal', 'PYTHON_PIP_VERSION': '23.0.1', 'HOME': '/root', 'GPG_KEY': 'A035C8C19219BA821ECEA86B64E628F8D684696D', 'AWS_EXECUTION_ENV': 'AWS_ECS_FARGATE', 'AWS_BATCH_JOB_ID': '6175f5a5-002b-4451-8127-937a1150ab57', 'ECS_AGENT_URI': 'http://169.254.170.2/api/cde268447a974bd4a429224430fe9520-2470140894', 'AWS_DEFAULT_REGION': 'us-west-2', 'PYTHON_GET_PIP_URL': 'https://github.com/pypa/get-pip/raw/0d8570dc44796f4369b652222cf176b3db6ac70e/public/get-pip.py', 'AWS_BATCH_JQ_NAME': 'lithops_v330_vztq_FARGATE-SPOT_queue', 'ECS_CONTAINER_METADATA_URI_V4': 'http://169.254.170.2/v4/cde268447a974bd4a429224430fe9520-2470140894', 'APP_HOME': '/lithops', 'ECS_CONTAINER_METADATA_URI': 'http://169.254.170.2/v3/cde268447a974bd4a429224430fe9520-2470140894', 'PATH': '/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'LANG': 'C.UTF-8', 'AWS_BATCH_JOB_ATTEMPT': '1', '__LITHOPS_PAYLOAD': '{"config": {"lithops": {"backend": "aws_batch", "storage": "aws_s3", "mode": "serverless", "monitoring": "storage", "execution_timeout": 1800, "chunksize": 1}, "aws_batch": {"region": "us-west-2", "assign_public_ip": true, "execution_role": "arn:aws:iam::958539196701:role/ecsTaskExecutionRoleLithops", "instance_role": "arn:aws:iam::958539196701:role/ecsInstanceRolelithops", "job_role": "arn:aws:iam::958539196701:role/ecsTaskJobRolelithops", "subnets": ["subnet-70515309", "subnet-4ad6fa01", "subnet-4a342910", "subnet-bd069396"], "security_groups": ["sg-0eabf99e212627078"], "runtime": "book-mentat-runtime-batch:01", "runtime_memory": 2048, "worker_processes": 1, "container_vcpus": 1, "service_role": "None", "env_total_cpus": 10, "env_type": "FARGATE_SPOT", "runtime_timeout": 180, "env_max_cpus": 10, "max_workers": 10, "user_agent": "lithops/3.3.0"}, "aws": {"region": "us-west-2"}, "aws_s3": {"storage_bucket": "lithops-data-books", "region": "us-west-2", "user_agent": "lithops/3.3.0"}}, "chunksize": 1, "log_level": 20, "func_name": "hello", "func_key": "lithops.jobs/2c372d-0/7b42ce6f22352d1a3c07c803020bd434.func.pickle", "data_key": "lithops.jobs/2c372d-0-A000/aggdata.pickle", "extra_env": {}, "total_calls": 1, "execution_timeout": 175, "data_byte_ranges": [[0, 31]], "executor_id": "2c372d-0", "job_id": "A000", "job_key": "2c372d-0-A000", "max_workers": 10, "call_ids": ["00000"], "host_submit_tstamp": 1716850517.8832364, "lithops_version": "3.3.0", "runtime_name": "book-mentat-runtime-batch:01", "runtime_memory": 2048, "worker_processes": 1}', 'PYTHON_VERSION': '3.10.12', 'PYTHON_SETUPTOOLS_VERSION': '65.5.1', 'AWS_REGION': 'us-west-2', '__LITHOPS_ACTION': 'run_job', 'PWD': '/lithops', 'PYTHON_GET_PIP_SHA256': '96461deced5c2a487ddc65207ec5a9cffeca0d34e7af7ea1afc470ff0d746207', 'AWS_BATCH_CE_NAME': 'lithops_v330_vztq_FARGATE-SPOT_env'})
2024-05-27T22:56:08.374Z | File "/lithops/entry_point.py", line 69, in <module>
2024-05-27T22:56:08.374Z | function_handler(lithops_payload)
2024-05-27T22:56:08.374Z | File "/lithops/lithops/worker/handler.py", line 71, in function_handler
2024-05-27T22:56:08.375Z | job = create_job(payload)
2024-05-27T22:56:08.375Z | File "/lithops/lithops/worker/handler.py", line 60, in create_job
2024-05-27T22:56:08.375Z | internal_storage = InternalStorage(storage_config)
2024-05-27T22:56:08.375Z | File "/lithops/lithops/storage/storage.py", line 352, in __init__
2024-05-27T22:56:08.375Z | self.storage.create_bucket(self.bucket)
2024-05-27T22:56:08.375Z | File "/lithops/lithops/storage/storage.py", line 98, in create_bucket
2024-05-27T22:56:08.375Z | return self.storage_handler.create_bucket(bucket)
2024-05-27T22:56:08.375Z | File "/lithops/lithops/storage/backends/aws_s3/aws_s3.py", line 84, in create_bucket
2024-05-27T22:56:08.375Z | self.s3_client.head_bucket(Bucket=bucket_name)
2024-05-27T22:56:08.375Z | File "/usr/local/lib/python3.10/site-packages/botocore/client.py", line 565, in _api_call
2024-05-27T22:56:08.375Z | return self._make_api_call(operation_name, kwargs)
2024-05-27T22:56:08.375Z | File "/usr/local/lib/python3.10/site-packages/botocore/client.py", line 1001, in _make_api_call
2024-05-27T22:56:08.376Z | http, parsed_response = self._make_request(
2024-05-27T22:56:08.376Z | File "/usr/local/lib/python3.10/site-packages/botocore/client.py", line 1027, in _make_request
2024-05-27T22:56:08.376Z | return self._endpoint.make_request(operation_model, request_dict)
2024-05-27T22:56:08.376Z | File "/usr/local/lib/python3.10/site-packages/botocore/endpoint.py", line 119, in make_request
2024-05-27T22:56:08.376Z | return self._send_request(request_dict, operation_model)
2024-05-27T22:56:08.376Z | File "/usr/local/lib/python3.10/site-packages/botocore/endpoint.py", line 198, in _send_request
2024-05-27T22:56:08.376Z | request = self.create_request(request_dict, operation_model)
2024-05-27T22:56:08.376Z | File "/usr/local/lib/python3.10/site-packages/botocore/endpoint.py", line 134, in create_request
2024-05-27T22:56:08.376Z | self._event_emitter.emit(
2024-05-27T22:56:08.376Z | File "/usr/local/lib/python3.10/site-packages/botocore/hooks.py", line 412, in emit
2024-05-27T22:56:08.376Z | return self._emitter.emit(aliased_event_name, **kwargs)
2024-05-27T22:56:08.376Z | File "/usr/local/lib/python3.10/site-packages/botocore/hooks.py", line 256, in emit
2024-05-27T22:56:08.376Z | return self._emit(event_name, kwargs)
2024-05-27T22:56:08.376Z | File "/usr/local/lib/python3.10/site-packages/botocore/hooks.py", line 239, in _emit
2024-05-27T22:56:08.376Z | response = handler(**kwargs)
2024-05-27T22:56:08.376Z | File "/usr/local/lib/python3.10/site-packages/botocore/signers.py", line 105, in handler
2024-05-27T22:56:08.377Z | return self.sign(operation_name, request)
2024-05-27T22:56:08.377Z | File "/usr/local/lib/python3.10/site-packages/botocore/signers.py", line 199, in sign
2024-05-27T22:56:08.377Z | auth.add_auth(request)
2024-05-27T22:56:08.377Z | File "/usr/local/lib/python3.10/site-packages/botocore/auth.py", line 418, in add_auth
2024-05-27T22:56:08.377Z | raise NoCredentialsError()
2024-05-27T22:56:08.377Z | botocore.exceptions.NoCredentialsError: Unable to locate credentials

JosepSampe commented 3 months ago

In this case it is a permissions issue; in principle, the patch I added should fix it. Can you make sure the job role ARN (jobRoleArn) in the job definition is correctly set?
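
For example, one way to check it (a boto3 sketch, assuming whatever credentials your local AWS setup provides):

import boto3

batch = boto3.client('batch', region_name='us-west-2')

# Print the job role attached to each active job definition
resp = batch.describe_job_definitions(status='ACTIVE')
for jd in resp['jobDefinitions']:
    props = jd.get('containerProperties', {})
    print(jd['jobDefinitionName'], '->', props.get('jobRoleArn', '<no jobRoleArn set>'))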

RichardScottOZ commented 3 months ago

ok, will check

RichardScottOZ commented 3 months ago

Looks like it's not.

Making a new one with your code addition likely would?

JosepSampe commented 3 months ago

making a new one with your code addition likely would?

Yes, my addition always adds it.

JosepSampe commented 3 months ago

Can you run lithops clean -b aws_batch -s aws_s3 -d? This will clean up all the Lithops job definitions you deployed prior to my patch. Then run again: lithops hello -b aws_batch -s aws_s3 -d

RichardScottOZ commented 3 months ago

Yes, will do - the test might have to wait until I get back this arvo; will see how it goes.

RichardScottOZ commented 3 months ago

The teardown and rebuild worked, thanks [been in transit] - so now presumably just code problems. I will try a test just sending a few through, check that works when I get home, and go from there.

RichardScottOZ commented 3 months ago

Framework worked on that test:

2024-05-28 17:00:40,367 [INFO] wait.py:101 -- ExecutorID cbd67f-1 - Waiting for 10 function activations to complete

  100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10

2024-05-28 17:01:39,690 [INFO] executors.py:610 -- ExecutorID cbd67f-1 - Cleaning temporary data

Seems like magic when that happens, so thank you!