Closed: RichardScottOZ closed 3 months ago
config file like this:

```yaml
lithops:
    backend: aws_batch
    storage: aws_s3

aws_batch:
    region: us-west-2
    assign_public_ip: true
    execution_role: arn:aws:iam::ecsTaskExecutionRoleLithops
    instance_role: arn:aws:iam::ecsInstanceRolelithops
    subnets:
        - subnet-
        - subnet-
        - subnet-
        - subnet-
    security_groups:
        - sg-0
    runtime: book-mentat-runtime-batch:01
    runtime_memory: 2048
    worker_processes: 1
    container_vcpus: 1
    service_role: None
    env_total_cpus: 10
    env_type: FARGATE_SPOT

aws:
    region: us-west-2

aws_s3:
    storage_bucket: lithopsdata
```
with appropriate boring account numbers etc. removed
This error, which I will have to look into more tomorrow:

```
lithops.storage.utils.StorageNoSuchKeyError: No such key /lithopsdata/book-mentat-runtime-batch:01.meta found in storage.
```
I suspect the error is a missing IAM role in the job definition. If you put your access and secret keys in the Lithops config file, it will probably work. For example:

```yaml
aws:
    access_key_id: AJUYHAGAUD5541654AHAL5JI3WU5O
    secret_access_key: YHhL4ffgfgl94UvOvkZasdfasdaVGAlgdasfdsfdsfm08+ujj
    region: us-west-2
```
As suspected, there was a missing role in the job definition. I added a fix for this. Follow the instructions to create the required ecsTaskJobRole role and provide it in the config, so that you don't need to put the AWS credentials in the Lithops config as I suggested before. For example:

```yaml
aws_batch:
    execution_role: arn:aws:iam::691987788901:role/ecsTaskExecutionRole
    instance_role: arn:aws:iam::691987788901:role/ecsInstanceRole
    job_role: arn:aws:iam::691987788901:role/ecsTaskJobRole
    security_groups:
        - sg-6510771e
    subnets:
        - subnet-7b7eb40d
        - subnet-5935797d
        - subnet-d2b7348a
```
Thanks! Will give that a shot this morning.
OK, added that role and the line in aws_batch.py - looks like it is still happening.
Each time I run it, it is still trying to deploy the runtime I have as a test - so I must have something wrong there. No runtime is deploying, so there is no metadata to pull from the configured S3 bucket?
Nothing in the bucket looks like one of those batch job runtime strings as per the code.
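One quick way to confirm whether the runtime metadata object ever landed in the bucket is to list it with boto3. This is a sketch: the bucket and runtime names come from the config above, and the `<runtime>.meta` suffix is inferred from the StorageNoSuchKeyError message, not from Lithops internals.

```python
def find_runtime_meta(keys, runtime):
    """Filter S3 object keys down to ones that look like the runtime
    metadata object Lithops reported missing (ending in '<runtime>.meta')."""
    return [k for k in keys if k.endswith(f"{runtime}.meta")]


def list_runtime_meta(bucket="lithopsdata",
                      runtime="book-mentat-runtime-batch:01",
                      region="us-west-2"):
    """List the storage bucket and look for the runtime metadata key.
    Requires boto3 and working AWS credentials."""
    import boto3
    s3 = boto3.client("s3", region_name=region)
    resp = s3.list_objects_v2(Bucket=bucket)
    keys = [obj["Key"] for obj in resp.get("Contents", [])]
    return find_runtime_meta(keys, runtime)
```

If `list_runtime_meta()` returns an empty list, the runtime deployment never wrote its metadata, which matches the symptom described above.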
Can you access the AWS dashboard? If so, you will see the job deployed in the batch service, along with a link to the CloudWatch logs where you will likely see the error.
Yes - no job, but the Job definition and Job Queue look ok.
I added the keys to test as you suggest above, and got a different error:

```
2024-05-28 08:14:29,505 [INFO] invokers.py:174 -- ExecutorID c33c2c-1 | JobID M000 - Starting function invocation: process_pdf() - Total: 7328 activations
Traceback (most recent call last):
  File "/home/richard/book-mentat/src/games/lithops_test_multimap.py", line 105, in <module>
    fexec2.map(process_pdf, plist)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/lithops/executors.py", line 276, in map
    futures = self.invoker.run_job(job)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/lithops/invokers.py", line 268, in run_job
    futures = self._run_job(job)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/lithops/invokers.py", line 210, in _run_job
    raise e
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/lithops/invokers.py", line 207, in _run_job
    self._invoke_job(job)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/lithops/invokers.py", line 255, in _invoke_job
    activation_id = self.compute_handler.invoke(payload)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/lithops/serverless/serverless.py", line 70, in invoke
    return self.backend.invoke(runtime_name, runtime_memory, job_payload)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/lithops/serverless/backends/aws_batch/aws_batch.py", line 564, in invoke
    self.batch_client.submit_job(
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/botocore/client.py", line 565, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/richard/miniconda3/envs/mentat/lib/python3.10/site-packages/botocore/client.py", line 1021, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.ClientException: An error occurred (ClientException) when calling the SubmitJob operation: Error executing request, Exception : Job size must be less than 30 KiB, got 209 KiB, RequestId: 960e7b34-5c19-43dd-a566-9ff4137bc995
```
The last error is probably because of the size of the payload it is trying to pass for the execution.
Can you try running a simple hello world function with: `lithops hello -b aws_batch -s aws_s3 -d`
EDIT: The iterdata passed to the lithops.map() call looks quite large in your experiment. It tries to execute 7328 functions. Is this ok?
Yes, with config keys in .lithops_config that works
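For reference, the `lithops hello` smoke test above is roughly equivalent to this minimal script. This is a sketch: it assumes Lithops is installed and a working config file is in place, and the exact greeting returned by the built-in test function may differ.

```python
def hello(name):
    """Simple test function to be executed remotely."""
    return f"Hello {name}!"


def run_hello():
    """Submit hello() through the AWS Batch backend and fetch the result.
    Needs a working ~/.lithops_config with the roles discussed above."""
    import lithops
    fexec = lithops.FunctionExecutor(backend="aws_batch", storage="aws_s3")
    fexec.call_async(hello, "World")
    return fexec.get_result()
```

If `run_hello()` completes, the roles, runtime, and storage bucket are all wired up correctly.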
Yes, basically running a map across a list of books - and checking in those jobs for those that failed to process because of Lambda timeouts [around 300 or so] - would you handle it differently and submit many separate things?
I was used to doing batch array jobs of 10K each, so didn't consider that.
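One way to work around per-submission size limits when mapping thousands of items is to split the iterdata into smaller batches and issue several `map()` calls. A sketch, assuming a `lithops.FunctionExecutor` is already constructed; `batch_size` is an arbitrary illustrative choice, not a Lithops parameter:

```python
def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_in_batches(fexec, func, items, batch_size=1000):
    """Submit `items` through several smaller lithops map() calls and
    collect all results. `fexec` is a lithops.FunctionExecutor."""
    results = []
    for batch in chunked(items, batch_size):
        futures = fexec.map(func, batch)
        results.extend(fexec.get_result(fs=futures))
    return results
```

Smaller batches also make it easier to retry only the chunks that hit timeouts, which fits the failed-books reprocessing described above.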
Looks like a no-keys hello world is not completing - so I have something wrong - it's been going for several minutes.
OK, now I see Batch jobs - that test failed as per CloudWatch:
```
2024-05-27T22:56:08 environ({'HOSTNAME': 'ip-172-31-29-181.us-west-2.compute.internal', 'PYTHON_PIP_VERSION': '23.0.1', 'HOME': '/root', 'GPG_KEY': 'A035C8C19219BA821ECEA86B64E628F8D684696D', 'AWS_EXECUTION_ENV': 'AWS_ECS_FARGATE', 'AWS_BATCH_JOB_ID': '6175f5a5-002b-4451-8127-937a1150ab57', 'ECS_AGENT_URI': 'http://169.254.170.2/api/cde268447a974bd4a429224430fe9520-2470140894', 'AWS_DEFAULT_REGION': 'us-west-2', 'PYTHON_GET_PIP_URL': 'https://github.com/pypa/get-pip/raw/0d8570dc44796f4369b652222cf176b3db6ac70e/public/get-pip.py', 'AWS_BATCH_JQ_NAME': 'lithops_v330_vztq_FARGATE-SPOT_queue', 'ECS_CONTAINER_METADATA_URI_V4': 'http://169.254.170.2/v4/cde268447a974bd4a429224430fe9520-2470140894', 'APP_HOME': '/lithops', 'ECS_CONTAINER_METADATA_URI': 'http://169.254.170.2/v3/cde268447a974bd4a429224430fe9520-2470140894', 'PATH': '/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'LANG': 'C.UTF-8', 'AWS_BATCH_JOB_ATTEMPT': '1', '__LITHOPS_PAYLOAD': '{"config": {"lithops": {"backend": "aws_batch", "storage": "aws_s3", "mode": "serverless", "monitoring": "storage", "execution_timeout": 1800, "chunksize": 1}, "aws_batch": {"region": "us-west-2", "assign_public_ip": true, "execution_role": "arn:aws:iam::958539196701:role/ecsTaskExecutionRoleLithops", "instance_role": "arn:aws:iam::958539196701:role/ecsInstanceRolelithops", "job_role": "arn:aws:iam::958539196701:role/ecsTaskJobRolelithops", "subnets": ["subnet-70515309", "subnet-4ad6fa01", "subnet-4a342910", "subnet-bd069396"], "security_groups": ["sg-0eabf99e212627078"], "runtime": "book-mentat-runtime-batch:01", "runtime_memory": 2048, "worker_processes": 1, "container_vcpus": 1, "service_role": "None", "env_total_cpus": 10, "env_type": "FARGATE_SPOT", "runtime_timeout": 180, "env_max_cpus": 10, "max_workers": 10, "user_agent": "lithops/3.3.0"}, "aws": {"region": "us-west-2"}, "aws_s3": {"storage_bucket": "lithops-data-books", "region": "us-west-2", "user_agent": "lithops/3.3.0"}}, "chunksize": 1, "log_level": 20, "func_name": "hello", "func_key": "lithops.jobs/2c372d-0/7b42ce6f22352d1a3c07c803020bd434.func.pickle", "data_key": "lithops.jobs/2c372d-0-A000/aggdata.pickle", "extra_env": {}, "total_calls": 1, "execution_timeout": 175, "data_byte_ranges": [[0, 31]], "executor_id": "2c372d-0", "job_id": "A000", "job_key": "2c372d-0-A000", "max_workers": 10, "call_ids": ["00000"], "host_submit_tstamp": 1716850517.8832364, "lithops_version": "3.3.0", "runtime_name": "book-mentat-runtime-batch:01", "runtime_memory": 2048, "worker_processes": 1}', 'PYTHON_VERSION': '3.10.12', 'PYTHON_SETUPTOOLS_VERSION': '65.5.1', 'AWS_REGION': 'us-west-2', '__LITHOPS_ACTION': 'run_job', 'PWD': '/lithops', 'PYTHON_GET_PIP_SHA256': '96461deced5c2a487ddc65207ec5a9cffeca0d34e7af7ea1afc470ff0d746207', 'AWS_BATCH_CE_NAME': 'lithops_v330_vztq_FARGATE-SPOT_env'})
  File "/lithops/entry_point.py", line 69, in <module>
    function_handler(lithops_payload)
  File "/lithops/lithops/worker/handler.py", line 71, in function_handler
    job = create_job(payload)
  File "/lithops/lithops/worker/handler.py", line 60, in create_job
    internal_storage = InternalStorage(storage_config)
  File "/lithops/lithops/storage/storage.py", line 352, in __init__
    self.storage.create_bucket(self.bucket)
  File "/lithops/lithops/storage/storage.py", line 98, in create_bucket
    return self.storage_handler.create_bucket(bucket)
  File "/lithops/lithops/storage/backends/aws_s3/aws_s3.py", line 84, in create_bucket
    self.s3_client.head_bucket(Bucket=bucket_name)
  File "/usr/local/lib/python3.10/site-packages/botocore/client.py", line 565, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.10/site-packages/botocore/client.py", line 1001, in _make_api_call
    http, parsed_response = self._make_request(
  File "/usr/local/lib/python3.10/site-packages/botocore/client.py", line 1027, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/usr/local/lib/python3.10/site-packages/botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "/usr/local/lib/python3.10/site-packages/botocore/endpoint.py", line 198, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "/usr/local/lib/python3.10/site-packages/botocore/endpoint.py", line 134, in create_request
    self._event_emitter.emit(
  File "/usr/local/lib/python3.10/site-packages/botocore/hooks.py", line 412, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/botocore/hooks.py", line 256, in emit
    return self._emit(event_name, kwargs)
  File "/usr/local/lib/python3.10/site-packages/botocore/hooks.py", line 239, in _emit
    response = handler(**kwargs)
  File "/usr/local/lib/python3.10/site-packages/botocore/signers.py", line 105, in handler
    return self.sign(operation_name, request)
  File "/usr/local/lib/python3.10/site-packages/botocore/signers.py", line 199, in sign
    auth.add_auth(request)
  File "/usr/local/lib/python3.10/site-packages/botocore/auth.py", line 418, in add_auth
    raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials
```
In this case it is a permissions issue; in principle the patch I added should fix it. Can you make sure the job role ARN in the job definition is correctly set?
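One way to check this from the client side is to list the active Batch job definitions with boto3 and inspect their `jobRoleArn`, which lives under `containerProperties` in the DescribeJobDefinitions response. A sketch, assuming boto3 and AWS credentials are available:

```python
def missing_job_role(job_definitions):
    """Return the names of job definitions whose container properties
    have no jobRoleArn set."""
    return [
        jd["jobDefinitionName"]
        for jd in job_definitions
        if not jd.get("containerProperties", {}).get("jobRoleArn")
    ]


def check_job_definitions(region="us-west-2"):
    """Fetch ACTIVE job definitions from AWS Batch and report the ones
    missing a job role. Needs boto3 and working AWS credentials."""
    import boto3
    batch = boto3.client("batch", region_name=region)
    resp = batch.describe_job_definitions(status="ACTIVE")
    return missing_job_role(resp["jobDefinitions"])
```

Any Lithops job definition reported here was registered without a job role and would hit the NoCredentialsError above.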
ok, will check
looks like not
making a new one with your code addition likely would?
yes, my addition should always add it
Can you run `lithops clean -b aws_batch -s aws_s3 -d`?
This will clean up all the Lithops job definitions you deployed prior to my patch.
And then run again: `lithops hello -b aws_batch -s aws_s3 -d`
Yes, will do - the test might have to wait until I get back this arvo; will see how it goes.
The teardown and rebuild worked, thanks [been in transit] - so now presumably just code problems. I will try a test just sending a few through, check that works when I get home, and go from there.
Framework worked on that test:

```
2024-05-28 17:00:40,367 [INFO] wait.py:101 -- ExecutorID cbd67f-1 - Waiting for 10 function activations to complete
100%|██████████████████████████████████████████████████████| 10/10
2024-05-28 17:01:39,690 [INFO] executors.py:610 -- ExecutorID cbd67f-1 - Cleaning temporary data
```
Seems like magic when that happens, so thank you!
Hi - an AWS configuration problem on my end, I would think. However, I was wondering if anyone has any insight here?
First time using Batch with Lithops - Lambda is working fine for the same thing.
Thanks very much.