lithops-cloud / lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
http://lithops.cloud
Apache License 2.0

Problem with Redis get_function_data() in version 3.0.1.dev0 #1162

Closed RaulGracia closed 11 months ago

RaulGracia commented 1 year ago

Description of the problem

Consider the following Lithops job:

from lithops import FunctionExecutor
from lithops import get_result

def double(i):
    return i * 2

if __name__ == '__main__':
    fexec_local = FunctionExecutor(backend='aws_lambda', storage='aws_s3')
    futures_local = fexec_local.map(double, [1, 2, 3, 4])
    fexec_cluster = FunctionExecutor(backend='k8s', storage='redis')
    futures_cluster = fexec_cluster.map(double, [5, 6, 7, 8])
    print(get_result(futures_local + futures_cluster))

As you can see, it simply joins futures from processing tasks executed on two different infrastructures. This was working fine a few weeks back with version 2.9.1.dev0.

However, after upgrading to the latest release (3.0.1.dev0), the job is stuck waiting for tasks to complete, specifically for the K8s tasks that use Redis as storage:

python examples/multi-cloud-computation.py
2023-09-28 16:27:41,060 [INFO] config.py:141 -- Lithops v3.0.1.dev0 - Python3.10
2023-09-28 16:27:41,172 [INFO] aws_s3.py:68 -- S3 client created - Region: us-east-2
2023-09-28 16:27:42,447 [INFO] aws_lambda.py:94 -- AWS Lambda client created - Region: us-east-2
2023-09-28 16:27:42,448 [INFO] invokers.py:107 -- ExecutorID ddb83d-0 | JobID M000 - Selected Runtime: lithops-default-runtime-v310 - 256MB
2023-09-28 16:27:42,691 [INFO] invokers.py:171 -- ExecutorID ddb83d-0 | JobID M000 - Starting function invocation: double() - Total: 2 activations
2023-09-28 16:27:42,694 [INFO] invokers.py:207 -- ExecutorID ddb83d-0 | JobID M000 - View execution logs at /tmp/lithops-raul/logs/ddb83d-0-M000.log
2023-09-28 16:27:42,694 [INFO] config.py:141 -- Lithops v3.0.1.dev0 - Python3.10
2023-09-28 16:27:42,896 [INFO] redis.py:43 -- Redis client created - Host: xxx
2023-09-28 16:27:42,908 [INFO] k8s.py:96 -- Kubernetes client created - Namespace: default
2023-09-28 16:27:42,908 [INFO] invokers.py:107 -- ExecutorID ddb83d-1 | JobID M000 - Selected Runtime: docker.io/jsampe/lithops-kubernetes-default-v310:3.0.1.dev0 - 512MB
2023-09-28 16:27:43,989 [INFO] invokers.py:115 -- Runtime docker.io/jsampe/lithops-kubernetes-default-v310:3.0.1.dev0 with 512MB is not yet deployed
2023-09-28 16:27:43,989 [INFO] k8s.py:224 -- Deploying runtime: docker.io/jsampe/lithops-kubernetes-default-v310:3.0.1.dev0 - Memory: 512 Timeout: 600
2023-09-28 16:27:43,990 [INFO] k8s.py:411 -- Extracting metadata from: docker.io/jsampe/lithops-kubernetes-default-v310:3.0.1.dev0
2023-09-28 16:28:07,563 [INFO] invokers.py:171 -- ExecutorID ddb83d-1 | JobID M000 - Starting function invocation: double() - Total: 2 activations
2023-09-28 16:28:08,820 [INFO] invokers.py:207 -- ExecutorID ddb83d-1 | JobID M000 - View execution logs at /tmp/lithops-raul/logs/ddb83d-1-M000.log
2023-09-28 16:28:08,820 [INFO] wait.py:98 -- ExecutorID ddb83d-0 - Getting results from 4 function activations

  0%|                                        | 0/4
2023-09-28 16:28:08,828 [INFO] aws_s3.py:68 -- S3 client created - Region: us-east-2
2023-09-28 16:28:09,685 [INFO] redis.py:43 -- Redis client created - Host: xxx
 75%|██████████████████████████████▎         | 3/4

Looking at the logs of the functions executed in K8s, I see the following:

2023-09-28 14:51:14,775 [INFO] lithopsentry.py:81 -- Lithops v3.0.1.dev0 - Starting kubernetes execution
2023-09-28 14:51:14,817 [INFO] lithopsentry.py:111 -- Received range: 0-1
2023-09-28 14:51:14,818 [INFO] handler.py:60 -- Tasks received: 1 - Worker processes: 1
2023-09-28 14:51:14,850 [INFO] redis.py:43 -- Redis client created - Host: xxxx
2023-09-28 14:51:14,863 [INFO] handler.py:152 -- Lithops v3.0.1.dev0 - Starting k8s execution
2023-09-28 14:51:14,863 [INFO] handler.py:153 -- Execution ID: 4a18c2-1-M000/00000
2023-09-28 14:51:14,864 [INFO] redis.py:43 -- Redis client created - Host: xxxx
2023-09-28 14:51:14,880 [INFO] jobrunner.py:233 -- Going to execute 'double()'
---------------------- FUNCTION LOG ----------------------
----------------------------------------------------------
2023-09-28 14:51:14,880 [INFO] jobrunner.py:239 -- Success function execution
2023-09-28 14:51:14,882 [INFO] jobrunner.py:311 -- Process finished
2023-09-28 14:51:14,886 [INFO] status.py:88 -- Storing execution stats - Size: 998.0B
2023-09-28 14:51:14,888 [INFO] handler.py:243 -- Finished
2023-09-28 14:51:14,896 [INFO] lithopsentry.py:111 -- Received range: 1-2
2023-09-28 14:51:14,897 [INFO] handler.py:60 -- Tasks received: 0 - Worker processes: 0
2023-09-28 14:51:14,898 [INFO] redis.py:43 -- Redis client created - Host: xxxx
Traceback (most recent call last):
  File "/lithops/lithopsentry.py", line 140, in <module>
    func()
  File "/lithops/lithopsentry.py", line 121, in run_job
    function_handler(payload)
  File "/lithops/lithops/worker/handler.py", line 70, in function_handler
    job_data = get_function_data(job, internal_storage)
  File "/lithops/lithops/worker/utils.py", line 93, in get_function_data
    init_byte = job.data_byte_ranges[0][0]
IndexError: list index out of range

Could the problem be related to how the result data is collected from Redis to be reported back to the master node? Any help is appreciated, thanks!
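For reference, the traceback fires right after the worker logs Tasks received: 0, which suggests that job.data_byte_ranges is empty for that second range. A simplified, hypothetical sketch of that failure mode (not the actual Lithops implementation) would be:

def get_function_data_sketch(data_byte_ranges):
    # When the worker is handed a range with zero tasks (Tasks received: 0),
    # data_byte_ranges is an empty list and the first index access fails.
    if not data_byte_ranges:
        return []                        # a guard like this would avoid the crash
    init_byte = data_byte_ranges[0][0]   # <- the access that raises IndexError
    last_byte = data_byte_ranges[-1][1]
    return [(init_byte, last_byte)]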

JosepSampe commented 1 year ago

Hi @RaulGracia. Could you activate debug logs? Do you have the same problem if you use aws_s3 instead of redis?

if __name__ == '__main__':
    fexec_local = FunctionExecutor(backend='aws_lambda', storage='aws_s3', log_level='DEBUG')
    futures_local = fexec_local.map(double, [1, 2, 3, 4])
    fexec_cluster = FunctionExecutor(backend='k8s', storage='aws_s3', log_level='DEBUG')
    futures_cluster = fexec_cluster.map(double, [5, 6, 7, 8])
    print(get_result(futures_local + futures_cluster))

RaulGracia commented 1 year ago

Hi @JosepSampe, it seems to fail only with K8s + Redis. When I execute the same job but comment out the Kubernetes + Redis part, it works well (S3 only). As you can see, the tasks that fail come from the K8s + Redis cluster (ExecutorID 3e2651-1):

2023-09-28 17:17:05,661 [INFO] config.py:141 -- Lithops v3.0.1.dev0 - Python3.10
2023-09-28 17:17:05,661 [DEBUG] config.py:103 -- Loading configuration from /home/raul/.lithops/config
2023-09-28 17:17:05,672 [DEBUG] config.py:193 -- Loading Serverless backend module: aws_lambda
2023-09-28 17:17:05,717 [DEBUG] config.py:235 -- Loading Storage backend module: aws_s3
2023-09-28 17:17:05,718 [DEBUG] aws_s3.py:37 -- Creating S3 client
2023-09-28 17:17:05,782 [INFO] aws_s3.py:68 -- S3 client created - Region: us-east-2
2023-09-28 17:17:07,124 [DEBUG] aws_lambda.py:52 -- Creating AWS Lambda client
2023-09-28 17:17:07,125 [DEBUG] aws_lambda.py:65 -- Creating Boto3 AWS Session and Lambda Client
2023-09-28 17:17:07,822 [INFO] aws_lambda.py:94 -- AWS Lambda client created - Region: us-east-2
2023-09-28 17:17:07,824 [DEBUG] invokers.py:93 -- ExecutorID 3e2651-0 - Invoker initialized. Max workers: 1000
2023-09-28 17:17:07,824 [DEBUG] invokers.py:292 -- ExecutorID 3e2651-0 - Serverless invoker created
2023-09-28 17:17:07,824 [DEBUG] executors.py:145 -- Function executor for aws_lambda created with ID: 3e2651-0
2023-09-28 17:17:07,824 [INFO] invokers.py:107 -- ExecutorID 3e2651-0 | JobID M000 - Selected Runtime: lithops-default-runtime-v310 - 256MB
2023-09-28 17:17:07,824 [DEBUG] storage.py:431 -- Runtime metadata found in local disk cache
2023-09-28 17:17:07,824 [DEBUG] job.py:234 -- ExecutorID 3e2651-0 | JobID M000 - Serializing function and data
2023-09-28 17:17:07,825 [DEBUG] serialize.py:84 -- Referenced modules: None
2023-09-28 17:17:07,825 [DEBUG] serialize.py:101 -- Modules to transmit: None
2023-09-28 17:17:07,825 [DEBUG] job.py:271 -- ExecutorID 3e2651-0 | JobID M000 - Uploading function and modules to the storage backend
2023-09-28 17:17:08,095 [DEBUG] aws_s3.py:104 -- PUT Object lithops.jobs/3e2651-0/d83fdacba0a250bba12265fecbcdda4d.func.pickle - Size: 585.0B - OK
2023-09-28 17:17:08,095 [DEBUG] job.py:311 -- ExecutorID 3e2651-0 | JobID M000 - Data per activation is < 8.0KiB. Passing data through invocation payload
2023-09-28 17:17:08,095 [INFO] invokers.py:171 -- ExecutorID 3e2651-0 | JobID M000 - Starting function invocation: double() - Total: 4 activations
2023-09-28 17:17:08,096 [DEBUG] invokers.py:176 -- ExecutorID 3e2651-0 | JobID M000 - Worker processes: 1 - Chunksize: 1
2023-09-28 17:17:08,096 [DEBUG] invokers.py:301 -- ExecutorID 3e2651-0 - Async invoker 0 started
2023-09-28 17:17:08,096 [DEBUG] invokers.py:301 -- ExecutorID 3e2651-0 - Async invoker 1 started
2023-09-28 17:17:08,096 [DEBUG] invokers.py:430 -- ExecutorID 3e2651-0 | JobID M000 - Free workers: 1000 - Going to run 4 activations in 4 workers
2023-09-28 17:17:08,100 [INFO] invokers.py:207 -- ExecutorID 3e2651-0 | JobID M000 - View execution logs at /tmp/lithops-raul/logs/3e2651-0-M000.log
2023-09-28 17:17:08,101 [DEBUG] monitor.py:381 -- ExecutorID 3e2651-0 - Starting Storage job monitor
2023-09-28 17:17:08,103 [INFO] config.py:141 -- Lithops v3.0.1.dev0 - Python3.10
2023-09-28 17:17:08,104 [DEBUG] config.py:103 -- Loading configuration from /home/raul/.lithops/config
2023-09-28 17:17:08,107 [DEBUG] config.py:193 -- Loading Serverless backend module: k8s
2023-09-28 17:17:08,299 [DEBUG] config.py:235 -- Loading Storage backend module: redis
2023-09-28 17:17:08,349 [DEBUG] redis.py:32 -- Creating Redis storage client
2023-09-28 17:17:08,350 [INFO] redis.py:43 -- Redis client created - Host: xxxx
2023-09-28 17:17:08,350 [DEBUG] k8s.py:49 -- Creating Kubernetes client
2023-09-28 17:17:08,350 [DEBUG] k8s.py:64 -- Loading kubeconfig file: /home/raul/.kube/config
2023-09-28 17:17:08,375 [DEBUG] k8s.py:74 -- Using kubeconfig conetxt: lady-ursula - cluster: lady-ursula
2023-09-28 17:17:08,378 [INFO] k8s.py:96 -- Kubernetes client created - Namespace: default
2023-09-28 17:17:08,379 [DEBUG] invokers.py:93 -- ExecutorID 3e2651-1 - Invoker initialized. Max workers: 100
2023-09-28 17:17:08,379 [DEBUG] executors.py:145 -- Function executor for k8s created with ID: 3e2651-1
2023-09-28 17:17:08,379 [INFO] invokers.py:107 -- ExecutorID 3e2651-1 | JobID M000 - Selected Runtime: docker.io/jsampe/lithops-kubernetes-default-v310:3.0.1.dev0 - 512MB
2023-09-28 17:17:08,380 [DEBUG] storage.py:431 -- Runtime metadata found in local disk cache
2023-09-28 17:17:08,381 [DEBUG] job.py:234 -- ExecutorID 3e2651-1 | JobID M000 - Serializing function and data
2023-09-28 17:17:08,383 [DEBUG] serialize.py:84 -- Referenced modules: None
2023-09-28 17:17:08,384 [DEBUG] serialize.py:101 -- Modules to transmit: None
2023-09-28 17:17:08,384 [DEBUG] job.py:271 -- ExecutorID 3e2651-1 | JobID M000 - Uploading function and modules to the storage backend
2023-09-28 17:17:09,186 [DEBUG] invokers.py:376 -- ExecutorID 3e2651-0 | JobID M000 - Calls 00000 invoked (1.089s) - Activation ID: a3fdc592-33b9-4733-9870-d47267c5216d
2023-09-28 17:17:09,241 [DEBUG] invokers.py:376 -- ExecutorID 3e2651-0 | JobID M000 - Calls 00002 invoked (1.142s) - Activation ID: 2b14f057-56eb-4203-8708-f69da61e3467
2023-09-28 17:17:09,292 [DEBUG] invokers.py:376 -- ExecutorID 3e2651-0 | JobID M000 - Calls 00001 invoked (1.194s) - Activation ID: c305c7bb-0d59-4af6-8fae-2f411287ce6b
2023-09-28 17:17:09,316 [DEBUG] invokers.py:376 -- ExecutorID 3e2651-0 | JobID M000 - Calls 00003 invoked (1.217s) - Activation ID: 0b30b233-18da-47d9-8285-af9d6f2658d6
2023-09-28 17:17:09,403 [DEBUG] job.py:297 -- ExecutorID 3e2651-1 | JobID M000 - Uploading data to the storage backend
2023-09-28 17:17:09,607 [INFO] invokers.py:171 -- ExecutorID 3e2651-1 | JobID M000 - Starting function invocation: double() - Total: 4 activations
2023-09-28 17:17:09,607 [DEBUG] invokers.py:176 -- ExecutorID 3e2651-1 | JobID M000 - Worker processes: 1 - Chunksize: 1
2023-09-28 17:17:10,466 [DEBUG] k8s.py:304 -- Starting Lithops master Pod
2023-09-28 17:17:11,308 [DEBUG] monitor.py:127 -- ExecutorID 3e2651-0 - Pending: 0 - Running: 0 - Done: 4
2023-09-28 17:17:11,308 [DEBUG] monitor.py:409 -- ExecutorID 3e2651-0 - Storage job monitor finished
2023-09-28 17:17:11,710 [DEBUG] k8s.py:339 -- Waiting Lithops master pod to be ready
2023-09-28 17:17:15,422 [DEBUG] k8s.py:391 -- ExecutorID 3e2651-1 | JobID M000 - Going to run 4 activations in 4 workers
2023-09-28 17:17:15,841 [DEBUG] invokers.py:252 -- ExecutorID 3e2651-1 | JobID M000 - Job invoked (6.233s) - Activation ID: lithops-3e2651-1-m000
2023-09-28 17:17:15,841 [INFO] invokers.py:207 -- ExecutorID 3e2651-1 | JobID M000 - View execution logs at /tmp/lithops-raul/logs/3e2651-1-M000.log
2023-09-28 17:17:15,841 [DEBUG] monitor.py:381 -- ExecutorID 3e2651-1 - Starting Storage job monitor
2023-09-28 17:17:15,841 [INFO] wait.py:98 -- ExecutorID 3e2651-0 - Getting results from 8 function activations
2023-09-28 17:17:15,841 [DEBUG] redis.py:32 -- Creating Redis storage client
2023-09-28 17:17:15,841 [INFO] redis.py:43 -- Redis client created - Host: xxxx
2023-09-28 17:17:15,841 [DEBUG] aws_s3.py:37 -- Creating S3 client
2023-09-28 17:17:15,845 [INFO] aws_s3.py:68 -- S3 client created - Region: us-east-2
2023-09-28 17:17:16,535 [DEBUG] monitor.py:381 -- ExecutorID 3e2651-1 - Starting Storage job monitor
2023-09-28 17:17:16,535 [DEBUG] monitor.py:381 -- ExecutorID 3e2651-0 - Starting Storage job monitor
2023-09-28 17:17:16,537 [DEBUG] monitor.py:409 -- ExecutorID 3e2651-0 - Storage job monitor finished
2023-09-28 17:17:16,537 [DEBUG] future.py:270 -- ExecutorID 3e2651-0 | JobID M000 - Got status from call 00000 - Activation ID: a3fdc592-33b9-4733-9870-d47267c5216d - Time: 0.30 seconds
2023-09-28 17:17:16,537 [DEBUG] future.py:290 -- ExecutorID 3e2651-0 | JobID M000 - Got output from call 00000 - Activation ID: a3fdc592-33b9-4733-9870-d47267c5216d
2023-09-28 17:17:16,537 [DEBUG] future.py:270 -- ExecutorID 3e2651-0 | JobID M000 - Got status from call 00001 - Activation ID: c305c7bb-0d59-4af6-8fae-2f411287ce6b - Time: 0.28 seconds
2023-09-28 17:17:16,537 [DEBUG] future.py:290 -- ExecutorID 3e2651-0 | JobID M000 - Got output from call 00001 - Activation ID: c305c7bb-0d59-4af6-8fae-2f411287ce6b
2023-09-28 17:17:16,538 [DEBUG] future.py:270 -- ExecutorID 3e2651-0 | JobID M000 - Got status from call 00002 - Activation ID: 2b14f057-56eb-4203-8708-f69da61e3467 - Time: 0.24 seconds
2023-09-28 17:17:16,538 [DEBUG] future.py:290 -- ExecutorID 3e2651-0 | JobID M000 - Got output from call 00002 - Activation ID: 2b14f057-56eb-4203-8708-f69da61e3467
2023-09-28 17:17:16,538 [DEBUG] future.py:270 -- ExecutorID 3e2651-0 | JobID M000 - Got status from call 00003 - Activation ID: 0b30b233-18da-47d9-8285-af9d6f2658d6 - Time: 0.28 seconds
2023-09-28 17:17:16,538 [DEBUG] future.py:290 -- ExecutorID 3e2651-0 | JobID M000 - Got output from call 00003 - Activation ID: 0b30b233-18da-47d9-8285-af9d6f2658d6
2023-09-28 17:17:19,491 [DEBUG] monitor.py:127 -- ExecutorID 3e2651-1 - Pending: 4 - Running: 0 - Done: 0
2023-09-28 17:17:21,029 [DEBUG] monitor.py:127 -- ExecutorID 3e2651-1 - Pending: 4 - Running: 0 - Done: 0
2023-09-28 17:17:22,723 [DEBUG] monitor.py:127 -- ExecutorID 3e2651-1 - Pending: 2 - Running: 0 - Done: 2
2023-09-28 17:17:23,550 [DEBUG] future.py:270 -- ExecutorID 3e2651-1 | JobID M000 - Got status from call 00000 - Activation ID: 54f1011983a0 - Time: 0.02 seconds
2023-09-28 17:17:23,550 [DEBUG] future.py:290 -- ExecutorID 3e2651-1 | JobID M000 - Got output from call 00000 - Activation ID: 54f1011983a0
2023-09-28 17:17:23,551 [DEBUG] future.py:270 -- ExecutorID 3e2651-1 | JobID M000 - Got status from call 00002 - Activation ID: 59f09acbfcf0 - Time: 0.02 seconds
2023-09-28 17:17:23,551 [DEBUG] future.py:290 -- ExecutorID 3e2651-1 | JobID M000 - Got output from call 00002 - Activation ID: 59f09acbfcf0
2023-09-28 17:17:24,280 [DEBUG] monitor.py:127 -- ExecutorID 3e2651-1 - Pending: 2 - Running: 0 - Done: 2
2023-09-28 17:17:52,233 [DEBUG] monitor.py:127 -- ExecutorID 3e2651-1 - Pending: 2 - Running: 0 - Done: 2
2023-09-28 17:18:12,917 [DEBUG] monitor.py:127 -- ExecutorID 3e2651-1 - Pending: 2 - Running: 0 - Done: 2
2023-09-28 17:18:20,178 [DEBUG] monitor.py:127 -- ExecutorID 3e2651-1 - Pending: 2 - Running: 0 - Done: 2
2023-09-28 17:18:48,165 [DEBUG] monitor.py:127 -- ExecutorID 3e2651-1 - Pending: 2 - Running: 0 - Done: 2
2023-09-28 17:19:04,587 [DEBUG] monitor.py:127 -- ExecutorID 3e2651-1 - Pending: 2 - Running: 0 - Done: 2
2023-09-28 17:19:16,070 [DEBUG] monitor.py:127 -- ExecutorID 3e2651-1 - Pending: 2 - Running: 0 - Done: 2
2023-09-28 17:19:44,060 [DEBUG] monitor.py:127 -- ExecutorID 3e2651-1 - Pending: 2 - Running: 0 - Done: 2
2023-09-28 17:19:56,278 [DEBUG] monitor.py:127 -- ExecutorID 3e2651-1 - Pending: 2 - Running: 0 - Done: 2
^C2023-09-28 17:20:10,495 [INFO] wait.py:156 -- Cancelled - Total Activations not done: 2
Traceback (most recent call last):
  File "/home/raul/Documents/workspace/lithops-sdp-integration/lithops/examples/multi-cloud-computation.py", line 14, in <module>
    print(get_result(futures_aws + futures_k8s))
  File "/home/raul/Documents/workspace/lithops-sdp-integration/lithops-venv/lib/python3.10/site-packages/lithops-3.0.1.dev0-py3.10.egg/lithops/wait.py", line 203, in get_result
    fs_done, _ = wait(fs=fs, throw_except=throw_except,
  File "/home/raul/Documents/workspace/lithops-sdp-integration/lithops-venv/lib/python3.10/site-packages/lithops-3.0.1.dev0-py3.10.egg/lithops/wait.py", line 157, in wait
    raise e
  File "/home/raul/Documents/workspace/lithops-sdp-integration/lithops-venv/lib/python3.10/site-packages/lithops-3.0.1.dev0-py3.10.egg/lithops/wait.py", line 145, in wait
    time.sleep(0 if new_data else sleep_sec)
KeyboardInterrupt
2023-09-28 17:20:10,496 [INFO] executors.py:593 -- ExecutorID 3e2651-1 - Cleaning temporary data
2023-09-28 17:20:10,499 [INFO] executors.py:593 -- ExecutorID 3e2651-0 - Cleaning temporary data

JosepSampe commented 1 year ago

And k8s + s3 works well?

RaulGracia commented 1 year ago

Yes, it does:

2023-09-28 17:36:04,202 [INFO] config.py:141 -- Lithops v3.0.1.dev0 - Python3.10
2023-09-28 17:36:04,202 [DEBUG] config.py:103 -- Loading configuration from /home/raul/.lithops/config
2023-09-28 17:36:04,213 [DEBUG] config.py:193 -- Loading Serverless backend module: k8s
2023-09-28 17:36:04,382 [DEBUG] config.py:235 -- Loading Storage backend module: aws_s3
2023-09-28 17:36:04,430 [DEBUG] aws_s3.py:37 -- Creating S3 client
2023-09-28 17:36:04,489 [INFO] aws_s3.py:68 -- S3 client created - Region: us-east-2
2023-09-28 17:36:05,414 [DEBUG] k8s.py:49 -- Creating Kubernetes client
2023-09-28 17:36:05,415 [DEBUG] k8s.py:64 -- Loading kubeconfig file: /home/raul/.kube/config
2023-09-28 17:36:05,437 [DEBUG] k8s.py:74 -- Using kubeconfig conetxt: lady-ursula - cluster: lady-ursula
2023-09-28 17:36:05,437 [INFO] k8s.py:96 -- Kubernetes client created - Namespace: default
2023-09-28 17:36:05,438 [DEBUG] invokers.py:93 -- ExecutorID df3ce4-0 - Invoker initialized. Max workers: 100
2023-09-28 17:36:05,438 [DEBUG] executors.py:145 -- Function executor for k8s created with ID: df3ce4-0
2023-09-28 17:36:05,438 [INFO] invokers.py:107 -- ExecutorID df3ce4-0 | JobID M000 - Selected Runtime: docker.io/jsampe/lithops-kubernetes-default-v310:3.0.1.dev0 - 512MB
2023-09-28 17:36:05,438 [DEBUG] storage.py:438 -- Runtime metadata not found in local cache. Retrieving it from storage
2023-09-28 17:36:05,438 [DEBUG] storage.py:441 -- Trying to download runtime metadata from: aws_s3://lithops-serverless-data/lithops.runtimes/k8s/3.0.1.dev0/lady-ursula/default/70b09a/lithops-worker-301dev0-7d60cfbee4.meta.json
2023-09-28 17:36:05,561 [DEBUG] storage.py:460 -- Runtime metadata not found in storage
2023-09-28 17:36:05,561 [INFO] invokers.py:115 -- Runtime docker.io/jsampe/lithops-kubernetes-default-v310:3.0.1.dev0 with 512MB is not yet deployed
2023-09-28 17:36:05,561 [INFO] k8s.py:224 -- Deploying runtime: docker.io/jsampe/lithops-kubernetes-default-v310:3.0.1.dev0 - Memory: 512 Timeout: 600
2023-09-28 17:36:05,561 [INFO] k8s.py:411 -- Extracting metadata from: docker.io/jsampe/lithops-kubernetes-default-v310:3.0.1.dev0
2023-09-28 17:36:06,634 [DEBUG] k8s.py:449 -- Waiting for runtime metadata
2023-09-28 17:36:07,050 [DEBUG] k8s.py:461 -- ...
2023-09-28 17:36:10,814 [DEBUG] k8s.py:461 -- ...
2023-09-28 17:36:17,463 [DEBUG] k8s.py:461 -- ...
2023-09-28 17:36:17,825 [DEBUG] k8s.py:461 -- ...
2023-09-28 17:36:19,833 [DEBUG] k8s.py:461 -- ...
2023-09-28 17:36:19,874 [DEBUG] k8s.py:461 -- ...
2023-09-28 17:36:27,461 [DEBUG] k8s.py:467 -- Runtime metadata generated successfully
2023-09-28 17:36:28,131 [DEBUG] storage.py:471 -- Uploading runtime metadata to: aws_s3://lithops-serverless-data/lithops.runtimes/k8s/3.0.1.dev0/lady-ursula/default/70b09a/lithops-worker-301dev0-7d60cfbee4.meta.json
2023-09-28 17:36:28,485 [DEBUG] aws_s3.py:104 -- PUT Object lithops.runtimes/k8s/3.0.1.dev0/lady-ursula/default/70b09a/lithops-worker-301dev0-7d60cfbee4.meta.json - Size: 6.6KiB - OK
2023-09-28 17:36:28,485 [DEBUG] storage.py:477 -- Storing runtime metadata into local cache: /home/raul/.lithops/cache/lithops.runtimes/k8s/3.0.1.dev0/lady-ursula/default/70b09a/lithops-worker-301dev0-7d60cfbee4.meta.json
2023-09-28 17:36:28,486 [DEBUG] job.py:234 -- ExecutorID df3ce4-0 | JobID M000 - Serializing function and data
2023-09-28 17:36:28,486 [DEBUG] serialize.py:84 -- Referenced modules: None
2023-09-28 17:36:28,486 [DEBUG] serialize.py:101 -- Modules to transmit: None
2023-09-28 17:36:28,486 [DEBUG] job.py:271 -- ExecutorID df3ce4-0 | JobID M000 - Uploading function and modules to the storage backend
2023-09-28 17:36:28,728 [DEBUG] aws_s3.py:104 -- PUT Object lithops.jobs/df3ce4-0/d83fdacba0a250bba12265fecbcdda4d.func.pickle - Size: 585.0B - OK
2023-09-28 17:36:28,728 [DEBUG] job.py:297 -- ExecutorID df3ce4-0 | JobID M000 - Uploading data to the storage backend
2023-09-28 17:36:28,965 [DEBUG] aws_s3.py:104 -- PUT Object lithops.jobs/df3ce4-0-M000/aggdata.pickle - Size: 84.0B - OK
2023-09-28 17:36:28,965 [INFO] invokers.py:171 -- ExecutorID df3ce4-0 | JobID M000 - Starting function invocation: double() - Total: 4 activations
2023-09-28 17:36:28,966 [DEBUG] invokers.py:176 -- ExecutorID df3ce4-0 | JobID M000 - Worker processes: 1 - Chunksize: 1
2023-09-28 17:36:29,804 [DEBUG] k8s.py:304 -- Starting Lithops master Pod
2023-09-28 17:36:30,644 [DEBUG] k8s.py:339 -- Waiting Lithops master pod to be ready
2023-09-28 17:36:33,356 [DEBUG] k8s.py:391 -- ExecutorID df3ce4-0 | JobID M000 - Going to run 4 activations in 4 workers
2023-09-28 17:36:33,777 [DEBUG] invokers.py:252 -- ExecutorID df3ce4-0 | JobID M000 - Job invoked (4.812s) - Activation ID: lithops-df3ce4-0-m000
2023-09-28 17:36:33,777 [INFO] invokers.py:207 -- ExecutorID df3ce4-0 | JobID M000 - View execution logs at /tmp/lithops-raul/logs/df3ce4-0-M000.log
2023-09-28 17:36:33,778 [DEBUG] monitor.py:381 -- ExecutorID df3ce4-0 - Starting Storage job monitor
2023-09-28 17:36:33,778 [INFO] wait.py:98 -- ExecutorID df3ce4-0 - Getting results from 4 function activations
2023-09-28 17:36:33,778 [DEBUG] aws_s3.py:37 -- Creating S3 client
2023-09-28 17:36:33,781 [INFO] aws_s3.py:68 -- S3 client created - Region: us-east-2
2023-09-28 17:36:34,479 [DEBUG] monitor.py:381 -- ExecutorID df3ce4-0 - Starting Storage job monitor
2023-09-28 17:36:37,039 [DEBUG] monitor.py:127 -- ExecutorID df3ce4-0 - Pending: 4 - Running: 0 - Done: 0
2023-09-28 17:36:37,589 [DEBUG] monitor.py:127 -- ExecutorID df3ce4-0 - Pending: 4 - Running: 0 - Done: 0
2023-09-28 17:36:39,308 [DEBUG] monitor.py:127 -- ExecutorID df3ce4-0 - Pending: 0 - Running: 0 - Done: 4
2023-09-28 17:36:39,308 [DEBUG] monitor.py:409 -- ExecutorID df3ce4-0 - Storage job monitor finished
2023-09-28 17:36:39,487 [DEBUG] future.py:270 -- ExecutorID df3ce4-0 | JobID M000 - Got status from call 00000 - Activation ID: aadd0f32b843 - Time: 0.21 seconds
2023-09-28 17:36:39,487 [DEBUG] future.py:290 -- ExecutorID df3ce4-0 | JobID M000 - Got output from call 00000 - Activation ID: aadd0f32b843
2023-09-28 17:36:39,488 [DEBUG] future.py:270 -- ExecutorID df3ce4-0 | JobID M000 - Got status from call 00003 - Activation ID: f3808a97eace - Time: 0.23 seconds
2023-09-28 17:36:39,488 [DEBUG] future.py:290 -- ExecutorID df3ce4-0 | JobID M000 - Got output from call 00003 - Activation ID: f3808a97eace
2023-09-28 17:36:39,488 [DEBUG] future.py:270 -- ExecutorID df3ce4-0 | JobID M000 - Got status from call 00002 - Activation ID: c37c2ea6ac3f - Time: 0.23 seconds
2023-09-28 17:36:39,488 [DEBUG] future.py:270 -- ExecutorID df3ce4-0 | JobID M000 - Got status from call 00001 - Activation ID: b472ba19a218 - Time: 0.21 seconds
2023-09-28 17:36:39,488 [DEBUG] future.py:290 -- ExecutorID df3ce4-0 | JobID M000 - Got output from call 00002 - Activation ID: c37c2ea6ac3f
2023-09-28 17:36:39,488 [DEBUG] future.py:290 -- ExecutorID df3ce4-0 | JobID M000 - Got output from call 00001 - Activation ID: b472ba19a218
2023-09-28 17:36:39,489 [DEBUG] wait.py:214 -- ExecutorID df3ce4-0 - Finished getting results
[10, 12, 14, 16]
2023-09-28 17:36:39,491 [INFO] executors.py:593 -- ExecutorID df3ce4-0 - Cleaning temporary data

Also, jobs in K8s + Redis that work with strings seem to run fine, so I wonder if the problem is specific to managing lists of numbers in Redis.
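For completeness, the comparison I have in mind is roughly the following (a sketch, not my exact jobs): the string map completes, while the numeric one gets stuck with the IndexError above.

from lithops import FunctionExecutor

def tag(s):
    return s + '-done'

def double(i):
    return i * 2

if __name__ == '__main__':
    fexec = FunctionExecutor(backend='k8s', storage='redis')
    # Mapping over strings seems to work fine with K8s + Redis...
    print(fexec.get_result(fs=fexec.map(tag, ['a', 'b', 'c', 'd'])))
    # ...while mapping over a list of numbers hangs as described above.
    print(fexec.get_result(fs=fexec.map(double, [5, 6, 7, 8])))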

JosepSampe commented 1 year ago

I've now been able to deploy a Redis server, so I tested k8s + redis and it works fine in my case; I'm not able to reproduce the issue.

Could you provide the logs of the functions executed in K8s, as before, but with DEBUG activated as you have it now?

JosepSampe commented 1 year ago

After the successful execution of k8s+s3, can you try again with k8s+redis? I have the feeling that the container was not updated when you initially ran k8s+redis (and produced the error). Now I can see in the latest logs that the container is updated, so I think k8s+redis should work.

JosepSampe commented 1 year ago

One important thing to know when using the master branch or doing Lithops development: it is convenient to run the lithops clean command in the console before an execution whenever the local code changes (because of a git pull or because you made changes yourself).

This command removes all the temporary data and cloud resources, so that on the next execution everything is redeployed, making sure that all the new changes on your local machine are also deployed in the runtime.

You can run, for example: lithops clean -b k8s -s redis and lithops clean -b aws_lambda -s aws_s3

RaulGracia commented 1 year ago

Hi @JosepSampe,

I cleaned up both environments (K8s and AWS) as you suggested. I have even built a new runtime in my own Docker Hub account. After all this, I still see the same issue when running the above job with k8s + Redis only. This is the output of the failed task running as a Kubernetes pod:

2023-09-29 09:33:18,096 [INFO] lithopsentry.py:81 -- Lithops v3.0.1.dev0 - Starting kubernetes execution
2023-09-29 09:33:18,136 [INFO] lithopsentry.py:111 -- Received range: 0-1
2023-09-29 09:33:18,137 [INFO] handler.py:60 -- Tasks received: 1 - Worker processes: 1
2023-09-29 09:33:18,170 [DEBUG] redis.py:32 -- Creating Redis storage client
2023-09-29 09:33:18,170 [INFO] redis.py:43 -- Redis client created - Host: xxx
2023-09-29 09:33:18,170 [DEBUG] utils.py:43 -- Getting function and modules
2023-09-29 09:33:18,182 [DEBUG] utils.py:88 -- Getting function data
2023-09-29 09:33:18,185 [INFO] handler.py:152 -- Lithops v3.0.1.dev0 - Starting k8s execution
2023-09-29 09:33:18,185 [INFO] handler.py:153 -- Execution ID: 682635-1-M000/00000
2023-09-29 09:33:18,185 [DEBUG] redis.py:32 -- Creating Redis storage client
2023-09-29 09:33:18,185 [INFO] redis.py:43 -- Redis client created - Host: xxx
2023-09-29 09:33:18,185 [DEBUG] handler.py:165 -- Runtime: docker.io/raulgracia/lithops-kubernetes-default-v310:3.0.1.dev0 - Memory: 512MB - Timeout: 595 seconds
2023-09-29 09:33:18,194 [DEBUG] handler.py:178 -- Starting JobRunner process
2023-09-29 09:33:18,200 [DEBUG] jobrunner.py:203 -- Process started
2023-09-29 09:33:18,202 [INFO] jobrunner.py:233 -- Going to execute 'double()'
---------------------- FUNCTION LOG ----------------------
Executing double() function
----------------------------------------------------------
2023-09-29 09:33:18,202 [INFO] jobrunner.py:239 -- Success function execution
2023-09-29 09:33:18,203 [DEBUG] jobrunner.py:253 -- Pickling result
2023-09-29 09:33:18,204 [INFO] jobrunner.py:311 -- Process finished
2023-09-29 09:33:18,207 [DEBUG] handler.py:182 -- JobRunner process finished
2023-09-29 09:33:18,208 [INFO] status.py:88 -- Storing execution stats - Size: 1.2KiB
2023-09-29 09:33:18,215 [INFO] handler.py:243 -- Finished
2023-09-29 09:33:18,222 [INFO] lithopsentry.py:111 -- Received range: 1-2
2023-09-29 09:33:18,223 [INFO] handler.py:60 -- Tasks received: 0 - Worker processes: 0
2023-09-29 09:33:18,223 [DEBUG] redis.py:32 -- Creating Redis storage client
2023-09-29 09:33:18,223 [INFO] redis.py:43 -- Redis client created - Host: xxx
2023-09-29 09:33:18,224 [DEBUG] utils.py:43 -- Getting function and modules
2023-09-29 09:33:18,230 [DEBUG] utils.py:88 -- Getting function data
Traceback (most recent call last):
  File "/lithops/lithopsentry.py", line 140, in <module>
    func()
  File "/lithops/lithopsentry.py", line 121, in run_job
    function_handler(payload)
  File "/lithops/lithops/worker/handler.py", line 70, in function_handler
    job_data = get_function_data(job, internal_storage)
  File "/lithops/lithops/worker/utils.py", line 93, in get_function_data
    init_byte = job.data_byte_ranges[0][0]
IndexError: list index out of range

It is the very same issue as yesterday, so nothing really changed. The fact that the K8s pod can do the computation when consuming data from S3 suggests that the problem is related to how the input lists of numbers are managed in Redis. Just FYI, in my test cluster I'm using Redis 7.2.1 (following these instructions).

According to the above log, the function double() actually gets executed; I added a new log line printing Executing double() function, and you can see it in the logs. Therefore, the function execution happens and completes, but the problem seems to occur when getting the resulting data from Redis back to the user (or something along those lines).
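For clarity, the instrumented function is just the original with a print statement added, along these lines:

def double(i):
    print('Executing double() function')  # shows up in the FUNCTION LOG section above
    return i * 2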

At the moment I don't see this issue as critical for the experiments I want to run, but it looks like a regression that may impact other jobs using Redis as well. If I find anything else I will share it here.

JosepSampe commented 1 year ago

I think this PR can solve your issue: #1165

Can you update your master branch, clean up, and try again?

RaulGracia commented 1 year ago

Thanks @JosepSampe,

I have pulled the latest changes from master, installed Lithops master in my virtual environment, and cleaned up the k8s environment. I don't know if I also needed to re-create the execution environment, but you did not mention that.

The result is the same:

2023-09-29 10:15:47,451 [DEBUG] utils.py:43 -- Getting function and modules
2023-09-29 10:15:47,456 [DEBUG] utils.py:88 -- Getting function data
Traceback (most recent call last):
  File "/lithops/lithopsentry.py", line 140, in <module>
    func()
  File "/lithops/lithopsentry.py", line 121, in run_job
    function_handler(payload)
  File "/lithops/lithops/worker/handler.py", line 70, in function_handler
    job_data = get_function_data(job, internal_storage)
  File "/lithops/lithops/worker/utils.py", line 93, in get_function_data
    init_byte = job.data_byte_ranges[0][0]
IndexError: list index out of range

I think it would be better to try to come up with a test that reproduces the issue, if possible (using Redis may add complexity to the test, though). Also, I noticed that debugging the file where the issue happens, /lithops/lithops/worker/utils.py, may require updating the execution environment (i.e., I added some logs, and after reinstalling Lithops and cleaning up, the logs didn't show up, so I assume this file lives in the execution environment).
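As a starting point, a minimal k8s + Redis reproduction along these lines (a sketch of what such a test could look like, assuming both backends are configured in ~/.lithops/config) is enough to hit the error in my environment:

from lithops import FunctionExecutor

def double(i):
    return i * 2

if __name__ == '__main__':
    # k8s + Redis only: the same map that hangs in the full multi-cloud job
    fexec = FunctionExecutor(backend='k8s', storage='redis', log_level='DEBUG')
    futures = fexec.map(double, [5, 6, 7, 8])
    print(fexec.get_result(fs=futures))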

JosepSampe commented 1 year ago

One quick question: in the k8s logs, does this log appear on line 111 or line 112?

2023-09-29 09:33:18,222 [INFO] lithopsentry.py:111 -- Received range: 1-2

RaulGracia commented 1 year ago

Apparently, it appears on line 111: 2023-09-29 10:24:24,114 [INFO] lithopsentry.py:111 -- Received range: 1-2

2023-09-29 10:24:24,102 [INFO] jobrunner.py:239 -- Success function execution
2023-09-29 10:24:24,102 [DEBUG] jobrunner.py:253 -- Pickling result
2023-09-29 10:24:24,103 [INFO] jobrunner.py:311 -- Process finished
2023-09-29 10:24:24,106 [DEBUG] handler.py:182 -- JobRunner process finished
2023-09-29 10:24:24,108 [INFO] status.py:88 -- Storing execution stats - Size: 1.2KiB
2023-09-29 10:24:24,109 [INFO] handler.py:243 -- Finished
2023-09-29 10:24:24,114 [INFO] lithopsentry.py:111 -- Received range: 1-2
2023-09-29 10:24:24,115 [INFO] handler.py:60 -- Tasks received: 0 - Worker processes: 0
2023-09-29 10:24:24,116 [DEBUG] redis.py:32 -- Creating Redis storage client
2023-09-29 10:24:24,116 [INFO] redis.py:43 -- Redis client created - Host: xxx
2023-09-29 10:24:24,116 [DEBUG] utils.py:43 -- Getting function and modules
2023-09-29 10:24:24,122 [DEBUG] utils.py:88 -- Getting function data
Traceback (most recent call last):
  File "/lithops/lithopsentry.py", line 140, in <module>
    func()
  File "/lithops/lithopsentry.py", line 121, in run_job
    function_handler(payload)
  File "/lithops/lithops/worker/handler.py", line 70, in function_handler
    job_data = get_function_data(job, internal_storage)
  File "/lithops/lithops/worker/utils.py", line 93, in get_function_data
    init_byte = job.data_byte_ranges[0][0]
IndexError: list index out of range

JosepSampe commented 1 year ago

With the latest changes in the master branch it should now appear at 112. If it still appears at 111, it means the Docker image is not updated, or it is updated but not fetched by the cluster.