Closed: sergii-mamedov closed this issue 8 months ago
OK, I tried setting 8 GB of RAM and processing went through without any problems. Do you have any idea why this happened? Could there have been changes to the code responsible for this?
2024-01-30 15:48:24,851 - INFO - engine.lithops-wrapper[Thread-1] - executor.py:227 - executor.map(run_coloc_job, 4 items, 8192MB, attempt 1)
2024-01-30 15:48:24,851 - DEBUG - engine.lithops-wrapper[Thread-1] - executor.py:357 - Selected executor aws_lambda
2024-01-30 15:48:24,851 - INFO - lithops.invokers[Thread-1-ex] - invokers.py:107 - ExecutorID 66ff3d-5 | JobID M009 - Selected Runtime: metaspace-aws-lambda:3.1.0.d - 8192MB
2024-01-30 15:48:24,851 - DEBUG - lithops.storage.storage[Thread-1-ex] - storage.py:470 - Runtime metadata found in local memory cache
2024-01-30 15:48:24,851 - DEBUG - lithops.job.job[Thread-1-ex] - job.py:242 - ExecutorID 66ff3d-5 | JobID M009 - Serializing function and data
2024-01-30 15:48:24,853 - DEBUG - lithops.job.serialize[Thread-1-ex] - serialize.py:97 - Include Modules: sm
2024-01-30 15:48:24,853 - DEBUG - lithops.job.serialize[Thread-1-ex] - serialize.py:113 - Module 'sm' found in /opt/dev/metaspace/metaspace/engine/sm
2024-01-30 15:48:24,853 - DEBUG - lithops.job.serialize[Thread-1-ex] - serialize.py:118 - Modules to transmit: /opt/dev/metaspace/metaspace/engine/sm
2024-01-30 15:48:24,858 - DEBUG - lithops.job.job[Thread-1-ex] - job.py:276 - ExecutorID 66ff3d-5 | JobID M009 - Uploading function and modules to the storage backend
2024-01-30 15:48:24,952 - DEBUG - lithops.storage.backends.aws_s3.aws_s3[Thread-1-ex] - aws_s3.py:104 - PUT Object lithops.jobs/66ff3d-5/6433e09d0a451a4926949a32125022f6.func.pickle - Size: 760.7KiB - OK
2024-01-30 15:48:24,952 - DEBUG - lithops.job.job[Thread-1-ex] - job.py:302 - ExecutorID 66ff3d-5 | JobID M009 - Uploading data to the storage backend
2024-01-30 15:48:25,012 - DEBUG - lithops.storage.backends.aws_s3.aws_s3[Thread-1-ex] - aws_s3.py:104 - PUT Object lithops.jobs/66ff3d-5-M009/aggdata.pickle - Size: 519.0KiB - OK
2024-01-30 15:48:25,012 - INFO - lithops.invokers[Thread-1-ex] - invokers.py:172 - ExecutorID 66ff3d-5 | JobID M009 - Starting function invocation: run_coloc_job() - Total: 4 activations
2024-01-30 15:48:25,012 - DEBUG - lithops.invokers[Thread-1-ex] - invokers.py:177 - ExecutorID 66ff3d-5 | JobID M009 - Worker processes: 1 - Chunksize: 1
2024-01-30 15:48:25,012 - DEBUG - lithops.invokers[Thread-1-ex] - invokers.py:425 - ExecutorID 66ff3d-5 | JobID M009 - Free workers: 564 - Going to run 4 activations in 4 workers
2024-01-30 15:48:25,012 - INFO - lithops.invokers[Thread-1-ex] - invokers.py:208 - ExecutorID 66ff3d-5 | JobID M009 - View execution logs at /tmp/lithops-ubuntu/logs/66ff3d-5-M009.log
2024-01-30 15:48:25,012 - DEBUG - lithops.monitor[Thread-27] - monitor.py:381 - ExecutorID 66ff3d-5 - Starting Storage job monitor
2024-01-30 15:48:25,012 - INFO - lithops.wait[Thread-1-ex] - wait.py:98 - ExecutorID 66ff3d-5 - Getting results from 4 function activations
2024-01-30 15:48:25,035 - DEBUG - lithops.invokers[ThreadPoolExecutor-231_48] - invokers.py:371 - ExecutorID 66ff3d-5 | JobID M009 - Calls 00001 invoked (0.022s) - Activation ID: 5924283e-3222-4f2a-8199-abdcc18081ab
2024-01-30 15:48:25,037 - DEBUG - lithops.invokers[ThreadPoolExecutor-231_33] - invokers.py:371 - ExecutorID 66ff3d-5 | JobID M009 - Calls 00000 invoked (0.025s) - Activation ID: cf1b05e8-91e9-40d6-bada-7fddfa8ea38e
2024-01-30 15:48:25,041 - DEBUG - lithops.invokers[ThreadPoolExecutor-231_25] - invokers.py:371 - ExecutorID 66ff3d-5 | JobID M009 - Calls 00003 invoked (0.026s) - Activation ID: d422eda5-73fe-4241-b705-612898a7b115
2024-01-30 15:48:25,045 - DEBUG - lithops.invokers[ThreadPoolExecutor-231_55] - invokers.py:371 - ExecutorID 66ff3d-5 | JobID M009 - Calls 00002 invoked (0.031s) - Activation ID: 734ddf49-75c0-4ba6-a7be-fd8db5fbfa66
2024-01-30 15:48:27,074 - DEBUG - lithops.monitor[Thread-27] - monitor.py:127 - ExecutorID 66ff3d-5 - Pending: 1 - Running: 3 - Done: 0
2024-01-30 15:48:29,113 - DEBUG - lithops.monitor[Thread-27] - monitor.py:127 - ExecutorID 66ff3d-5 - Pending: 0 - Running: 4 - Done: 0
2024-01-30 15:49:01,760 - DEBUG - lithops.monitor[Thread-27] - monitor.py:127 - ExecutorID 66ff3d-5 - Pending: 0 - Running: 4 - Done: 0
2024-01-30 15:49:09,945 - DEBUG - lithops.monitor[Thread-27] - monitor.py:127 - ExecutorID 66ff3d-5 - Pending: 0 - Running: 3 - Done: 1
2024-01-30 15:49:10,068 - DEBUG - lithops.future[ThreadPoolExecutor-813_0] - future.py:273 - ExecutorID 66ff3d-5 | JobID M009 - Got status from call 00001 - Activation ID: 5924283e-3222-4f2a-8199-abdcc18081ab - Time: 42.89 seconds
2024-01-30 15:49:10,068 - DEBUG - lithops.future[ThreadPoolExecutor-813_0] - future.py:293 - ExecutorID 66ff3d-5 | JobID M009 - Got output from call 00001 - Activation ID: 5924283e-3222-4f2a-8199-abdcc18081ab
2024-01-30 15:49:14,562 - DEBUG - lithops.monitor[Thread-27] - monitor.py:127 - ExecutorID 66ff3d-5 - Pending: 0 - Running: 2 - Done: 2
2024-01-30 15:49:15,075 - DEBUG - lithops.future[ThreadPoolExecutor-822_0] - future.py:273 - ExecutorID 66ff3d-5 | JobID M009 - Got status from call 00002 - Activation ID: 734ddf49-75c0-4ba6-a7be-fd8db5fbfa66 - Time: 45.39 seconds
2024-01-30 15:49:15,075 - DEBUG - lithops.future[ThreadPoolExecutor-822_0] - future.py:293 - ExecutorID 66ff3d-5 | JobID M009 - Got output from call 00002 - Activation ID: 734ddf49-75c0-4ba6-a7be-fd8db5fbfa66
2024-01-30 15:49:45,651 - DEBUG - lithops.monitor[Thread-27] - monitor.py:127 - ExecutorID 66ff3d-5 - Pending: 0 - Running: 2 - Done: 2
2024-01-30 15:50:18,219 - DEBUG - lithops.monitor[Thread-27] - monitor.py:127 - ExecutorID 66ff3d-5 - Pending: 0 - Running: 2 - Done: 2
2024-01-30 15:50:50,804 - DEBUG - lithops.monitor[Thread-27] - monitor.py:127 - ExecutorID 66ff3d-5 - Pending: 0 - Running: 2 - Done: 2
2024-01-30 15:51:23,344 - DEBUG - lithops.monitor[Thread-27] - monitor.py:127 - ExecutorID 66ff3d-5 - Pending: 0 - Running: 2 - Done: 2
2024-01-30 15:51:55,855 - DEBUG - lithops.monitor[Thread-27] - monitor.py:127 - ExecutorID 66ff3d-5 - Pending: 0 - Running: 2 - Done: 2
2024-01-30 15:52:08,207 - DEBUG - lithops.monitor[Thread-27] - monitor.py:127 - ExecutorID 66ff3d-5 - Pending: 0 - Running: 1 - Done: 3
2024-01-30 15:52:08,279 - DEBUG - lithops.future[ThreadPoolExecutor-1082_0] - future.py:273 - ExecutorID 66ff3d-5 | JobID M009 - Got status from call 00003 - Activation ID: d422eda5-73fe-4241-b705-612898a7b115 - Time: 221.39 seconds
2024-01-30 15:52:08,279 - DEBUG - lithops.future[ThreadPoolExecutor-1082_0] - future.py:293 - ExecutorID 66ff3d-5 | JobID M009 - Got output from call 00003 - Activation ID: d422eda5-73fe-4241-b705-612898a7b115
2024-01-30 15:52:20,908 - DEBUG - lithops.monitor[Thread-27] - monitor.py:127 - ExecutorID 66ff3d-5 - Pending: 0 - Running: 0 - Done: 4
2024-01-30 15:52:20,908 - DEBUG - lithops.monitor[Thread-27] - monitor.py:409 - ExecutorID 66ff3d-5 - Storage job monitor finished
2024-01-30 15:52:21,296 - DEBUG - lithops.future[ThreadPoolExecutor-1103_0] - future.py:273 - ExecutorID 66ff3d-5 | JobID M009 - Got status from call 00000 - Activation ID: cf1b05e8-91e9-40d6-bada-7fddfa8ea38e - Time: 232.55 seconds
2024-01-30 15:52:21,296 - DEBUG - lithops.future[ThreadPoolExecutor-1103_0] - future.py:293 - ExecutorID 66ff3d-5 | JobID M009 - Got output from call 00000 - Activation ID: cf1b05e8-91e9-40d6-bada-7fddfa8ea38e
2024-01-30 15:52:21,297 - INFO - lithops.executors[Thread-1-ex] - executors.py:596 - ExecutorID 66ff3d-5 - Cleaning temporary data
2024-01-30 15:52:21,345 - DEBUG - lithops.executors[Thread-1-ex] - executors.py:506 - ExecutorID 66ff3d-5 - Finished getting results
If the problem is related to memory, can you check worker_peak_memory_start and worker_peak_memory_end in the function stats to see whether the function that fails has a "weird" memory consumption?
Additionally, in your function's code you can use the get_memory_usage() function to log the memory usage at the moment it is called. You can print(get_memory_usage()) multiple times inside your code, watch in the logs how the memory usage evolves, and see which instruction produces the highest memory consumption.
Edit: I now see that Lambda shows the memory usage at the end of the logs.
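If get_memory_usage() is not available in your setup, a rough stdlib-only stand-in serves the same purpose. The helper name and the job body below are hypothetical; the idea is just to print peak RSS before and after the allocation-heavy step:

```python
import resource
import sys

def get_memory_usage_mb():
    """Stdlib-only memory probe: peak RSS of this process in MiB.
    ru_maxrss is reported in KiB on Linux but in bytes on macOS."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss / (1024 * 1024) if sys.platform == "darwin" else rss / 1024

def run_coloc_job(chunk):
    # Hypothetical job body: log memory around the allocation-heavy step.
    print(f"before load: {get_memory_usage_mb():.1f} MiB")
    data = [0] * 1_000_000  # stand-in for the real data-loading step
    print(f"after load:  {get_memory_usage_mb():.1f} MiB")
    return len(data)
```

Sprinkling such prints through the function makes the CloudWatch logs show exactly which instruction drives the peak.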
So, the issue appears when you first invoke with 4 GB, one or more function calls crash because of an OOM, and you then re-invoke the same function with 8 GB? If you directly invoke with 8 GB, it always works fine?
We have wrappers around the Lithops Executor class and also around the map function. In them we implemented the ability to choose the type of executor according to the required amount of RAM. We also catch MemoryError and TimeoutError and re-invoke the function with more RAM. An example of the code that implements this:
```python
futures, return_vals, exc = self._dispatch_map(
    wrapper_func, func_args, runtime_memory, debug_run_locally, lithops_kwargs
)
# ................
if (
    isinstance(exc, (MemoryError, TimeoutError, OSError))
    and runtime_memory <= max(MEM_LIMITS.values())
    and (max_memory is None or runtime_memory < max_memory)
):
    attempt += 1
    old_memory = runtime_memory
    runtime_memory *= 2
    # ................
    futures, return_vals, exc = self._dispatch_map(
        wrapper_func, func_args, runtime_memory, debug_run_locally, lithops_kwargs
    )
```
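A minimal self-contained sketch of this retry-with-doubled-memory pattern (dispatch_map, job, and the MEM_LIMITS values are all illustrative stand-ins, not the real wrapper code):

```python
MEM_LIMITS = {"aws_lambda": 8192, "ec2": 32768}  # MiB per executor type (assumed values)

def dispatch_map(func, args, runtime_memory):
    """Pretend dispatcher: runs func over args, returns (results, exception)."""
    try:
        return [func(a, runtime_memory) for a in args], None
    except (MemoryError, TimeoutError, OSError) as e:
        return None, e

def map_with_retry(func, args, runtime_memory, max_memory=None):
    """Re-dispatch with doubled memory on OOM/timeout, mirroring the snippet above."""
    while True:
        results, exc = dispatch_map(func, args, runtime_memory)
        if exc is None:
            return results
        can_retry = (
            isinstance(exc, (MemoryError, TimeoutError, OSError))
            and runtime_memory <= max(MEM_LIMITS.values())
            and (max_memory is None or runtime_memory < max_memory)
        )
        if not can_retry:
            raise exc
        runtime_memory *= 2  # escalate: 4 GB -> 8 GB -> ...

def job(x, mem):
    # Simulated workload that OOMs below 8192 MiB.
    if mem < 8192:
        raise MemoryError("simulated OOM at low memory")
    return x * 2

print(map_with_retry(job, [1, 2, 3], 4096))  # -> [2, 4, 6] after one escalation
```

The first dispatch at 4096 MiB fails, the second at 8192 MiB succeeds, which matches the 4 GB -> 8 GB scenario discussed in this thread.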
This approach worked great with IBM Cloud Functions, IBM Code Engine and IBM VPC. After migrating to AWS, it still works in the case where, after invoking the Lambda function, we get a MemoryError (8 GB of RAM is not enough in our case) and re-invoke on EC2.
But when the first invocation (AWS Lambda) uses 4 GB and crashes due to OOM, and we then run the same function with 8 GB of RAM, we hit the problem described above.
My hypothesis is that Lithops checks the status of running executors via the JobMonitor. During the restart that we do on our side (4 GB -> 8 GB) the `Activation ID` changes, but the JobMonitor only knows about the old activations. The last line of the log I cited led me to this hypothesis:
2024-01-30 14:51:13,269 - DEBUG - lithops.monitor[Thread-14] - monitor.py:409 - ExecutorID 290acf-0 - Storage job monitor finished
P.S. Yes, if I immediately allocate 8 GB of RAM, everything works without a problem.
P.P.S. This problem occurs no matter what initial amount of RAM we use.
Can you check the ephemeral storage assigned to each worker?
In my experience, workers with low memory do not throw exceptions; they just take more time, since each worker has fewer CPUs assigned to it (unless you detect that via your custom executor/map). On the other hand, Lambdas with low ephemeral storage throw memory-related exceptions when the threshold is surpassed.
@abourramouss We do not use ephemeral storage at all. In addition, this problem appears for different amounts of RAM, from 1 to 4 GB inclusive.
Hi @sergii-mamedov. I added this patch, which is aimed at fixing the issue you experienced in this thread. Could you test with the master branch?
Thanks @JosepSampe . I will test it tomorrow.
@JosepSampe Works well. Waiting for new release :)
Hi @sergii-mamedov, I just created version 3.1.1
thanks a lot @JosepSampe
Hi @JosepSampe !
I have a strange situation with a small subset of datasets at one particular step. This step is characterized by higher RAM consumption and, for some datasets, a duration of 2-4 minutes. It also occurs when there is not enough RAM for at least one executor and we start again with a larger amount of RAM. Below I provide the debug logs as well as the logs of the Lambda functions.
lithops DEBUG logs
AWS lambda logs
27c11211-4e68-4d85-a6f8-6e9701002adb
7e566c35-c1d3-4516-ab53-29dfd32b9e48
85545c28-d632-49e1-b143-8171e1c38fb5
5a00f31f-ec62-49c0-8fb1-21995c88fe85
My only hypothesis is that, because we repeatedly restart executors with double the amount of memory, something goes wrong inside Lithops, although we apply the same logic to other steps on AWS and did the same earlier with IBM. I'm also wondering why there are no monitor records for the last call right away; they should appear at least once every 30 seconds, as far as I can see from the source code.
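For reference, the "progress line at least every ~30 seconds" behaviour can be sketched as a generic capped-interval polling loop. This is purely illustrative (names and the doubling policy are assumptions, not the actual lithops.monitor code), and it records the intervals it would sleep instead of sleeping, to stay deterministic:

```python
def monitor(poll, total, max_interval=30.0):
    """Illustrative storage-monitor sketch: poll the number of finished
    calls until all `total` are done. The interval doubles but is capped
    at `max_interval`, so a progress line appears at least every 30 s."""
    lines, waits, interval = [], [], 1.0
    while True:
        done = poll()
        lines.append(f"Pending/Running: {total - done} - Done: {done}")
        if done >= total:
            return lines, waits
        waits.append(interval)       # would sleep here in a real monitor
        interval = min(interval * 2, max_interval)

states = iter([1, 3, 4])
lines, waits = monitor(lambda: next(states), total=4)
print(lines[-1])  # -> Pending/Running: 0 - Done: 4
print(waits)      # -> [1.0, 2.0]
```

Under a scheme like this, a call that finishes between polls would still show up within one capped interval, so a long silence for the last call does look suspicious.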