lithops-cloud / lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
http://lithops.cloud
Apache License 2.0

Localhost backend v2 causes intermittent failures #1369

Closed: tomwhite closed this issue 4 months ago

tomwhite commented 4 months ago

I have only seen this on Linux CI (see https://github.com/cubed-dev/cubed/pull/471), and haven't been able to reproduce locally on a Mac.

The fix is to use v1 of the localhost backend - this should probably be made the default again until the problem is diagnosed and fixed.

@JosepSampe what is the difference between the two versions? I couldn't see it explained anywhere in GH issues but perhaps I missed it.

Also, I noticed that the documentation says the default is v1, and that v2 is beta: https://lithops-cloud.github.io/docs/source/compute_config/localhost.html#summary-of-configuration-keys-for-localhost.
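
For anyone else hitting this, pinning v1 through an in-code config looks roughly like the snippet below. I'm taking the localhost version key from that docs page, so treat the exact layout as an assumption rather than a verified recipe.

from lithops import FunctionExecutor

# Sketch of pinning the localhost backend to v1 (key names assumed from the docs page above)
config = {
    'lithops': {'backend': 'localhost'},
    'localhost': {'version': 1},
}

fexec = FunctionExecutor(config=config)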

JosepSampe commented 4 months ago

The issue that explains v1 and v2 is https://github.com/lithops-cloud/lithops/issues/1174

In short, the difference between v1 and v2 is the granularity of execution. In v1 the granularity is at the job level and in v2 the granularity is at the task level. Let me explain this with an example:

Let's assume that we have 8 cores on our machine, a function called my_func that takes 5 minutes to complete, and the following code:

import time

from lithops import FunctionExecutor

def my_func(x):
    time.sleep(300)  # each task takes 5 minutes

fexec = FunctionExecutor()
fexec.map(my_func, range(1))  # first map: a single task
fexec.map(my_func, range(7))  # second map: 7 more tasks
fexec.get_result()

In v1, each map() call becomes a job and jobs are executed one after the other: the second job's 7 tasks cannot start until the first job (a single 5-minute task) has finished, so the whole run takes about 10 minutes even though the machine has 8 cores available. In v2, tasks are queued individually, so all 8 tasks run in parallel across the 8 cores and the run takes about 5 minutes. So, v2 is more precise in reproducing what you will experience when running tasks in the cloud.

In addition, the recently added "Retry" feature will not work as expected with version 1: if you have a 10-minute straggler, the retry logic will invoke a new task, but that new task will not run until the straggler finishes, which is not the expected behavior in this case.
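
As a toy illustration in plain Python (thread pools as an analogy, not Lithops code): with task-level granularity the retry runs as soon as a core is free, whereas with job-level granularity it has to wait for the straggler's whole job to finish.

import time
from concurrent.futures import ThreadPoolExecutor, wait

def task(seconds):
    time.sleep(seconds)
    return seconds

CORES = 4

# v2 style: a single shared task queue, so the retry runs on a free core immediately
start = time.time()
with ThreadPoolExecutor(max_workers=CORES) as pool:
    straggler = pool.submit(task, 2.0)   # the slow task
    retry = pool.submit(task, 0.1)       # the retry
    wait([retry])
    print(f"task-level: retry finished after {time.time() - start:.1f}s")   # ~0.1s

# v1 style: the retry is a new job that cannot start until the previous job is done
start = time.time()
with ThreadPoolExecutor(max_workers=CORES) as pool:
    wait([pool.submit(task, 2.0)])       # job 1, containing the straggler
with ThreadPoolExecutor(max_workers=CORES) as pool:
    wait([pool.submit(task, 0.1)])       # job 2, the retry, only starts now
print(f"job-level: retry finished after {time.time() - start:.1f}s")        # ~2.1s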

tomwhite commented 4 months ago

Thanks for the explanation @JosepSampe! That makes a lot of sense, and v2 will certainly help with local retries (and backup tasks).

Given that v2 seems less reliable, do you think we should change it so it's not the default? (I'm afraid I can't provide a good reproducible example, though!)

JosepSampe commented 4 months ago

I'm now investigating the issue in my cubed fork. I activated the debug logs and ran the SlowTests, and so far it has failed 1 out of 10 times.

Based on the logs, it seems the issue is located in the worker's code, since the localhost backend main code seems to work correctly:

2024-06-04T15:36:30.4305755Z INFO     lithops.invokers:invokers.py:174 ExecutorID 9d522b-18 | JobID M000 - Starting function invocation: <lambda>() - Total: 2 activations
2024-06-04T15:36:30.4308357Z DEBUG    lithops.invokers:invokers.py:200 ExecutorID 9d522b-18 | JobID M000 - Worker processes: 4 - Chunksize: 4
2024-06-04T15:36:30.4310652Z DEBUG    lithops.localhost.v2.localhost:localhost.py:137 ExecutorID 9d522b-18 | JobID M000 - Running 2 activations in the localhost worker
2024-06-04T15:36:30.4313782Z DEBUG    lithops.localhost.v2.localhost:localhost.py:102 Staring localhost job manager
2024-06-04T15:36:30.4315476Z DEBUG    lithops.localhost.v2.localhost:localhost.py:258 Starting Localhost work queue consumer threads
2024-06-04T15:36:30.4324323Z DEBUG    lithops.localhost.v2.localhost:localhost.py:322 Going to execute task process 9d522b-18-M000-00000
2024-06-04T15:36:30.4329182Z DEBUG    lithops.localhost.v2.localhost:localhost.py:322 Going to execute task process 9d522b-18-M000-00001
2024-06-04T15:36:30.4330431Z DEBUG    lithops.invokers:invokers.py:259 ExecutorID 9d522b-18 | JobID M000 - Job invoked (0.002s) - Activation ID: 9d522b-18-M000
2024-06-04T15:36:30.4386672Z INFO     lithops.invokers:invokers.py:213 ExecutorID 9d522b-18 | JobID M000 - View execution logs at /tmp/lithops-runner/logs/9d522b-18-M000.log
2024-06-04T15:36:30.4451812Z DEBUG    lithops.monitor:monitor.py:429 ExecutorID 9d522b-18 - Starting Storage job monitor
2024-06-04T15:36:30.4455542Z INFO     lithops.wait:wait.py:105 ExecutorID 9d522b-18 - Waiting for 0% of 2 function activations to complete
2024-06-04T15:36:30.5482793Z DEBUG    lithops.monitor:monitor.py:144 ExecutorID 9d522b-18 - Pending: 2 - Running: 0 - Done: 0
2024-06-04T15:36:30.7149994Z DEBUG    lithops.localhost.v2.localhost:localhost.py:328 Task process 9d522b-18-M000-00001 finished
2024-06-04T15:36:30.7514894Z DEBUG    lithops.monitor:monitor.py:144 ExecutorID 9d522b-18 - Pending: 1 - Running: 1 - Done: 0
2024-06-04T15:36:31.0086507Z DEBUG    lithops.localhost.v2.localhost:localhost.py:328 Task process 9d522b-18-M000-00000 finished
2024-06-04T15:36:31.0088014Z DEBUG    lithops.localhost.v2.localhost:localhost.py:114 Localhost job manager finished
2024-06-04T15:36:31.0561917Z DEBUG    lithops.monitor:monitor.py:144 ExecutorID 9d522b-18 - Pending: 1 - Running: 0 - Done: 1
2024-06-04T15:36:31.4477251Z INFO     lithops.wait:wait.py:105 ExecutorID 9d522b-18 - Waiting for 0% of 2 function activations to complete
2024-06-04T15:36:31.4486421Z DEBUG    lithops.future:future.py:229 ExecutorID 9d522b-18 | JobID M000 - Got status from call 00000 - Activation ID: e440f508cb4b - Time: 0.27 seconds
2024-06-04T15:36:32.4502281Z INFO     lithops.wait:wait.py:105 ExecutorID 9d522b-18 - Waiting for 0% of 1 function activations to complete
2024-06-04T15:36:33.4517835Z INFO     lithops.wait:wait.py:105 ExecutorID 9d522b-18 - Waiting for 0% of 1 function activations to complete
2024-06-04T15:36:34.4532369Z INFO     lithops.wait:wait.py:105 ExecutorID 9d522b-18 - Waiting for 0% of 1 function activations to complete

As you can see, in the logs we have:

-> Going to execute task process 9d522b-18-M000-00000
-> Going to execute task process 9d522b-18-M000-00001

and:

-> Task process 9d522b-18-M000-00001 finished
-> Task process 9d522b-18-M000-00000 finished

This means that both processes started and ended, but according to the Lithops logs, one of them probably ended due to some uncaught exception, so the taskstatus.txt file is never written to the localhost storage backend, which makes the monitor wait forever.
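
To make the failure mode concrete, here is a minimal hypothetical sketch (not the actual Lithops runner code): if the status file write only happens on the happy path, an uncaught exception in the task process leaves nothing for the storage monitor to poll, so it keeps reporting pending activations forever.

import json
import os

def run_task(task_id, fn, args, status_dir):
    # Hypothetical worker wrapper, only to illustrate the failure mode
    status = {'task_id': task_id, 'done': False, 'error': None}
    try:
        fn(*args)
        status['done'] = True
    except Exception as e:
        # Without this except/finally, an uncaught exception would mean the
        # status file is never written and the monitor waits forever
        status['error'] = repr(e)
    finally:
        with open(os.path.join(status_dir, f'{task_id}.status.json'), 'w') as f:
            json.dump(status, f)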

I suspect the problem may be located in the runner.py file, since in v1 this file is started only once, in a single subprocess, whereas in v2 it is started as many times as there are tasks in the map, in multiple parallel subprocesses at the same time.

I'm continuing to investigate to verify this.

JosepSampe commented 4 months ago

Finally, I managed to see the full logs, and as I suspected, the issue was located in the worker's code, not in the localhost v2 implementation. Probably, the way v2 works causes this issue to appear from time to time, whereas it doesn't appear in v1.

The issue is related to this https://github.com/lithops-cloud/lithops/issues/1346. It seems that, apart from the stated issue there, a different issue can still appear from time to time due to the code that loads the function object from cache/storage.
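
To illustrate the class of problem (a hypothetical sketch, not necessarily the actual code or the actual fix on master): when several runner processes start at the same time, a naive "check, then write" of the cached function file can let one process read a partially written file. Writing to a temporary file and renaming it atomically avoids that:

import os
import tempfile

def load_cached_function(cache_path, fetch_from_storage):
    # Hypothetical sketch of an atomic cache write, not the actual Lithops code
    if not os.path.exists(cache_path):
        data = fetch_from_storage()  # download the serialized function
        fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(cache_path))
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
        os.replace(tmp_path, cache_path)  # atomic rename: readers never see a partial file
    with open(cache_path, 'rb') as f:
        return f.read()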

It is now fixed on the master branch. I will continue running tests, and if the issue does not reappear, I will create a new release, 3.4.1, with this fix.

tomwhite commented 4 months ago

Thanks for investigating this @JosepSampe!

JosepSampe commented 4 months ago

I've been running the cubed Slow Tests once an hour for 3 days, and the issue hasn't appeared since I added the fix. So this issue seems to be fixed. I have just released version 3.4.1, so feel free to use either localhost v1 or v2.

tomwhite commented 4 months ago

Excellent news - thanks again Josep!