lithops-cloud / lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
http://lithops.cloud
Apache License 2.0

[Standalone] BUG: several tasks fail and get stuck in pending #1346

Closed: kikemolina3 closed this issue 3 months ago

kikemolina3 commented 3 months ago

Hello,

While experimenting with launching a large map (e.g. 1200 tasks) in EC2 standalone mode, I noticed that some of the tasks never got past the Pending status. I had been running experiments for some time before I encountered this problem, so the failure rate is low: in a map stage of 1200 tasks, only about 5 to 10 tasks fail.

Looking at the localhost-runner.log file on the worker VM, I found the error EOFError: Ran out of input, raised inside the get_function_and_modules function, and only for the failed processes.

This suggests to me that some workers try to read the pickled function file while its size is still 0 bytes (perhaps the writer process still has it open?).
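To illustrate the suspected race, here is a minimal sketch in plain Python. It is not the Lithops code and not the fix from the referenced commit. It first reproduces the same error message by unpickling a zero-byte file, then shows one common way to avoid exposing a half-written file: write to a temporary file in the same directory and rename it into place. The names write_pickle_atomically, demo, and func.pickle are hypothetical and chosen only for this example.

```python
import os
import pickle
import tempfile


def write_pickle_atomically(obj, path):
    """Hypothetical helper: pickle obj to a temp file, then rename it to path.

    Readers either see the previous file (or no file) or the complete new
    one, never a partially written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem)
    # so that os.replace can rename it in a single step.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def demo():
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "func.pickle")

        # Simulate a writer that has created the file but not written to it yet.
        open(path, "wb").close()

        # A reader arriving at this moment fails on the empty file.
        try:
            with open(path, "rb") as f:
                pickle.load(f)
            error = None
        except EOFError as exc:
            error = str(exc)

        # With the atomic write, the reader gets the complete payload.
        write_pickle_atomically({"func": "payload"}, path)
        with open(path, "rb") as f:
            loaded = pickle.load(f)

        return error, loaded


if __name__ == "__main__":
    print(demo())
```

Another option, under the same assumption about the cause, would be for the reader to retry when the file is empty. The rename approach is shown here because it removes the window in which an empty file is visible at the final path.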

I will open a PR to try to solve this problem.

Have a nice weekend!

kikemolina3 commented 3 months ago

I just realized that this case was already addressed very recently (a few commits ago: https://github.com/lithops-cloud/lithops/commit/267601a0de46988ffa18694a0f9c9c58bfcae970). Please feel free to close this issue and the related PR if you had already resolved the bug.

kikemolina3 commented 3 months ago

Closing after confirming that the issue was already resolved in the master branch.