Closed foster999 closed 11 months ago
Need to check. We always had the same error handler in the local driver.
Based on this log:
TypeError: unpickle_exception() takes 4 positional arguments but 7 were given
Maybe something has changed in Python 3.11?
I was thinking something similar. I'll check what is coming back from the workers with a debugger tomorrow.
One of the pickled worker errors comes back as:
b"\x80\x04\x95\x0b\x02\x00\x00\x00\x00\x00\x00\x8c\x08builtins\x94\x8c\x13ModuleNotFoundError\x94\x93\x94\x8c\x16tblib.pickling_support\x94\x8c\x12unpickle_exception\x94\x93\x94(h\x02\x8c\x1dNo module named \'engineering\'\x94\x85\x94Nh\x03\x8c\x12unpickle_traceback\x94\x93\x94\x8c\x05tblib\x94\x8c\x05Frame\x94\x93\x94)\x81\x94}\x94(\x8c\x08f_locals\x94}\x94\x8c\tf_globals\x94}\x94(\x8c\x08__name__\x94\x8c\x18lithops.worker.jobrunner\x94\x8c\x08__file__\x94\x8c#/action/lithops/worker/jobrunner.py\x94u\x8c\x06f_code\x94h\n\x8c\x04Code\x94\x93\x94)\x81\x94}\x94(\x8c\x0bco_filename\x94h\x16\x8c\x07co_name\x94\x8c\x03run\x94\x8c\x0bco_argcount\x94K\x00\x8c\x11co_kwonlyargcount\x94K\x00\x8c\x0bco_varnames\x94)\x8c\nco_nlocals\x94K\x00\x8c\x0cco_stacksize\x94K\x00\x8c\x08co_flags\x94K@\x8c\x0eco_firstlineno\x94K\x00ub\x8c\x08f_lineno\x94M\x10\x01ubK\xd2N\x87\x94R\x94N\x89Nt\x94R\x94}\x94\x8c\x04name\x94\x8c\x0bengineering\x94sbh(\x87\x94."
But if I reproduce the error locally and pickle it I get:
b"\x80\x04\x95c\x00\x00\x00\x00\x00\x00\x00\x8c\x08builtins\x94\x8c\x13ModuleNotFoundError\x94\x93\x94\x8c\x1eNo module named 'engineering'\x94\x85\x94R\x94}\x94\x8c\x04name\x94\x8c\x0cengineering\x94sb."
The local variant is much shorter and unpickles without issue, while the one from the remote worker is longer and raises the same exception as above from unpickle_exception().
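One way to compare two payloads like the ones above without unpickling them is to list the string literals they contain. A payload that names tblib.pickling_support can only be loaded where a compatible tblib is importable. The following is a minimal sketch using only the standard library; the helper name pickle_strings is made up for illustration, and a locally built exception stands in for the worker payload:

```python
import pickle
import pickletools


def pickle_strings(payload: bytes) -> list:
    """Return every string literal stored in a pickle, without loading it."""
    # genops() walks the opcode stream and yields (opcode, argument, position).
    # Module and class names appear as string arguments.
    return [arg for _op, arg, _pos in pickletools.genops(payload)
            if isinstance(arg, str)]


# Stand-in for the exception raised on the worker.
exc = ModuleNotFoundError("No module named 'engineering'", name="engineering")
payload = pickle.dumps(exc, protocol=4)

strings = pickle_strings(payload)
# True only if the payload refers to something under the tblib package.
uses_tblib = any(s.startswith("tblib") for s in strings)

print(strings)
print("references tblib:", uses_tblib)
```

Running the same helper on the bytes received from the worker should show whether unpickling will go through tblib's reconstruction function.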
Sorry, I've just seen that you actually pickle sys.exc_info() rather than the Exception object. When I try this locally I get TypeError: cannot pickle 'traceback' object
I realised that you need to install the helper from tblib. Now, when I reproduce the error following this approach from the runner, I get:
b"\x80\x04\x95\xd4\x01\x00\x00\x00\x00\x00\x00\x8c\x08builtins\x94\x8c\x13ModuleNotFoundError\x94\x93\x94\x8c\x16tblib.pickling_support\x94\x8c\x12unpickle_exception\x94\x93\x94(h\x02\x8c\x1eNo module named 'engineeringg'\x94\x85\x94Nh\x03\x8c\x12unpickle_traceback\x94\x93\x94\x8c\x05tblib\x94\x8c\x05Frame\x94\x93\x94)\x81\x94}\x94(\x8c\x08f_locals\x94}\x94\x8c\tf_globals\x94}\x94\x8c\x08__name__\x94\x8c\x08__main__\x94s\x8c\x06f_code\x94h\n\x8c\x04Code\x94\x93\x94)\x81\x94}\x94(\x8c\x0bco_filename\x94\x8c\x07<stdin>\x94\x8c\x07co_name\x94\x8c\x08<module>\x94\x8c\x0bco_argcount\x94K\x00\x8c\x11co_kwonlyargcount\x94K\x00\x8c\x0bco_varnames\x94)\x8c\nco_nlocals\x94K\x00\x8c\x0cco_stacksize\x94K\x00\x8c\x08co_flags\x94K@\x8c\x0eco_firstlineno\x94K\x00ub\x8c\x08f_lineno\x94K\x05ubK\x02N\x87\x94R\x94t\x94R\x94}\x94\x8c\x04name\x94\x8c\x0cengineeringg\x94sbh'\x87\x94."
This unpickles fine.
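The two steps described above can be sketched roughly as follows. This is an illustration under assumptions, not the jobrunner code: the missing module name is taken from the thread, the first half uses only the standard library, and the second half runs only if tblib happens to be installed.

```python
import pickle
import sys


def capture_exc_info():
    """Trigger a ModuleNotFoundError and return sys.exc_info() for it."""
    try:
        import engineering  # noqa: F401  (assumed to be missing, as in the thread)
    except ModuleNotFoundError:
        return sys.exc_info()


exc_info = capture_exc_info()

# Step 1: plain pickle cannot serialise the traceback in the tuple.
try:
    pickle.dumps(exc_info)
    plain_error = None
except TypeError as err:
    plain_error = str(err)
print("without tblib helper:", plain_error)

# Step 2: with tblib's pickling support installed, the same tuple round-trips.
try:
    from tblib import pickling_support
except ImportError:
    pickling_support = None  # tblib not available; skip this half

if pickling_support is not None:
    pickling_support.install()
    restored_type, restored_value, restored_tb = pickle.loads(pickle.dumps(exc_info))
    print("with tblib helper:", restored_type.__name__, restored_value)
```

The round trip here happens inside a single process, so the same tblib version writes and reads the payload. That is why a local reproduction like this can succeed even when the driver and the worker disagree.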
@foster999 @sergii-mamedov Do you know a way to reproduce this issue? I've been trying multiple things, but the exceptions are correctly handled on my side:
2023-10-26 17:55:05,246 [WARNING] future.py:249 -- ExecutorID a5d9de-0 | JobID A000 - There was an exception - Activation ID: b1cd2d434d9543498d2d434d9543491c
Traceback (most recent call last):
File "/action/lithops/worker/jobrunner.py", line 236, in run
File "/home/josep/dev-workspace/lithops/lithops/scripts/cli.py", line 197, in hello
import oss2
ModuleNotFoundError: No module named 'oss2'
I did more experiments and I think the problem is a tblib version mismatch between the cloud runtime and the local driver. A new version was released 4 days ago, so the runtime probably contains the most recent version while your local driver still has the previous one. Running python3.X -m pip install -U tblib on my localhost resolved the issue.
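A quick way to check for this kind of mismatch is to print the tblib version on the driver and compare it with the one in the runtime image. A minimal sketch with the standard library follows; the function names are made up for illustration, and remote_tblib is a placeholder value you would replace with whatever the runtime actually reports:

```python
from importlib import metadata


def installed_version(dist_name: str):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def versions_match(local, remote) -> bool:
    """True only when both sides report the same, known version string."""
    return local is not None and local == remote


local_tblib = installed_version("tblib")
remote_tblib = "3.0.0"  # hypothetical: fill in the version from the runtime

print("local tblib:", local_tblib)
print("matches runtime:", versions_match(local_tblib, remote_tblib))
```

A TypeError about the number of positional arguments, like the one at the top of the thread, is what you would expect when the pickled payload was written by one version of a reconstruction function and read by another with a different signature.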
Thanks for persisting @JosepSampe, that worked for me!
I had a pinned version of tblib in my Docker image. Now all good. Thanks @JosepSampe
Running an example similar to #1172 with ibm_cos storage and the ibm_cf backend raises errors on the workers, but the error handling on the local driver fails:
I've had a few different errors on the workers, which are separate issues I'm working through. Each seems to be a standard Python exception. This is an example: