lithops-cloud / lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
http://lithops.cloud
Apache License 2.0

Errors not handled from workers when using sklearn interface #1179

Closed foster999 closed 11 months ago

foster999 commented 12 months ago

Running an example similar to #1172, with the ibm_cos storage backend and the ibm_cf compute backend, produces errors on the workers, but the error handling on the local driver fails:

Traceback (most recent call last):
  File "/engineering/model.py", line 146, in <module>
    grid_search.fit(spectra_for_training, classifications)
  File "/venv/lib/python3.11/site-packages/sklearn/base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/sklearn/model_selection/_search.py", line 898, in fit
    self._run_search(evaluate_candidates)
  File "/venv/lib/python3.11/site-packages/sklearn/model_selection/_search.py", line 1422, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
  File "/venv/lib/python3.11/site-packages/sklearn/model_selection/_search.py", line 845, in evaluate_candidates
    out = parallel(
          ^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/sklearn/utils/parallel.py", line 65, in __call__
    return super().__call__(iterable_with_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/joblib/parallel.py", line 1952, in __call__
    return output if self.return_generator else list(output)
                                                ^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/joblib/parallel.py", line 1595, in _get_outputs
    yield from self._retrieve()
  File "/venv/lib/python3.11/site-packages/joblib/parallel.py", line 1717, in _retrieve
    batched_results = batched_results.get_result(self.timeout)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/joblib/parallel.py", line 749, in get_result
    return self._return_or_raise()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/joblib/parallel.py", line 754, in _return_or_raise
    raise self._result
  File "/venv/lib/python3.11/site-packages/joblib/parallel.py", line 741, in get_result
    result = self.job.get(timeout=timeout)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/lithops/multiprocessing/pool.py", line 252, in get
    self._value = self._executor.get_result(self._futures, timeout=timeout)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/lithops/executors.py", line 480, in get_result
    fs_done, _ = self.wait(
                 ^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/lithops/executors.py", line 448, in wait
    raise e
  File "/venv/lib/python3.11/site-packages/lithops/executors.py", line 423, in wait
    wait(fs=futures,
  File "/venv/lib/python3.11/site-packages/lithops/wait.py", line 160, in wait
    raise e
  File "/venv/lib/python3.11/site-packages/lithops/wait.py", line 141, in wait
    new_data = _get_executor_data(fs, executor_data, pbar=pbar,
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/lithops/wait.py", line 285, in _get_executor_data
    list(pool.map(get_result, fs_to_wait_on))
  File "/.pyenv/versions/3.11.4/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.4/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.4/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/.pyenv/versions/3.11.4/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/.pyenv/versions/3.11.4/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.11/site-packages/lithops/wait.py", line 278, in get_result
    f.result(throw_except=throw_except, internal_storage=exec_data.internal_storage)
  File "/venv/lib/python3.11/site-packages/lithops/future.py", line 324, in result
    self.status(throw_except=throw_except, internal_storage=internal_storage)
  File "/venv/lib/python3.11/site-packages/lithops/future.py", line 219, in status
    self._exception = pickle.loads(eval(self._call_status['exc_info']))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: unpickle_exception() takes 4 positional arguments but 7 were given

I've had a few different errors on the workers, which are separate issues I'm working through. Each seems to be a standard Python exception. This is an example:

        "2023-10-23T13:43:39.859882Z    stdout: [INFO] __main__.py:43 -- Lithops v3.0.2.dev0 - Starting IBM CF execution",
        "2023-10-23T13:43:39.860564Z    stderr: 2023-10-23 13:43:39,860 [INFO] handler.py:60 -- Tasks received: 1 - Worker processes: 1",
        "2023-10-23T13:43:39.958758Z    stderr: 2023-10-23 13:43:39,958 [DEBUG] ibm_cos.py:40 -- Creating IBM COS client",
        "2023-10-23T13:43:39.958796Z    stderr: 2023-10-23 13:43:39,958 [DEBUG] ibm_cos.py:54 -- Set IBM COS Endpoint to https://s3.private.eu-gb.cloud-object-storage.appdomain.cloud",
        "2023-10-23T13:43:39.958913Z    stderr: 2023-10-23 13:43:39,958 [DEBUG] ibm_cos.py:77 -- Using IBM API key for COS authentication",
        "2023-10-23T13:43:39.959352Z    stderr: 2023-10-23 13:43:39,959 [DEBUG] ibm_token_manager.py:132 -- Requesting new IAM token",
        "2023-10-23T13:43:40.096142Z    stderr: 2023-10-23 13:43:40,095 [ERROR] storage.py:66 -- An exception was produced trying to create the 'ibm_cos' storage backend",
        "2023-10-23T13:43:40.097995Z    stderr: Traceback (most recent call last):",
        "2023-10-23T13:43:40.098822Z    stderr: File \"/pythonAction/pythonrunner.py\", line 88, in run",
        "2023-10-23T13:43:40.098831Z    stderr: exec('fun = %s(param)' % self.mainFn, self.global_context)",
        "2023-10-23T13:43:40.098835Z    stderr: File \"<string>\", line 1, in <module>",
        "2023-10-23T13:43:40.098838Z    stderr: File \"__main__.py\", line 44, in main",
        "2023-10-23T13:43:40.098842Z    stderr: function_handler(args)",
        "2023-10-23T13:43:40.098846Z    stderr: File \"/action/lithops/worker/handler.py\", line 68, in function_handler",
        "2023-10-23T13:43:40.098850Z    stderr: internal_storage = InternalStorage(storage_config)",
        "2023-10-23T13:43:40.098853Z    stderr: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^",
        "2023-10-23T13:43:40.098857Z    stderr: File \"/action/lithops/storage/storage.py\", line 302, in __init__",
        "2023-10-23T13:43:40.098862Z    stderr: self.storage = Storage(storage_config=storage_config)",
        "2023-10-23T13:43:40.098866Z    stderr: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^",
        "2023-10-23T13:43:40.098870Z    stderr: File \"/action/lithops/storage/storage.py\", line 68, in __init__",
        "2023-10-23T13:43:40.098874Z    stderr: raise e",
        "2023-10-23T13:43:40.098961Z    stderr: File \"/action/lithops/storage/storage.py\", line 64, in __init__",
        "2023-10-23T13:43:40.098981Z    stderr: self.storage_handler = StorageBackend(self.config[self.backend])",
        "2023-10-23T13:43:40.098987Z    stderr: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^",
        "2023-10-23T13:43:40.098992Z    stderr: File \"/action/lithops/storage/backends/ibm_cos/ibm_cos.py\", line 95, in __init__",
        "2023-10-23T13:43:40.098996Z    stderr: token, expiry_time = token_manager.get_token()",
        "2023-10-23T13:43:40.098999Z    stderr: ^^^^^^^^^^^^^^^^^^^^^^^^^",
        "2023-10-23T13:43:40.099002Z    stderr: File \"/action/lithops/util/ibm_token_manager.py\", line 103, in get_token",
        "2023-10-23T13:43:40.099006Z    stderr: self.refresh_token()",
        "2023-10-23T13:43:40.099010Z    stderr: File \"/action/lithops/util/ibm_token_manager.py\", line 92, in refresh_token",
        "2023-10-23T13:43:40.099014Z    stderr: self._generate_new_token()",
        "2023-10-23T13:43:40.099017Z    stderr: File \"/action/lithops/util/ibm_token_manager.py\", line 134, in _generate_new_token",
        "2023-10-23T13:43:40.099021Z    stderr: self.token = auth.token_manager.get_token()",
        "2023-10-23T13:43:40.099025Z    stderr: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^",
        "2023-10-23T13:43:40.099029Z    stderr: File \"/usr/local/lib/python3.11/site-packages/ibm_cloud_sdk_core/token_managers/token_manager.py\", line 76, in get_token",
        "2023-10-23T13:43:40.099503Z    stderr: self.paced_request_token()",
        "2023-10-23T13:43:40.099512Z    stderr: File \"/usr/local/lib/python3.11/site-packages/ibm_cloud_sdk_core/token_managers/token_manager.py\", line 126, in paced_request_token",
        "2023-10-23T13:43:40.099516Z    stderr: token_response = self.request_token()",
        "2023-10-23T13:43:40.099519Z    stderr: ^^^^^^^^^^^^^^^^^^^^",
        "2023-10-23T13:43:40.099533Z    stderr: File \"/usr/local/lib/python3.11/site-packages/ibm_cloud_sdk_core/token_managers/iam_request_based_token_manager.py\", line 114, in request_token",
        "2023-10-23T13:43:40.099538Z    stderr: response = self._request(",
        "2023-10-23T13:43:40.099543Z    stderr: ^^^^^^^^^^^^^^",
        "2023-10-23T13:43:40.099545Z    stderr: File \"/usr/local/lib/python3.11/site-packages/ibm_cloud_sdk_core/token_managers/jwt_token_manager.py\", line 91, in _request",
        "2023-10-23T13:43:40.099563Z    stderr: raise ApiException(response.status_code, http_response=response)",
        "2023-10-23T13:43:40.099569Z    stderr: ibm_cloud_sdk_core.api_exception.ApiException: Error: User login from given IP address is not permitted., Code: 401",
        "2023-10-23T13:43:40.101Z       stderr: The action did not initialize or run as expected. Log data might be missing."
JosepSampe commented 12 months ago

I need to check. We have always had the same error handler in the local driver.

Based on this log:

TypeError: unpickle_exception() takes 4 positional arguments but 7 were given

Maybe something has changed in Python 3.11?

foster999 commented 12 months ago

I was thinking something similar. I'll check what comes back from the workers with a debugger tomorrow.

foster999 commented 12 months ago

One of the pickled worker errors comes back as:

b"\x80\x04\x95\x0b\x02\x00\x00\x00\x00\x00\x00\x8c\x08builtins\x94\x8c\x13ModuleNotFoundError\x94\x93\x94\x8c\x16tblib.pickling_support\x94\x8c\x12unpickle_exception\x94\x93\x94(h\x02\x8c\x1dNo module named \'engineering\'\x94\x85\x94Nh\x03\x8c\x12unpickle_traceback\x94\x93\x94\x8c\x05tblib\x94\x8c\x05Frame\x94\x93\x94)\x81\x94}\x94(\x8c\x08f_locals\x94}\x94\x8c\tf_globals\x94}\x94(\x8c\x08__name__\x94\x8c\x18lithops.worker.jobrunner\x94\x8c\x08__file__\x94\x8c#/action/lithops/worker/jobrunner.py\x94u\x8c\x06f_code\x94h\n\x8c\x04Code\x94\x93\x94)\x81\x94}\x94(\x8c\x0bco_filename\x94h\x16\x8c\x07co_name\x94\x8c\x03run\x94\x8c\x0bco_argcount\x94K\x00\x8c\x11co_kwonlyargcount\x94K\x00\x8c\x0bco_varnames\x94)\x8c\nco_nlocals\x94K\x00\x8c\x0cco_stacksize\x94K\x00\x8c\x08co_flags\x94K@\x8c\x0eco_firstlineno\x94K\x00ub\x8c\x08f_lineno\x94M\x10\x01ubK\xd2N\x87\x94R\x94N\x89Nt\x94R\x94}\x94\x8c\x04name\x94\x8c\x0bengineering\x94sbh(\x87\x94."

But if I reproduce the error locally and pickle it I get:

b"\x80\x04\x95c\x00\x00\x00\x00\x00\x00\x00\x8c\x08builtins\x94\x8c\x13ModuleNotFoundError\x94\x93\x94\x8c\x1eNo module named 'engineering'\x94\x85\x94R\x94}\x94\x8c\x04name\x94\x8c\x0cengineering\x94sb."

The local variant is much shorter and unpickles without issue, while the one from the remote worker is longer and raises the same unpickle_exception() error as above.

foster999 commented 12 months ago

Sorry, I've just seen that you actually pickle sys.exc_info() rather than the Exception object. When I try this locally I get TypeError: cannot pickle 'traceback' object
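For reference, this is straightforward to reproduce locally with only the standard library; a minimal sketch (the 'engineering' module name here is just the one from the original traceback):

```python
import pickle
import sys

# Mimic what the lithops worker does: capture sys.exc_info() so the
# exception (and its traceback) can be shipped back to the driver
try:
    raise ModuleNotFoundError("No module named 'engineering'")
except ModuleNotFoundError:
    exc_info = sys.exc_info()

# Without tblib's pickling support installed, the traceback object
# in the tuple is not picklable
try:
    pickle.dumps(exc_info)
    pickled_ok = True
except TypeError as err:
    pickled_ok = False
    msg = str(err)

print(msg)  # cannot pickle 'traceback' object
```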

foster999 commented 12 months ago

Realised that you need to install the pickling helper from tblib. So now, when I reproduce the error following the same approach as the runner, I get:

b"\x80\x04\x95\xd4\x01\x00\x00\x00\x00\x00\x00\x8c\x08builtins\x94\x8c\x13ModuleNotFoundError\x94\x93\x94\x8c\x16tblib.pickling_support\x94\x8c\x12unpickle_exception\x94\x93\x94(h\x02\x8c\x1eNo module named 'engineeringg'\x94\x85\x94Nh\x03\x8c\x12unpickle_traceback\x94\x93\x94\x8c\x05tblib\x94\x8c\x05Frame\x94\x93\x94)\x81\x94}\x94(\x8c\x08f_locals\x94}\x94\x8c\tf_globals\x94}\x94\x8c\x08__name__\x94\x8c\x08__main__\x94s\x8c\x06f_code\x94h\n\x8c\x04Code\x94\x93\x94)\x81\x94}\x94(\x8c\x0bco_filename\x94\x8c\x07<stdin>\x94\x8c\x07co_name\x94\x8c\x08<module>\x94\x8c\x0bco_argcount\x94K\x00\x8c\x11co_kwonlyargcount\x94K\x00\x8c\x0bco_varnames\x94)\x8c\nco_nlocals\x94K\x00\x8c\x0cco_stacksize\x94K\x00\x8c\x08co_flags\x94K@\x8c\x0eco_firstlineno\x94K\x00ub\x8c\x08f_lineno\x94K\x05ubK\x02N\x87\x94R\x94t\x94R\x94}\x94\x8c\x04name\x94\x8c\x0cengineeringg\x94sbh'\x87\x94."

Which unpickles fine
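The round trip described above can be sketched as follows, assuming tblib is installed; the engineeringg module name is just a stand-in for any missing module:

```python
import pickle
import sys

import tblib.pickling_support

# Register tblib's reducers so traceback objects become picklable,
# as the lithops runner does
tblib.pickling_support.install()

try:
    import engineeringg  # hypothetical module that does not exist
except ModuleNotFoundError:
    # With tblib installed, the full exc_info tuple pickles fine
    payload = pickle.dumps(sys.exc_info())

# ...and unpickles on the other side with the traceback preserved
exc_type, exc_value, exc_tb = pickle.loads(payload)
print(exc_type.__name__)  # ModuleNotFoundError
```

The catch, as it turned out below, is that both ends must agree on tblib's pickling format for the unpickle step to work.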

JosepSampe commented 11 months ago

@foster999 @sergii-mamedov Do you know a way to reproduce this issue? I've been trying multiple things, but the exceptions are correctly handled on my side:

2023-10-26 17:55:05,246 [WARNING] future.py:249 -- ExecutorID a5d9de-0 | JobID A000 - There was an exception - Activation ID: b1cd2d434d9543498d2d434d9543491c
Traceback (most recent call last):
  File "/action/lithops/worker/jobrunner.py", line 236, in run
  File "/home/josep/dev-workspace/lithops/lithops/scripts/cli.py", line 197, in hello
    import oss2
ModuleNotFoundError: No module named 'oss2'
JosepSampe commented 11 months ago

I did more experiments and I think the problem is a version mismatch of tblib between the cloud runtime and the local driver. They released a new version 4 days ago, so the runtime probably contains the most recent version while your local driver still has the previous one. Running python3.X -m pip install -U tblib on my localhost resolved the issue
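A minimal illustration of why a version mismatch produces exactly the reported TypeError (hypothetical signatures for the sketch, not the real tblib source):

```python
# Suppose the older, driver-side reconstructor accepts 4 arguments:
def unpickle_exception(func, args, cause, tb):
    exc = func(*args)
    exc.__cause__ = cause
    exc.__traceback__ = tb
    return exc

# ...while a newer worker-side tblib pickles extra exception state
# (e.g. context, suppress_context, notes), producing a 7-element
# argument tuple in the pickled payload:
newer_payload = (ValueError, ("boom",), None, None, None, False, None)

# Unpickling on the old driver then calls the 4-argument function
# with 7 arguments:
try:
    unpickle_exception(*newer_payload)
except TypeError as err:
    mismatch_msg = str(err)

print(mismatch_msg)  # unpickle_exception() takes 4 positional arguments but 7 were given
```

Upgrading tblib on both sides makes the pickled payload and the reconstructor agree again.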

foster999 commented 11 months ago

I did more experiments and I think the problem is a version mismatch of tblib between the cloud runtime and the local driver. They released a new version 4 days ago, so the runtime probably contains the most recent version while your local driver still has the previous one. Running python3.X -m pip install -U tblib on my localhost resolved the issue

Thanks for persisting @JosepSampe, that worked for me!

sergii-mamedov commented 11 months ago

I had a pinned version of tblib in my Docker image. Now all good. Thanks @JosepSampe