lithops-cloud / lithops

A multi-cloud framework for big data analytics and embarrassingly parallel jobs that provides a universal API for building parallel applications in the cloud ☁️🚀
http://lithops.cloud
Apache License 2.0

Intermittent errors in ibm_vpc create mode #545

Closed LachlanStuart closed 3 years ago

LachlanStuart commented 3 years ago

The following errors happen to me intermittently when using the ibm_vpc backend in create mode.

Error: Internal server error, Code: 500

Logs: https://gist.github.com/LachlanStuart/69bd5db726ffd4da97f4cdb26d770607 This request was retried and execution continued successfully. However, I'd prefer not to see this error at all: logged errors mean something went wrong, and they make me worry that something might not have been initialized or cleaned up properly. If server errors are expected here, it would be best to log the error at the DEBUG level when retrying (and explicitly say that it will retry), and only at the ERROR level once the number of retries is too high (e.g. >5 attempts).
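
To illustrate, the retry logging I have in mind would look roughly like the sketch below (the call_with_retries helper and the MAX_RETRIES threshold are hypothetical, not actual Lithops code):

    import logging

    logger = logging.getLogger(__name__)

    MAX_RETRIES = 5  # hypothetical threshold

    def call_with_retries(request_fn):
        # Hypothetical helper: retry a flaky API call, staying quiet until retries run out
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                return request_fn()
            except Exception as err:
                if attempt < MAX_RETRIES:
                    # Expected transient failure: log at DEBUG and say we will retry
                    logger.debug('API call failed (%s), retrying (%d/%d)', err, attempt, MAX_RETRIES)
                else:
                    # Only surface an ERROR once we have given up
                    logger.error('API call failed after %d attempts: %s', MAX_RETRIES, err)
                    raise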

Error with DisassociateFloatingip Virtual Instance request, Code: 500

Logs: https://gist.github.com/LachlanStuart/c04a3dda3f83b849b0807606ca325715 This happens most times I call fexec.dismantle(). After this error, the floating IP was deleted but the instance is still running. It eventually .

Error: VSI not found, Code: 404 when calling dismantle()

Logs: https://gist.github.com/LachlanStuart/3dd74ac60c9ca85980031a3b865cbe32 This happened when fexec.dismantle() was called after the instance had already automatically soft-dismantled.

Error: VSI not found, Code: 404 when a job times out

Logs: https://gist.github.com/LachlanStuart/49a251c73d64d3e66ec3edb97f7c5ccd Could be the same as the above issue. This happened while waiting for fexec.get_result() - I think several tasks timed out, triggering the soft dismantle timeout.

OSError: [Errno 24] Too many open files

runner.log file from VPC: https://gist.github.com/LachlanStuart/3542bd2c434b0e5ec308a182ef2b8854

This seems to happen both in create and consume mode. I don't think I got this error with Lithops 2.2.14, but it happens frequently with 2.2.16. This error prevents the status file from being written, so the host computer eventually just times out.

LachlanStuart commented 3 years ago

I've added logs from an OSError: [Errno 24] Too many open files error. It's not strictly a create-mode issue, but it seems to have been introduced with the create-mode changes, so I think it's still relevant to this issue.

JosepSampe commented 3 years ago

@LachlanStuart These errors regarding VPC API calls will be fixed in #548

Regarding these exceptions, note that they are caught by Lithops; however, the ibm_cloud_sdk_core lib always prints this minimal exception traceback:

Traceback (most recent call last):
  File "/home/lachlan/miniconda3/envs/sm38/lib/python3.8/site-packages/ibm_cloud_sdk_core/base_service.py", line 246, in send
    raise ApiException(
ibm_cloud_sdk_core.api_exception.ApiException: Error: Internal server error, Code: 500

This is caused by this line. I don't know why, but they use the base logging lib to log the exception (on the root logger) and then call raise:

except ApiException as err:
    logging.exception(err.message)
    raise

instead of simply doing this:

except ApiException as err:
    raise err

I still haven't found a way to disable this annoying print.
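
One workaround that might work (an untested sketch, not something Lithops does today) is to attach a filter to the root logger that drops records emitted from inside ibm_cloud_sdk_core:

    import logging

    class SuppressIbmSdkExceptionLogs(logging.Filter):
        # Drop records that the SDK logs directly on the root logger from base_service.py
        def filter(self, record):
            return 'ibm_cloud_sdk_core' not in record.pathname

    logging.getLogger().addFilter(SuppressIbmSdkExceptionLogs())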

Maybe related issue: https://github.com/IBM/python-sdk-core/issues/60

LachlanStuart commented 3 years ago

@JosepSampe Thanks for looking into it. I see Gil re-raised it in the issue you linked and they've removed the logging in the latest version. Thanks @gilv too.

JosepSampe commented 3 years ago

Just tested the new ibm_cloud_sdk_core version, and no more annoying messages are printed.

#548 is ready to test.

gilv commented 3 years ago

@JosepSampe does Lithops force the use of the new ibm-cloud-sdk-core version, or does it need to be installed manually by the user? I think Lithops doesn't use the latest ibm-cloud-sdk-core, since Lithops uses ibm-vpc, and its requirements file https://github.com/IBM/vpc-python-sdk/blob/master/requirements.txt doesn't pull in the latest one. Maybe we should add ibm-cloud-sdk-core >= 3.5.1 to Lithops?

JosepSampe commented 3 years ago

Theoretically, new installations will use the new version; those of us who already have it installed will have to update it manually.
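
For anyone updating manually, something like this should work:

    pip install --upgrade ibm-cloud-sdk-core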

JosepSampe commented 3 years ago

@LachlanStuart #548 is already merged in master branch and ready to use. Any feedback/suggestions will be much appreciated.

I now want to tackle the OSError: [Errno 24] Too many open files errors produced in runner.py. Just a quick question: is this error produced when you submit a huge number of invocations to the same VM? Does it happen even with a single call_async()? Or does it happen after multiple map()s on the same VM?

LachlanStuart commented 3 years ago

@JosepSampe I've seen it with both 1 and 16 invocations submitted via map(). I don't think I've seen it after running multiple jobs in the same VM, but most of my testing was with Lithops v2.2.16, which never runs more than 1 job per VM when in "create" mode.

It's strangely intermittent. I've seen it 6 times out of approximately 40 invocations: 3 times across 2 separate 16-invocation calls, 2 times when testing a 1-invocation call in "create" mode, and 1 time when testing a 1-invocation call in "consume" mode. All tests used a very simple function: def foo(i): return i + 3.
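
For reference, my test is essentially the sketch below (backend settings come from my Lithops config file, so treat the executor arguments as approximate):

    import lithops

    def foo(i):
        return i + 3

    # Approximate reproduction: a 16-invocation map on the ibm_vpc standalone backend
    fexec = lithops.FunctionExecutor(backend='ibm_vpc')
    futures = fexec.map(foo, range(16))
    print(fexec.get_result(futures))
    fexec.dismantle()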

gilv commented 3 years ago

@JosepSampe isn't "too many open files" related to Python itself? The way to resolve this is to raise the limit to a higher level, no?

    import resource
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Raising the soft limit. Hard limits can be raised only by sudo users
    resource.setrlimit(resource.RLIMIT_NOFILE, (10000, hard))
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

JosepSampe commented 3 years ago

The Standalone code has changed a lot since this issue, so we can assume this is already fixed. If there is another problem, we can open a separate issue.