Closed LachlanStuart closed 3 years ago
I've added logs from an OSError: [Errno 24] Too many open files error. It's not strictly a create-mode issue, but it seems to have been introduced with the create-mode changes, so I think it's still relevant to this issue.
@LachlanStuart These errors regarding VPC API calls will be fixed in #548
Regarding these exceptions, note that they are caught by Lithops; however, the ibm_cloud_sdk_core lib always prints this minimal exception:
Traceback (most recent call last):
  File "/home/lachlan/miniconda3/envs/sm38/lib/python3.8/site-packages/ibm_cloud_sdk_core/base_service.py", line 246, in send
    raise ApiException(
ibm_cloud_sdk_core.api_exception.ApiException: Error: Internal server error, Code: 500
This is caused by this line. I don't know why, but they use the base logging lib to print the exception (in root logging) and then they call raise
except ApiException as err:
    logging.exception(err.message)
    raise
instead of simply doing this:
except ApiException as err:
    raise err
I still haven't found a way to disable this annoying print
Maybe related issue: https://github.com/IBM/python-sdk-core/issues/60
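One possible workaround, offered only as a sketch: because the SDK logs through the root logger, a filter attached to the root logger can drop records that originate from a given source file. The class name DropRecordsFromFile is hypothetical, and the file name base_service.py is taken from the traceback above; this also hides any other root-logger message emitted from that file.

```python
import logging


class DropRecordsFromFile(logging.Filter):
    """Reject log records whose source file has the given base name."""

    def __init__(self, filename):
        super().__init__()
        self.filename = filename

    def filter(self, record):
        # record.filename is the base name of the file that made the logging call
        return record.filename != self.filename


# Assumption: the unwanted message is logged directly on the root logger
# from base_service.py, as the traceback above suggests.
logging.getLogger().addFilter(DropRecordsFromFile("base_service.py"))
```

Logger-level filters only apply to records logged directly on that logger, which is the case here since the SDK calls logging.exception() on the root logger.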
@JosepSampe Thanks for looking into it. I see Gil re-raised it in the issue you linked and they've removed the logging in the latest version. Thanks @gilv too.
Just tested the new ibm_cloud_sdk_core version, and no more annoying messages are printed.
@JosepSampe does Lithops force the use of the new ibm-cloud-sdk-core version, or does it need to be installed manually by the user?
I think Lithops doesn't use the latest ibm-cloud-sdk-core, since Lithops uses ibm-vpc, and its requirements file (https://github.com/IBM/vpc-python-sdk/blob/master/requirements.txt) doesn't require the latest one. Maybe we should add ibm-cloud-sdk-core >= 3.5.1 to Lithops?
Theoretically, new installations will use the new version, and those of us who already have it installed will have to update it manually.
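For anyone unsure whether their existing environment needs the manual update, a quick check could look like the sketch below. The helper names are hypothetical, the 3.5.1 minimum is the version suggested above, and the parser assumes a plain dotted release string (a pre-release suffix such as rc1 would make it fail).

```python
from importlib import metadata


def version_tuple(version):
    """Parse a plain dotted release string such as '3.5.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split(".")[:3])


def needs_upgrade(installed, minimum="3.5.1"):
    """Return True if the installed version is older than the minimum."""
    return version_tuple(installed) < version_tuple(minimum)


def check_sdk_core(dist_name="ibm-cloud-sdk-core"):
    """Return True/False for an installed package, or None if it is not installed."""
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
    return needs_upgrade(installed)


if __name__ == "__main__":
    print("upgrade needed:", check_sdk_core())
```

If this reports that an upgrade is needed, pip install --upgrade ibm-cloud-sdk-core should bring in the newer release.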
@LachlanStuart #548 is already merged into the master branch and ready to use. Any feedback/suggestions will be much appreciated.
I now want to tackle the OSError: [Errno 24] Too many open files errors produced in runner.py.
Just a quick question: Is this error produced when you submit a huge number of invocations to the same VM? Does it happen even if you invoke a single call_async()? Or does it happen after multiple map() calls in the same VM?
@JosepSampe I've seen it with both 1 and 16 invocations submitted via map(). I don't think I've seen it after running multiple jobs in the same VM, but most of my testing was with Lithops v2.2.16, which never runs more than 1 job per VM when in "create" mode.
It's strangely intermittent. I've seen it 6 times out of approximately 40 invocations - 3 times across 2 separate 16-invocation calls, 2 times when testing a 1-invocation call in "create" mode, and 1 time when testing a 1-invocation call in "consume" mode. All tests were using a very simple function: def foo(i): return i + 3.
@JosepSampe isn't "Too many open files" related to the Python process itself? The way to resolve this is to raise the limit to a higher level, no?
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
# Raise the soft limit, capped at the hard limit.
# Only privileged users can raise the hard limit itself.
target = 10000 if hard == resource.RLIM_INFINITY else min(10000, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
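Since the error is intermittent, it might also help to record how many descriptors the process actually has open before raising the limit, to tell a leak apart from a limit that is simply too low. This is a minimal, Linux-only sketch that assumes /proc is mounted; the function name open_fd_count is hypothetical.

```python
import os


def open_fd_count(proc_fd_dir="/proc/self/fd"):
    """Return the number of open file descriptors, or None if /proc is unavailable.

    The count includes the descriptor temporarily used to list the directory.
    """
    try:
        return len(os.listdir(proc_fd_dir))
    except OSError:
        return None


if __name__ == "__main__":
    print("open file descriptors:", open_fd_count())
```

Logging this value periodically alongside the soft limit would show whether the count keeps growing between invocations.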
The Standalone code has changed a lot since this issue was opened, so we can assume this is already fixed. If there is another problem, we can open a separate issue.
The following errors happen to me intermittently when using the ibm_vpc backend in create mode.

Error: Internal server error, Code: 500
Logs: https://gist.github.com/LachlanStuart/69bd5db726ffd4da97f4cdb26d770607
This request retried and execution continued successfully. However, I'd prefer to not see this error at all, as logged errors mean something went wrong, and they make me worry that something might not have been initialized or cleaned up properly. If server errors are expected here, it would be best to log the error at DEBUG level when retrying (and specifically say that it will retry), and only at the ERROR log level once the number of retries is too high (e.g. >5 attempts).

Error with DisassociateFloatingip Virtual Instance request, Code: 500
Logs: https://gist.github.com/LachlanStuart/c04a3dda3f83b849b0807606ca325715
This happens most times I call fexec.dismantle(). After this error, the floating IP was deleted but the instance is still running. It eventually .

Error: VSI not found, Code: 404 when calling dismantle()
Logs: https://gist.github.com/LachlanStuart/3dd74ac60c9ca85980031a3b865cbe32
This happened when fexec.dismantle() was called after the instance had already automatically soft-dismantled.

Error: VSI not found, Code: 404 when a job times out
Logs: https://gist.github.com/LachlanStuart/49a251c73d64d3e66ec3edb97f7c5ccd
Could be the same as the above issue. This happened while waiting for fexec.get_result() - I think several tasks timed out, triggering the soft dismantle timeout.

OSError: [Errno 24] Too many open files
runner.log file from VPC: https://gist.github.com/LachlanStuart/3542bd2c434b0e5ec308a182ef2b8854
This seems to happen both in create and consume mode. I don't think I got this error with Lithops 2.2.14, but it happens frequently with 2.2.16. This error prevents the status file from being written, so the host computer eventually just times out.
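
The logging suggestion made above for the Internal server error case (DEBUG while retrying, ERROR only once retries are exhausted) could be sketched roughly as follows. This is not Lithops code: call_with_retries, the retry_sketch logger name, and the default of 5 attempts are assumptions for illustration.

```python
import logging
import time

logger = logging.getLogger("retry_sketch")  # hypothetical logger name


def call_with_retries(func, max_attempts=5, delay=0.0, retry_on=(Exception,)):
    """Call func(), retrying on failure.

    Intermediate failures are logged at DEBUG and state that a retry follows;
    only the final failure is logged at ERROR before the exception is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on as err:
            if attempt == max_attempts:
                logger.error("Giving up after %d attempts: %s", attempt, err)
                raise
            logger.debug(
                "Attempt %d/%d failed (%s); retrying in %.1fs",
                attempt, max_attempts, err, delay,
            )
            time.sleep(delay)
```

With this shape, a request that succeeds on a later attempt leaves nothing at ERROR level in the logs, while a request that keeps failing still surfaces clearly.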