kserve / kserve

Standardized Serverless ML Inference Platform on Kubernetes
https://kserve.github.io/website/
Apache License 2.0

Alibi Example explainer fails at startup with asyncio KeyError: '7 is not registered' #1707

Closed karlschriek closed 3 years ago

karlschriek commented 3 years ago

/kind bug

What steps did you take and what happened:

I prepared a small Alibi example based on what is described here: https://github.com/kubeflow/kfserving/blob/master/docs/samples/explanation/alibi/imagenet/README.md and used this YAML spec (https://github.com/kubeflow/kfserving/blob/master/docs/samples/explanation/alibi/imagenet/imagenet.yaml), which differs somewhat from what is in the README. Note, though, that the exact same issue occurs when using the (older) spec from the README.

I modified the Python code in the example ever so slightly, being careful not to change any of the core functionality. For reference, the code I am using is here: https://gist.github.com/karlschriek/07030528a232d7c145556e1fd0fa3442, but I do not think it plays any role here.

Calling predict(predict_url, cookies=cookies, headers=headers_predict, image_path=daisy_path) results, as expected, in the following:

[screenshot: successful prediction response from the predictor]

However, explain(explain_url, cookies=cookies, headers=headers_explain, image_path=daisy_path) results in:

Calling  https://imagenet-explainer-default-karl-schriek.serving.dev-kfserving-test-5.build-mlops-2.com/v1/models/imagenet:explain
Received response code and content 500 b'<html><title>500: Internal Server Error</title><body>500: Internal Server Error</body></html>'
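For reference, a minimal sketch of what my `predict`/`explain` helpers do (the helper name and the exact payload shape here are my simplification of the gist, not verbatim from it): both verbs hit the same V1 inference protocol path, `/v1/models/<name>:<verb>`, only the verb differs.

```python
import base64
import json

def build_request(base_url, model_name, verb, image_bytes):
    # V1 inference protocol path: /v1/models/<model>:predict or :explain
    url = f"{base_url}/v1/models/{model_name}:{verb}"
    # Payload shape is an assumption for illustration; the gist sends the
    # decoded image data under "instances" as the protocol expects.
    payload = {"instances": [{"b64": base64.b64encode(image_bytes).decode("utf-8")}]}
    return url, json.dumps(payload)

# Hypothetical usage with a placeholder host and dummy bytes:
url, body = build_request("https://example.com", "imagenet", "explain", b"\x89PNG")
print(url)
```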

Upon closer inspection, even though the explainer Pod comes online and reports a healthy status, it has in fact already failed at startup with the following error:

2021-07-07 05:49:54.458259: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-07-07 05:49:54.458297: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
/usr/local/lib/python3.7/site-packages/ray/autoscaler/_private/cli_logger.py:61: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
  "update your install command.", FutureWarning)
RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd
[I 210707 05:49:56 font_manager:1443] generated new fontManager
[I 210707 05:49:56 parser:224] Extra args: {}
[I 210707 05:49:56 storage:50] Copying contents of /mnt/models to local
[I 210707 05:49:56 __main__:39] Loading Alibi model
[I 210707 05:49:56 explainer:50] Predict URL set to imagenet-predictor-default.karl-schriek
[I 210707 05:49:56 kfserver:151] Registering model: imagenet
[I 210707 05:49:56 kfserver:121] Setting asyncio max_workers as 5
[I 210707 05:49:56 kfserver:128] Listening on port 8080
[I 210707 05:49:56 kfserver:130] Will fork 0 workers
[I 210707 05:49:56 process:123] Starting 8 processes
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/asyncio/selector_events.py", line 256, in _add_reader
    key = self._selector.get_key(fd)
  File "/usr/local/lib/python3.7/selectors.py", line 192, in get_key
    raise KeyError("{!r} is not registered".format(fileobj)) from None
KeyError: '7 is not registered'
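The final KeyError comes straight from the stdlib `selectors` module: asking a selector for the key of a file descriptor it never registered raises exactly this error. My reading (an assumption, not confirmed from the kfserving code) is that after the server forks its worker processes, a child inherits fd 7 from the parent's already-created event loop, but the child's own selector has no registration for it. A minimal reproduction of the error message itself:

```python
import selectors
import socket

# get_key() on a file object that was never register()-ed raises
# KeyError("<fileobj> is not registered"), the same message as in the
# traceback above (there the fileobj is the bare fd 7).
sel = selectors.DefaultSelector()
sock = socket.socket()
try:
    sel.get_key(sock)
    msg = ""
except KeyError as exc:
    msg = str(exc)
print(msg)
sel.close()
sock.close()
```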

What did you expect to happen:

I would expect the default example to come online without any issues; failing that, I would at least expect the Pod to report a failed status rather than appear healthy.

Environment:

yuzisun commented 3 years ago

@karlschriek This is most likely an issue with AsyncIO and multiprocessing; could you set workers to 1?

karlschriek commented 3 years ago

I was eventually able to solve this by using "gs://seldon-models/tfserving/imagenet/explainer" instead.