Open davidspek opened 4 years ago
After applying a fix to allow Kale to compile, upload and run pipelines in a multi-user environment
I believe the fix you've implemented is not enough (as I commented in the other issue, it is also not secure, but it can work for you). Let me explain.
For starters, the two errors are different from one another.
The first error has to do with what the request is doing. I understand that you list experiments without selecting a namespace. The client you are using doesn't automatically set the namespace against which it will be performing the requests, and (if you haven't edited Kale) Kale's use of the KFP client doesn't do this either.
The second one is a permission error raised by a step which tries to submit itself to MLMD. The steps need permissions to get their workflow. Try updating the permissions of the service account default-editor (that's the one your pods use) accordingly.
We should (and will) handle this, thanks for reporting!
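To illustrate the first error described above (listing experiments without a namespace), here is a minimal sketch. It assumes a KFP SDK version whose `Client.list_experiments` accepts a `namespace` keyword; the helper name `list_experiment_names` is hypothetical and not part of Kale.

```python
def list_experiment_names(client, namespace):
    """Return experiment names, passing the namespace explicitly.

    Assumption: `client` behaves like kfp.Client and its
    list_experiments() accepts a `namespace` keyword argument.
    """
    if not namespace:
        # In a multi-user deployment the API server rejects an empty namespace,
        # so fail early with a clearer message.
        raise ValueError("namespace must be a non-empty string")
    response = client.list_experiments(namespace=namespace)
    # `experiments` can be None when the namespace has no experiments yet.
    return [e.name for e in (response.experiments or [])]
```

With a real client this would be called as `list_experiment_names(kfp.Client(), "admin")`, where `admin` is the profile namespace used in this thread.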
@elikatsis Thank you for your quick and detailed responses. Regarding the security of the workaround I implemented, I hope a proper solution will be implemented soon, but for my PoC cluster I don't think this is a very large issue (not having it work is more of an issue).
About the first error. I was just about to try editing kfp.py and added namespace to 2 functions:
def get_experiment(request, experiment_name):
    """Get a KFP experiment. If it does not exist return None."""
    client = _get_client()
    try:
        experiment = client.get_experiment(experiment_name=experiment_name)
    except ValueError as e:
        err_msg = "No experiment is found with name {}".format(experiment_name)
        if err_msg in str(e):
            return None
        else:
            # Unexpected exception
            raise
    except TypeError as e:
        # In case the installed KFP client does not contain the following fix:
        # https://github.com/kubeflow/pipelines/pull/4177
        err_msg = "'NoneType' object is not iterable"
        if err_msg in str(e):
            return None
        raise
    return {"id": experiment.id, "name": experiment.name, "namespace": namespace}

def create_experiment(request, experiment_name, raise_if_exists=False):
    """Create a new experiment."""
    client = _get_client()
    exp = get_experiment(None, experiment_name)
    if not exp:
        experiment = client.create_experiment(name=experiment_name)
        return {"id": experiment.id, "name": experiment.name, "namespace": namespace}
    if raise_if_exists:
        raise ValueError("Failed to create experiment, experiment already"
                         " exists.")
However, looking through the KFP docs I see there are functions get_user_namespace and set_user_namespace. Am I correct in thinking that get_user_namespace should be added to kfp.py and that during the creation of the pipeline Kale should use set_user_namespace?
The problem with the request is not returning the namespace, but setting it in the first place.
Yes, set_user_namespace() is a good choice, but it has this: https://github.com/kubeflow/pipelines/blob/d4e73989170cf5e44b2fe1064a904003c3c5a7ff/sdk/python/kfp/_client.py#L274-L276
self._context_setting['namespace'] = namespace
with open(Client.LOCAL_KFP_CONTEXT, 'w') as f:
    json.dump(self._context_setting, f)
which will raise an error if this path does not exist.
You could set client._context_setting['namespace'] directly on client initialization.
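The suggestion above can be sketched as follows. This is only an illustration: it relies on the private `_context_setting` attribute shown in the linked snippet, which may change between KFP releases, and the helper name `set_namespace_in_memory` is hypothetical.

```python
def set_namespace_in_memory(client, namespace):
    """Set the namespace on a KFP client object without writing to disk.

    Assumption: `client` is a kfp.Client whose private `_context_setting`
    attribute is a dict, as in the snippet linked above.
    """
    context = getattr(client, "_context_setting", None)
    if context is None:
        # Defensive fallback in case the attribute is missing or unset.
        context = {}
        client._context_setting = context
    # Unlike set_user_namespace(), nothing is written to the context file,
    # so a missing config directory cannot cause an error here.
    context["namespace"] = namespace
    return client
```

In Kale this could be applied right after the client is created, for example `client = set_namespace_in_memory(kfp.Client(), "admin")`.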
@elikatsis I did some testing and I think I solved the issue with the namespaces for a multi-user environment. The Select Experiment setting now correctly displays existing experiments and after clicking Compile and Run it correctly shows information:
I am still busy trying out some optimizations for this code (checking if the namespace has already been set), but the main functionality is working. However, I am still getting the MLMD permission error, though I have updating the permissions yet as it is getting late.
That's good progress!
I am still getting the MLMD permission error, though I have updating the permissions yet as it is getting late.
What did you do about this? What permissions did you update and how?
Sorry I made a typo, I have not yet tried updating the permissions. Is it possible that the MLMD permission error is due to me using ml-metadata 0.24.0 instead of 0.23.0? I was planning to dig into the code to see if I can fix the issue properly in a bit.
@elikatsis @StefanoFioravanzo I have gotten a bit further. After applying the RoleBinding below for the namespace admin, the created workflow no longer gets an error while getting the workflow.
cat <<EOF | kubectl apply -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: allow-workflow-nb-admin
  namespace: admin
subjects:
- kind: ServiceAccount
  name: default-editor
  namespace: admin
roleRef:
  kind: ClusterRole
  name: argo
  apiGroup: rbac.authorization.k8s.io
EOF
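For profiles other than admin, the same manifest can be generated per namespace. The sketch below only rebuilds the RoleBinding above as a Python dict; the function name `workflow_rolebinding` and the derived binding name are hypothetical, and it assumes the `argo` ClusterRole exists in the cluster as it does in this setup.

```python
import json


def workflow_rolebinding(namespace, service_account="default-editor",
                         cluster_role="argo"):
    """Build a RoleBinding manifest like the one above for any namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {
            # Hypothetical naming scheme, mirroring allow-workflow-nb-admin.
            "name": "allow-workflow-nb-{}".format(namespace),
            "namespace": namespace,
        },
        "subjects": [{
            "kind": "ServiceAccount",
            "name": service_account,
            "namespace": namespace,
        }],
        "roleRef": {
            "kind": "ClusterRole",
            "name": cluster_role,
            "apiGroup": "rbac.authorization.k8s.io",
        },
    }


if __name__ == "__main__":
    # kubectl also accepts JSON, so the output can be piped to
    # `kubectl apply -f -`.
    print(json.dumps(workflow_rolebinding("admin"), indent=2))
```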
However, now I have the following error:
2020-09-23 08:53:15 Kale mlmdutils:88 [INFO] ---------- Initializing MLMD context... ----------
2020-09-23 08:53:15 Kale mlmdutils:89 [INFO] Connecting to MLMD...
2020-09-23 08:53:15 Kale mlmdutils:91 [INFO] Successfully connected to MLMD
2020-09-23 08:53:15 Kale mlmdutils:92 [INFO] Getting step details...
2020-09-23 08:53:15 Kale mlmdutils:93 [INFO] Getting pod name...
2020-09-23 08:53:15 Kale mlmdutils:95 [INFO] Successfully retrieved pod name: mzquality-test-o215f-4p29m-3194236941
2020-09-23 08:53:15 Kale mlmdutils:96 [INFO] Getting pod namespace...
2020-09-23 08:53:15 Kale mlmdutils:98 [INFO] Successfully retrieved pod namespace: admin
2020-09-23 08:53:15 Kale mlmdutils:100 [INFO] Getting pod...
2020-09-23 08:53:15 Kale mlmdutils:102 [INFO] Successfully retrieved pod
2020-09-23 08:53:15 Kale mlmdutils:103 [INFO] Getting workflow name from pod...
2020-09-23 08:53:15 Kale mlmdutils:106 [INFO] Successfully retrieved workflow name: mzquality-test-o215f-4p29m
2020-09-23 08:53:15 Kale mlmdutils:108 [INFO] Getting workflow...
2020-09-23 08:53:15 Kale mlmdutils:111 [INFO] Successfully retrieved workflow
2020-09-23 08:53:15 Kale mlmdutils:116 [INFO] Successfully retrieved KFP run ID: 1a14a8ed-37ef-49ff-85ee-7a5777a360d2
2020-09-23 08:53:15 Kale mlmdutils:123 [INFO] Successfully retrieved KFP pipeline_name: mzquality-test-o215f
2020-09-23 08:53:15 Kale podutils:343 [INFO] Computing component ID for pod admin/mzquality-test-o215f-4p29m-3194236941...
2020-09-23 08:53:15 Kale podutils:354 [INFO] Computed component ID: Install mzquality@sha256=0c770aed28976cd171960a69b318c3d8c6fe0c4cae930043c3b9a6a0bd379f86
2020-09-23 08:53:15 Kale mlmdutils:136 [INFO] Failed to retrieve execution hash. Generating random string...: x67u2gxtt9
2020-09-23 08:53:15 Kale mlmdutils:258 [INFO] Creating context 'mzquality-test-o215f-4p29m' of type 'KfpRun'...
2020-09-23 08:53:15 Kale mlmdutils:273 [INFO] Context already exists
2020-09-23 08:53:15 Kale mlmdutils:274 [INFO] ContextType ID: 6 - Context ID: 14
2020-09-23 08:53:15 Kale mlmdutils:222 [INFO] Creating execution of type 'Install mzquality@sha256=0c770aed28976cd171960a69b318c3d8c6fe0c4cae930043c3b9a6a0bd379f86'...
2020-09-23 08:53:15 Kale mlmdutils:230 [INFO] Successfully created execution
2020-09-23 08:53:15 Kale mlmdutils:231 [INFO] ExecutionType ID: 15 - Execution ID: 12
2020-09-23 08:53:15 Kale mlmdutils:142 [INFO] ---------- Successfully initialized MLMD context ----------
2020-09-23 08:53:15 Kale jputils:241 [INFO] ---------- Running user code... ----------
2020-09-23 08:53:17 Kale marshalling [ERROR]
During data passing, Kale experienced an error.
The error was:
During data passing, Kale could not load the following file:
- name: 'NULL'
The error was: No file or folder was found with the requested name.
Please help us improve Kale by opening a new issue at https://github.com/kubeflow-kale/kale/issues.
Please help us improve Kale by opening a new issue at https://github.com/kubeflow-kale/kale/issues.
2020-09-23 08:53:17 Kale jputils:219 [ERROR] Received a KaleGracefulExit exception. Exiting...
2020-09-23 08:53:17 Kale jputils:299 [ERROR] ---------- Failed to run user code ----------
It seems like this might be due to the workflow I am trying to run as the candies-sharing notebook pipeline does execute successfully. However, I tried placing all the data needed for the pipeline in a separate volume and adding that in Kale and this still gave the same error.
After trying to mount a volume in the home directory and outside the home directory, using a PVC name or a PV name, and using an image based on the official Jupyter stacks or on one from the repo, I am still seeing the error below. I exec'ed into the pod just before it terminated, and when I performed an ls I did not see the volume that should have been mounted (I was using mount point /data). So it seems the data volume is refusing to mount no matter what. I haven't been able to find much information in the logs, but I am now wondering if this might be due to namespacing.
Well, step by step I came to this issue as well.
I'm using Kale at commit https://github.com/kubeflow-kale/kale/commit/4a5d7e63542c550f8fdf0f3c465c046a019378e7 from the master branch, with the 0.5.1 labextension:
jovyan@kale-master-0:~$ jupyter-labextension list
JupyterLab v2.2.9
Known labextensions:
app dir: /usr/local/share/jupyter/lab
kubeflow-kale-labextension v0.5.1 enabled OK
And when I try to open a Python notebook from the JupyterLab web UI I always get the same error:
Message: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'content-type': 'application/json', 'trailer': 'Grpc-Trailer-Content-Type', 'date': 'Fri, 13 Nov 2020 20:21:12 GMT', 'x-envoy-upstream-service-time': '3', 'server': 'envoy', 'transfer-encoding': 'chunked'})
HTTP response body: {"error":"Invalid input error: Invalid resource references for experiment. Namespace is empty.","message":"Invalid input error: Invalid resource references for experiment. Namespace is empty.","code":3,"details":[{"@type":"type.googleapis.com/api.Error","error_message":"Invalid resource references for experiment. Namespace is empty.","error_details":"Invalid input error: Invalid resource references for experiment. Namespace is empty."}]}
Details: You can find more information under /home/jovyan/kale.log
[I 20:21:09.221 LabApp] Kernel shutdown: be0c1ed9-7ed4-4cd9-8ded-0e745c757f90
[I 20:21:11.358 LabApp] Creating new notebook in
[I 20:21:12.006 LabApp] Kernel started: c7f526bc-fc45-4d3c-9512-4ba2b6cc1d8c
[I 20:23:11.933 LabApp] Saving file at /Untitled1.ipynb
[I 20:24:24.682 LabApp] Kernel shutdown: 39c11f71-5aeb-446a-a482-95d7496f51d6
[I 20:24:24.684 LabApp] Starting buffering for c7f526bc-fc45-4d3c-9512-4ba2b6cc1d8c:e6ce8d53-e2ba-4fd4-8b52-e2acfaf8600e
[I 20:24:27.442 LabApp] 302 GET /notebook/anonymous/kale-master/ (127.0.0.1) 1.27ms
[I 20:24:30.478 LabApp] Build is up to date
[I 20:24:30.588 LabApp] Kernel started: f19ba061-e518-472b-8bed-b7c104164bad
[I 20:24:30.732 LabApp] Starting buffering for c7f526bc-fc45-4d3c-9512-4ba2b6cc1d8c:3ee5fbd4-5b73-4c52-9abe-657742d7a92b
At the beginning I tried to use the 0.5.1 release version of Kale but had the same problem. Do you have any ideas how I can fix it?
@mr-yaky You need to set the namespace for KFP. I am working on a PR to do this automatically but it has not been going very smoothly up to now. The easy and lazy way to do it is to create the file ~/.config/kfp/context.json and write {"namespace": "your_namespace"} to it.
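The workaround above can also be scripted. This is a minimal sketch, assuming the context file lives at ~/.config/kfp/context.json as stated in the comment; the helper name `write_kfp_context` is hypothetical.

```python
import json
from pathlib import Path


def write_kfp_context(namespace, path=None):
    """Write {"namespace": ...} to the KFP context file and return its path."""
    if path is None:
        path = Path.home() / ".config" / "kfp" / "context.json"
    path = Path(path)
    # Create ~/.config/kfp if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)

    context = {}
    if path.exists():
        try:
            loaded = json.loads(path.read_text())
            # Keep any other keys that are already stored in the file.
            if isinstance(loaded, dict):
                context = loaded
        except ValueError:
            # Unreadable JSON: start from an empty context.
            context = {}

    context["namespace"] = namespace
    path.write_text(json.dumps(context))
    return path
```

Calling `write_kfp_context("anonymous")` from a notebook terminal would produce the file shown in the next comment.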
Got it! Let me check it now.
@DavidSpek hmm... after adding like this:
jovyan@kale-master-0:~$ cat ~/.config/kfp/context.json
{"namespace": "anonymous"}
I started to get a new type of error:
Message: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'content-type': 'application/json', 'trailer': 'Grpc-Trailer-Content-Type', 'date': 'Fri, 13 Nov 2020 20:45:07 GMT', 'x-envoy-upstream-service-time': '3', 'server': 'envoy', 'transfer-encoding': 'chunked'})
HTTP response body: {"error":"Failed to authorize with API resource references: Bad request.: BadRequestError: Request header error: there is no user identity header.: Request header error: there is no user identity header.","message":"Failed to authorize with API resource references: Bad request.: BadRequestError: Request header error: there is no user identity header.: Request header error: there is no user identity header.","code":10,"details":[{"@type":"type.googleapis.com/api.Error","error_message":"Request header error: there is no user identity header.","error_details":"Failed to authorize with API resource references: Bad request.: BadRequestError: Request header error: there is no user identity header.: Request header error: there is no user identity header."}]}
Also logs from Kale pod:
2020-11-13 20:45:06 run:83 [[DEBUG]] [TID=p3in0r7yq0] [] Decoding ctx of RPC function 'nb.explore_notebook'
2020-11-13 20:45:06 run:95 [[DEBUG]] [TID=p3in0r7yq0] [/home/jovyan/Untitled2.ipynb] Decoding kwargs of RPC function 'nb.explore_notebook'
2020-11-13 20:45:06 run:104 [[DEBUG]] [TID=p3in0r7yq0] [/home/jovyan/Untitled2.ipynb] Importing RPC function 'nb.explore_notebook'
2020-11-13 20:45:06 run:114 [[INFO]] [TID=p3in0r7yq0] [/home/jovyan/Untitled2.ipynb] Executing RPC function 'explore_notebook(source_notebook_path=/home/jovyan/Untitled2.ipynb)'
2020-11-13 20:45:06 run:83 [[DEBUG]] [TID=jyrfuhcq2t] [] Decoding ctx of RPC function 'nb.get_base_image'
2020-11-13 20:45:06 run:95 [[DEBUG]] [TID=jyrfuhcq2t] [/home/jovyan/Untitled2.ipynb] Decoding kwargs of RPC function 'nb.get_base_image'
2020-11-13 20:45:06 run:104 [[DEBUG]] [TID=jyrfuhcq2t] [/home/jovyan/Untitled2.ipynb] Importing RPC function 'nb.get_base_image'
2020-11-13 20:45:06 run:114 [[INFO]] [TID=jyrfuhcq2t] [/home/jovyan/Untitled2.ipynb] Executing RPC function 'get_base_image()'
2020-11-13 20:45:07 run:83 [[DEBUG]] [TID=0h6r1vprru] [] Decoding ctx of RPC function 'nb.find_poddefault_labels_on_server'
2020-11-13 20:45:07 run:95 [[DEBUG]] [TID=0h6r1vprru] [/home/jovyan/Untitled2.ipynb] Decoding kwargs of RPC function 'nb.find_poddefault_labels_on_server'
2020-11-13 20:45:07 run:104 [[DEBUG]] [TID=0h6r1vprru] [/home/jovyan/Untitled2.ipynb] Importing RPC function 'nb.find_poddefault_labels_on_server'
2020-11-13 20:45:07 run:114 [[INFO]] [TID=0h6r1vprru] [/home/jovyan/Untitled2.ipynb] Executing RPC function 'find_poddefault_labels_on_server()'
2020-11-13 20:45:07 nb:217 [[INFO]] [TID=0h6r1vprru] [/home/jovyan/Untitled2.ipynb] Retrieving PodDefaults applied to server...
2020-11-13 20:45:07 nb:223 [[INFO]] [TID=0h6r1vprru] [/home/jovyan/Untitled2.ipynb] Retrieved applied PodDefaults: []
2020-11-13 20:45:07 nb:227 [[INFO]] [TID=0h6r1vprru] [/home/jovyan/Untitled2.ipynb] PodDefault labels applied on server:
2020-11-13 20:45:07 run:83 [[DEBUG]] [TID=eb2xixu2af] [] Decoding ctx of RPC function 'kfp.list_experiments'
2020-11-13 20:45:07 run:95 [[DEBUG]] [TID=eb2xixu2af] [/home/jovyan/Untitled2.ipynb] Decoding kwargs of RPC function 'kfp.list_experiments'
2020-11-13 20:45:07 run:104 [[DEBUG]] [TID=eb2xixu2af] [/home/jovyan/Untitled2.ipynb] Importing RPC function 'kfp.list_experiments'
2020-11-13 20:45:07 run:114 [[INFO]] [TID=eb2xixu2af] [/home/jovyan/Untitled2.ipynb] Executing RPC function 'list_experiments()'
2020-11-13 20:45:07 run:125 [[ERROR]] [TID=eb2xixu2af] [/home/jovyan/Untitled2.ipynb] RPC function 'list_experiments' raised an unhandled exception
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/kale/rpc/run.py", line 116, in run
result = func(request, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/kale/rpc/kfp.py", line 30, in list_experiments
for e in c.list_experiments().experiments or []]
File "/usr/local/lib/python3.6/dist-packages/kfp/_client.py", line 382, in list_experiments
resource_reference_key_id=namespace)
File "/usr/local/lib/python3.6/dist-packages/kfp_server_api/api/experiment_service_api.py", line 581, in list_experiment
return self.list_experiment_with_http_info(**kwargs) # noqa: E501
File "/usr/local/lib/python3.6/dist-packages/kfp_server_api/api/experiment_service_api.py", line 696, in list_experiment_with_http_info
collection_formats=collection_formats)
File "/usr/local/lib/python3.6/dist-packages/kfp_server_api/api_client.py", line 383, in call_api
_preload_content, _request_timeout, _host)
File "/usr/local/lib/python3.6/dist-packages/kfp_server_api/api_client.py", line 202, in __call_api
raise e
File "/usr/local/lib/python3.6/dist-packages/kfp_server_api/api_client.py", line 199, in __call_api
_request_timeout=_request_timeout)
File "/usr/local/lib/python3.6/dist-packages/kfp_server_api/api_client.py", line 407, in request
headers=headers)
File "/usr/local/lib/python3.6/dist-packages/kfp_server_api/rest.py", line 248, in GET
query_params=query_params)
File "/usr/local/lib/python3.6/dist-packages/kfp_server_api/rest.py", line 238, in request
raise ApiException(http_resp=r)
kfp_server_api.exceptions.ApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'content-type': 'application/json', 'trailer': 'Grpc-Trailer-Content-Type', 'date': 'Fri, 13 Nov 2020 20:45:07 GMT', 'x-envoy-upstream-service-time': '3', 'server': 'envoy', 'transfer-encoding': 'chunked'})
HTTP response body: {"error":"Failed to authorize with API resource references: Bad request.: BadRequestError: Request header error: there is no user identity header.: Request header error: there is no user identity header.","message":"Failed to authorize with API resource references: Bad request.: BadRequestError: Request header error: there is no user identity header.: Request header error: there is no user identity header.","code":10,"details":[{"@type":"type.googleapis.com/api.Error","error_message":"Request header error: there is no user identity header.","error_details":"Failed to authorize with API resource references: Bad request.: BadRequestError: Request header error: there is no user identity header.: Request header error: there is no user identity header."}]}
All right, the last problem exists only when using Kale from HEAD of master. With release version 0.5.1 it seems to work well. @DavidSpek Thank you once more!
Happy to hear it is working. I haven't seen that issue before so I will look at it once I get back to Kale development.
@mr-yaky I see the same error as described by you, although I am using the latest image of Kale:
HTTP response headers: HTTPHeaderDict({'content-type': 'application/json', 'trailer': 'Grpc-Trailer-Content-Type', 'date': 'Fri, 13 Nov 2020 20:45:07 GMT', 'x-envoy-upstream-service-time': '3', 'server': 'envoy', 'transfer-encoding': 'chunked'})
HTTP response body: {"error":"Failed to authorize with API resource references: Bad request.: BadRequestError: Request header error: there is no user identity header.: Request header error: there is no user identity header.","message":"Failed to authorize with API resource references: Bad request.: BadRequestError: Request header error: there is no user identity header.: Request header error: there is no user identity header.","code":10,"details":[{"@type":"type.googleapis.com/api.Error","error_message":"Request header error: there is no user identity header.","error_details":"Failed to authorize with API resource references: Bad request.: BadRequestError: Request header error: there is no user identity header.: Request header error: there is no user identity header."}]}
I've set the kfp config context. Am I missing something?
I have the same error as described by you.
After applying a fix to allow Kale to compile, upload and run pipelines in a multi-user environment according to the steps I posted in https://github.com/kubeflow-kale/kale/issues/204#issuecomment-694771168, I receive the following error that Kale can't properly list experiments as the namespace field is empty. Despite this error, Kale is able to upload the pipeline. However, manually running that pipeline results in an HTTP error 403 during the Getting workflow step.
kale.log showing the function 'list_experiments' raised an unhandled exception:
Error when running the Kale-created pipeline: