ekesken closed this issue 4 years ago
/assign @numerology to help identify the work required and assign priority for the issue.
Hi @ekesken, thanks for reporting! Just to confirm, are you using kfp.Client().create_recurring_run to launch the scheduled workflow, or something else?
Unfortunately we don't know how this manifest was created exactly, that's why we couldn't reproduce the problem.
The user was using a code block like this one in his notebook to play with things:
# Submit a pipeline run
from kale.common.kfputils import generate_run_name

run_name = generate_run_name('append-pipeline-fixed-pq1q2')
run_result = client.run_pipeline(
    experiment.id, run_name, pipeline_filename, {})

recurrent_run_name = generate_run_name('append-pipeline-fixed-recurrent-pq1q2')
run_recurrent_result = client.create_recurring_run(
    experiment.id, recurrent_run_name,
    start_time='2020-11-06T00:00:00.00Z',
    end_time='2020-11-06T02:00:00.00Z',
    cron_expression='*/10 * * * *',
    pipeline_package_path=pipeline_filename)
But not with those exact parameters; he was trying various things. He also reported that he had used the Pipelines UI many times to create recurring runs, but he could not reproduce the invalid character issue with either the UI or the kfp client.
Calling client.create_recurring_run with interval_second="10:0" causes the following error:
# python append-pipeline-fixed-pq1q2.kale.py
Traceback (most recent call last):
  File "append-pipeline-fixed-pq1q2.kale.py", line 298, in <module>
    run_recurrent_result = client.create_recurring_run(experiment.id, recurrent_run_name, end_time='2020-11-09T14:00:00.00Z', interval_second="10:0", pipeline_package_path=pipeline_filename)
  File "/usr/local/lib/python3.7/dist-packages/kfp/_client.py", line 499, in create_recurring_run
    return self._job_api.create_job(body=job_body)
  File "/usr/local/lib/python3.7/dist-packages/kfp_server_api/api/job_service_api.py", line 79, in create_job
    return self.create_job_with_http_info(body, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.7/dist-packages/kfp_server_api/api/job_service_api.py", line 177, in create_job_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.7/dist-packages/kfp_server_api/api_client.py", line 383, in call_api
    _preload_content, _request_timeout, _host)
  File "/usr/local/lib/python3.7/dist-packages/kfp_server_api/api_client.py", line 202, in __call_api
    raise e
  File "/usr/local/lib/python3.7/dist-packages/kfp_server_api/api_client.py", line 199, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.7/dist-packages/kfp_server_api/api_client.py", line 427, in request
    body=body)
  File "/usr/local/lib/python3.7/dist-packages/kfp_server_api/rest.py", line 285, in POST
    body=body)
  File "/usr/local/lib/python3.7/dist-packages/kfp_server_api/rest.py", line 238, in request
    raise ApiException(http_resp=r)
kfp_server_api.exceptions.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'Date': 'Mon, 09 Nov 2020 10:37:01 GMT', 'Content-Length': '120'})
HTTP response body: {"error":"invalid character ':' after top-level value","message":"invalid character ':' after top-level value","code":3}
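The 400 response above suggests the server rejects the value "10:0" while parsing the request. A minimal client-side guard could catch this earlier, assuming interval_second is meant to be a positive whole number of seconds. Note that parse_interval_second is a hypothetical helper written for illustration, not part of the kfp SDK:

```python
import re


def parse_interval_second(value):
    """Return `value` as a positive int, or raise ValueError.

    Hypothetical helper (not part of kfp): rejects inputs such as "10:0"
    before they are passed on to client.create_recurring_run.
    """
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool):
        raise ValueError(f"interval_second must be an integer, got {value!r}")
    if isinstance(value, int):
        seconds = value
    elif isinstance(value, str) and re.fullmatch(r"[0-9]+", value.strip()):
        # Accept plain decimal strings such as "600".
        seconds = int(value.strip())
    else:
        raise ValueError(f"interval_second must be an integer, got {value!r}")
    if seconds <= 0:
        raise ValueError(f"interval_second must be positive, got {seconds}")
    return seconds
```

With a guard like this, a call such as parse_interval_second("10:0") fails locally with a ValueError instead of reaching the API server.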
So he probably introduced it during a manual edit with kubectl. That is why I am asking for openAPIV3Schema validation in the CRD spec instead, to be sure such a spec can't be applied no matter where it comes from. There is no validation configuration in the CRD right now: https://github.com/kubeflow/pipelines/blob/1.1.0-alpha.1/backend/src/crd/install/manifests/scheduledworkflow-crd.yaml
You can see an example usage here: https://github.com/kubeflow/pipelines/blob/1.1.0-alpha.1/manifests/kustomize/base/application/cluster-scoped/application-crd.yaml#L14
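To illustrate what such a validation rule would enforce, here is a rough Python sketch. The schema fragment is only an assumption about how the intervalSecond field could be typed, not the project's actual CRD, and validate is a deliberately tiny hypothetical checker that understands only "object" and "integer" types:

```python
# Assumed schema fragment (as a Python dict) for the field named in this issue.
SCHEMA = {
    "type": "object",
    "properties": {
        "spec": {
            "type": "object",
            "properties": {
                "trigger": {
                    "type": "object",
                    "properties": {
                        "periodicSchedule": {
                            "type": "object",
                            "properties": {
                                "intervalSecond": {"type": "integer", "minimum": 1},
                            },
                        },
                    },
                },
            },
        },
    },
}


def validate(value, schema, path="$"):
    """Return a list of error strings; an empty list means the value passed."""
    errors = []
    kind = schema.get("type")
    if kind == "object":
        if not isinstance(value, dict):
            return [f"{path}: expected object, got {value!r}"]
        # Only check properties that are present; nothing is required here.
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                errors += validate(value[key], sub_schema, f"{path}.{key}")
    elif kind == "integer":
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            errors.append(f"{path}: expected integer, got {value!r}")
        elif "minimum" in schema and value < schema["minimum"]:
            errors.append(f"{path}: {value} is below minimum {schema['minimum']}")
    return errors
```

Usage: a manifest with intervalSecond set to "10:0" produces one error naming the field path, while an integer such as 600 passes. In a real cluster the API server would perform this check itself once the schema is declared in the CRD, so the object would be rejected regardless of whether it came from the UI, the kfp client, or kubectl.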
Hi @ekesken, just created an issue summarizing our vision for scheduled workflow: https://github.com/kubeflow/pipelines/issues/4752
I don't think it's worth investing more in it rather than using an existing cron job implementation such as the Kubernetes one.
What steps did you take:
Used a non-integer value for the spec.trigger.periodicSchedule.intervalSecond field in a scheduledworkflow.
What happened:
Persistence agent begins logging errors every second like this one:
It's unnecessarily filling the disk with these error logs for a scheduledworkflow that would never be executed.
What did you expect to happen:
I would expect not to be allowed to apply such a spec.
Environment:
How did you deploy Kubeflow Pipelines (KFP)?
We have our kustomize overlays over the manifests coming from https://github.com/kubeflow/manifests/archive/v1.1.0.tar.gz; we only install the pipelines and metadata components with their requirements. We're working on an EKS cluster (v1.15.11-eks-065dce).
KFP version: https://github.com/kubeflow/pipelines/commit/988f5b02e4211dfff1c02eb0b9a52cbc69793364
Anything else you would like to add:
We were having issues in the Pipelines UI: any new Run attempt ended up with a forever-spinning icon in the UI without showing the nodes. We realised that in the /apis/v1beta1/runs/<run-id> response, the pipeline_runtime.workflow_manifest field always had the status {"startedAt":null,"finishedAt":null}, but with kubectl we could see that the status in the corresponding workflow was updated successfully. Then we saw these error logs about unexpected characters in a scheduledworkflow spec. We deleted the problematic scheduledworkflow object, and after that all new and previous run statuses began to be updated properly and shown in the UI without problems.
Unfortunately we couldn't reproduce this case, and we no longer have the spec that caused the problem, but with schema validation in place the situation that triggers this bug could never occur. The problematic spec was created with Kale.