It's possible to create a scheduledworkflow that will never be executed due to missing schema validation

ekesken commented 4 years ago

What steps did you take:

Used a non-integer value for the spec.trigger.periodicSchedule.intervalSecond field in a scheduledworkflow.

What happened:

The persistence agent begins logging errors like this one every second:

E1105 23:09:17.190195       1 reflector.go:205] pkg/mod/k8s.io/client-go@v0.0.0-20180718001006-59698c7d9724/tools/cache/reflector.go:99: Failed to list *v1beta1.ScheduledWorkflow: v1beta1.ScheduledWorkflowList.Items: []v1beta1.ScheduledWorkflow: v1beta1.ScheduledWorkflow.Spec: v1beta1.ScheduledWorkflowSpec.Trigger: v1beta1.Trigger.PeriodicSchedule: v1beta1.PeriodicSchedule.IntervalSecond: readUint64: unexpected character: �, error found in #10 byte of ...|lSecond":"10:0"}},"w|..., bigger context ...|,"trigger":{"periodicSchedule":{"intervalSecond":"10:0"}},"workflow":{"spec":{"arguments":{},"entryp|...

It's unnecessarily filling the disk with these error logs for a scheduledworkflow that would never be executed.
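As a rough Python analogy of why one malformed object is enough to make every list call fail, the sketch below decodes a list strictly and aborts on the first bad item. It is not the persistence agent's code: `PeriodicSchedule`, `decode_item`, `decode_list` and the payload are hypothetical, and only the `intervalSecond` path and the `"10:0"` value come from the log above.

```python
import json
from dataclasses import dataclass


@dataclass
class PeriodicSchedule:
    interval_second: int


def decode_item(raw: dict) -> PeriodicSchedule:
    """Strictly decode one item; a non-integer intervalSecond is an error."""
    value = raw["spec"]["trigger"]["periodicSchedule"]["intervalSecond"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"intervalSecond: expected integer, got {value!r}")
    return PeriodicSchedule(interval_second=value)


def decode_list(payload: str) -> list:
    """Decode a whole list response; any bad item fails the entire call."""
    items = json.loads(payload)["items"]
    return [decode_item(item) for item in items]


# Hypothetical list response: one valid object and one carrying the value
# seen in the log.
payload = json.dumps({
    "items": [
        {"spec": {"trigger": {"periodicSchedule": {"intervalSecond": 600}}}},
        {"spec": {"trigger": {"periodicSchedule": {"intervalSecond": "10:0"}}}},
    ]
})

try:
    decode_list(payload)
    list_error = None
except ValueError as err:
    list_error = str(err)

print(list_error)
```

The valid object is never returned either, because the failure happens while decoding the list as a whole. That would be consistent with the "Failed to list" message repeating on every retry.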

What did you expect to happen:

I would expect not to be allowed to apply such a spec.

Environment:

How did you deploy Kubeflow Pipelines (KFP)?

We have our own kustomize overlays on top of the manifests from https://github.com/kubeflow/manifests/archive/v1.1.0.tar.gz, and we only install the pipelines and metadata components with their requirements. We're working on an EKS cluster (v1.15.11-eks-065dce).

KFP version: https://github.com/kubeflow/pipelines/commit/988f5b02e4211dfff1c02eb0b9a52cbc69793364

Anything else you would like to add:

We were having issues in the pipelines UI: any new Run attempt ended up with a forever-spinning icon, without the nodes being shown. We realised that in the /apis/v1beta1/runs/<run-id> response, the pipeline_runtime.workflow_manifest field always had the status {"startedAt":null,"finishedAt":null}, although kubectl showed that the status of the corresponding workflow had been updated successfully. Then we saw these error logs about unexpected characters in a scheduledworkflow spec. We deleted the problematic scheduledworkflow object, and after that the statuses of all new and previous runs began to be updated properly and shown in the UI without problems.

Unfortunately we couldn't reproduce this case, and we no longer have the spec that caused the problem. In any case, if we had schema validation, the situation that triggers this bug could never occur. The problematic spec was created with Kale.

dushyanthsc commented 4 years ago

/assign @numerology to help identify the work required and assign priority for the issue.

numerology commented 4 years ago

Hi @ekesken thanks for reporting! Just to confirm, are you using kfp.Client().create_recurring_run to launch the scheduled workflow or something else?

ekesken commented 4 years ago

Unfortunately we don't know exactly how this manifest was created, which is why we couldn't reproduce the problem.

The user was using a code block like this one in his notebook to play with things:

    # Submit a pipeline run
    from kale.common.kfputils import generate_run_name
    run_name = generate_run_name('append-pipeline-fixed-pq1q2')
    run_result = client.run_pipeline(
        experiment.id, run_name, pipeline_filename, {})
    recurrent_run_name = generate_run_name('append-pipeline-fixed-recurrent-pq1q2')
    run_recurrent_result = client.create_recurring_run(experiment.id, recurrent_run_name,
                                                       start_time='2020-11-06T00:00:00.00Z',
                                                       end_time='2020-11-06T02:00:00.00Z',
                                                       cron_expression='*/10 * * * *',
                                                       pipeline_package_path=pipeline_filename)

But not with those parameters; he was trying various things. He also reported that he had used the pipelines UI many times to create recurring runs, but he couldn't reproduce the invalid character issue with either the UI or the kfp client.

Passing interval_second="10:0" to client.create_recurring_run causes the following error:

    # python append-pipeline-fixed-pq1q2.kale.py
    Traceback (most recent call last):
      File "append-pipeline-fixed-pq1q2.kale.py", line 298, in <module>
        run_recurrent_result = client.create_recurring_run(experiment.id, recurrent_run_name, end_time='2020-11-09T14:00:00.00Z', interval_second="10:0", pipeline_package_path=pipeline_filename)
      File "/usr/local/lib/python3.7/dist-packages/kfp/_client.py", line 499, in create_recurring_run
        return self._job_api.create_job(body=job_body)
      File "/usr/local/lib/python3.7/dist-packages/kfp_server_api/api/job_service_api.py", line 79, in create_job
        return self.create_job_with_http_info(body, **kwargs)  # noqa: E501
      File "/usr/local/lib/python3.7/dist-packages/kfp_server_api/api/job_service_api.py", line 177, in create_job_with_http_info
        collection_formats=collection_formats)
      File "/usr/local/lib/python3.7/dist-packages/kfp_server_api/api_client.py", line 383, in call_api
        _preload_content, _request_timeout, _host)
      File "/usr/local/lib/python3.7/dist-packages/kfp_server_api/api_client.py", line 202, in __call_api
        raise e
      File "/usr/local/lib/python3.7/dist-packages/kfp_server_api/api_client.py", line 199, in __call_api
        _request_timeout=_request_timeout)
      File "/usr/local/lib/python3.7/dist-packages/kfp_server_api/api_client.py", line 427, in request
        body=body)
      File "/usr/local/lib/python3.7/dist-packages/kfp_server_api/rest.py", line 285, in POST
        body=body)
      File "/usr/local/lib/python3.7/dist-packages/kfp_server_api/rest.py", line 238, in request
        raise ApiException(http_resp=r)
    kfp_server_api.exceptions.ApiException: (400)
    Reason: Bad Request
    HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'Date': 'Mon, 09 Nov 2020 10:37:01 GMT', 'Content-Length': '120'})
    HTTP response body: {"error":"invalid character ':' after top-level value","message":"invalid character ':' after top-level value","code":3}

So he probably did it during a manual edit with kubectl. That's why what I'm asking for is an openAPIV3Schema validation configuration in the CRD spec, to be sure such a spec can't be applied regardless of where it comes from. There is no validation configuration in the CRD right now: https://github.com/kubeflow/pipelines/blob/1.1.0-alpha.1/backend/src/crd/install/manifests/scheduledworkflow-crd.yaml

You can see an example usage here: https://github.com/kubeflow/pipelines/blob/1.1.0-alpha.1/manifests/kustomize/base/application/cluster-scoped/application-crd.yaml#L14
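To make the request more concrete, here is a minimal sketch of the kind of rule I have in mind, written as a Python dict rather than CRD YAML so it can be run locally. `SCHEMA` and `validate` are hypothetical: the schema is not taken from the real ScheduledWorkflow CRD, it only types the one field from this issue, and the validator covers just a tiny subset of OpenAPI v3 (`type` and `properties`).

```python
# Hypothetical schema fragment: the Python equivalent of what could sit under
# openAPIV3Schema. Only spec.trigger.periodicSchedule.intervalSecond is typed;
# everything else about the real CRD is left out.
SCHEMA = {
    "type": "object",
    "properties": {
        "spec": {
            "type": "object",
            "properties": {
                "trigger": {
                    "type": "object",
                    "properties": {
                        "periodicSchedule": {
                            "type": "object",
                            "properties": {
                                "intervalSecond": {"type": "integer"},
                            },
                        },
                    },
                },
            },
        },
    },
}


def validate(value, schema, path=""):
    """Return a list of error strings; an empty list means the value passes."""
    expected = schema.get("type")
    if expected == "object":
        if not isinstance(value, dict):
            return [f"{path or '<root>'}: expected object, got {type(value).__name__}"]
        errors = []
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:  # absent fields are allowed in this sketch
                child = f"{path}.{key}" if path else key
                errors += validate(value[key], sub_schema, child)
        return errors
    if expected == "integer":
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            return [f"{path}: expected integer, got {type(value).__name__}"]
    return []


# The value from the error log, and a well-formed counterpart.
bad = {"spec": {"trigger": {"periodicSchedule": {"intervalSecond": "10:0"}}}}
good = {"spec": {"trigger": {"periodicSchedule": {"intervalSecond": 600}}}}

print(validate(bad, SCHEMA))
print(validate(good, SCHEMA))
```

With an equivalent schema in the CRD itself, the API server would do this check on create and update, so a manual kubectl edit like the one suspected here would be rejected before the object is stored.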

Bobgy commented 4 years ago

Hi @ekesken, just created an issue summarizing our vision for scheduled workflow: https://github.com/kubeflow/pipelines/issues/4752

I don't think it's worth investing more in it, rather than using an existing cron job implementation like the Kubernetes one.
