Is your feature request related to a problem? Please describe.
Currently there is no monitoring of workflow failures, so if a call to VAI, for example, fails, the workflow finishes and no one is the wiser.
Currently, the various workflows the KFP Operator triggers start a pod from the provider image, which provides a CLI for the various pieces of functionality, e.g.:

```
/provider --provider <config location> pipeline create --pipeline-definition /resource-definition.yaml
```
To allow better monitoring of provider events, this step should be turned into a call to an API on a running deployment that can expose metrics (using OpenTelemetry), rather than relying on publishing metrics to the Argo workflow controller, which we have already found is not the best solution.
Describe the solution you'd like
We already have the event source server/event processor deployment for handling run completion events. The idea is to add an HTTP REST interface to this deployment, turning the commands currently issued via the CLI into CRUD HTTP requests against the running deployment. All of the workflows will then need to be changed so that they make HTTP requests rather than starting up a pod.

A metrics endpoint should also be added to expose metrics for the various events that are processed and whether each was successful or not. Analysis should be carried out to determine which metrics would be useful for the different endpoints.
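As a rough sketch of the metrics side, each handler could record a counter tagged with the resource kind and a success flag via the OpenTelemetry Go API. The meter, instrument, and attribute names below are illustrative assumptions, not existing KFP Operator identifiers:

```go
package metrics

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// All names here are illustrative, not existing KFP Operator identifiers.
var meter = otel.Meter("provider-service")
var opsTotal, _ = meter.Int64Counter("provider.operations.total")

// RecordOutcome counts one provider operation, tagged with the resource
// kind (pipeline, run, schedule, experiment) and whether it succeeded.
func RecordOutcome(ctx context.Context, resource string, err error) {
	opsTotal.Add(ctx, 1, metric.WithAttributes(
		attribute.String("resource", resource),
		attribute.Bool("success", err == nil),
	))
}
```

The counter is then picked up by whatever OpenTelemetry exporter the deployment is configured with, so the workflows no longer need to push metrics through the Argo workflow controller.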
--provider = this is the Provider custom resource, so the service will still need to load the resource to get the config, as it currently does. The provider name will be a template variable.
The different resource definitions will be passed in the body of the request.
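A minimal sketch of one such handler, assuming the provider name arrives as a template variable (modelled as a query parameter here for simplicity) and the resource definition is read from the request body; the function names are hypothetical:

```go
package providerservice

import (
	"fmt"
	"io"
	"net/http"
)

// loadProviderConfig stands in for the existing logic that loads the
// Provider custom resource to obtain its config; the name is hypothetical.
func loadProviderConfig(name string) error {
	if name == "" {
		return fmt.Errorf("no provider given")
	}
	return nil // look up the Provider custom resource here
}

// HandlePipeline sketches the replacement for the CLI invocation: instead
// of flags and a mounted file, the provider comes from the URL and the
// pipeline definition from the request body.
func HandlePipeline(w http.ResponseWriter, r *http.Request) {
	if err := loadProviderConfig(r.URL.Query().Get("provider")); err != nil {
		http.Error(w, "unknown provider", http.StatusNotFound)
		return
	}
	definition, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, "could not read resource definition", http.StatusBadRequest)
		return
	}
	_ = definition // hand off to the provider-specific create/update/delete logic
	w.WriteHeader(http.StatusOK)
}
```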
Endpoints:
- /pipeline
- /run
- /schedule
- /experiment

Verb mapping:
- PUT = Create
- POST = Update
- DELETE = Delete
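Under that verb mapping, the routing could look something like the sketch below; the placeholder handler just logs the operation, and real handlers would invoke the provider-specific logic:

```go
package main

import (
	"log"
	"net/http"
)

// handleResource maps the proposed verb convention (PUT = Create,
// POST = Update, DELETE = Delete) onto the operation for one resource kind.
func handleResource(resource string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var op string
		switch r.Method {
		case http.MethodPut:
			op = "create"
		case http.MethodPost:
			op = "update"
		case http.MethodDelete:
			op = "delete"
		default:
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		log.Printf("%s %s", op, resource) // placeholder for provider logic
		w.WriteHeader(http.StatusOK)
	}
}

func main() {
	for _, resource := range []string{"pipeline", "run", "schedule", "experiment"} {
		http.HandleFunc("/"+resource, handleResource(resource))
	}
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```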