opensearch-project / flow-framework

OpenSearch plugin that enables builders to innovate AI apps on OpenSearch
Apache License 2.0

[BUG] Provision failure returns incorrect error message #640

Closed (ohltyler closed 7 months ago)

ohltyler commented 8 months ago

If a workflow has a runtime failure when provisioning (e.g., the text_embedding processor doesn't exist when running the create_ingest_pipeline step on a cluster without the neural-search plugin installed), the state is correctly set to FAILED with a relevant error message. Example:

{
  "workflow_id": "LGKGn44B51-soDq-mga1",
  "error": "org.opensearch.flowframework.exception.WorkflowStepException during step create_ingest_pipeline, restStatus: BAD_REQUEST",
  "state": "FAILED"
}

However, after updating the cluster to address the runtime failure and re-running the provision API, it fails and returns an incorrect message:

{
  "error": "The template has already been provisioned: LGKGn44B51-soDq-mga1"
}

This should be updated to a more relevant message, along the lines of "The workflow failed provisioning; please deprovision and try again."

Note that after manually running the deprovision API and re-running provision, it works as expected. But it feels weird to have to run deprovision when it does nothing besides reset the state, since there were no created resources to clean up to begin with.
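The gating logic being proposed here can be sketched as follows. This is a purely illustrative Python sketch (the plugin itself is Java, and the function name is invented): reject re-provision only for in-progress or completed workflows, and let a FAILED workflow be provisioned again directly, since it created no resources to clean up.

```python
# Hypothetical sketch of the provisioning gate proposed in this issue.
# State names mirror the workflow states shown in this thread.

def check_reprovision(workflow_id, state):
    """Return an error message if provisioning must be rejected, else None."""
    if state in ("PROVISIONING", "COMPLETED"):
        return f"The template has already been provisioned: {workflow_id}"
    # Proposed change: a FAILED workflow created no resources, so there is
    # nothing to deprovision; let the user simply retry provisioning.
    return None
```

Under this sketch, `check_reprovision("LGKGn44B51-soDq-mga1", "FAILED")` returns None and the retry proceeds, instead of forcing a no-op deprovision first.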

ohltyler commented 8 months ago

> But it feels weird to have to run deprovision when it does nothing besides reset the state, since there were no created resources to clean up to begin with.

I don't have a good solution for this part off the top of my head, but from a user perspective this is a bit confusing. Worth exploring this edge case from a usability standpoint.

dbwiddis commented 8 months ago

See https://github.com/opensearch-project/flow-framework/issues/537 which requires the same solution.

ohltyler commented 8 months ago

Additional issue found: the error field looks like it is not cleared, even when deprovisioning and reprovisioning:

{
  "workflow_id": "LGKGn44B51-soDq-mga1",
  "error": "org.opensearch.flowframework.exception.WorkflowStepException during step create_ingest_pipeline, restStatus: BAD_REQUEST",
  "state": "COMPLETED",
  "provisioning_progress": "DONE",
  "provision_start_time": 1712074192234,
  "provision_end_time": 1712074192545,
  "resources_created": [
    {
      "resource_type": "pipeline_id",
      "resource_id": "test-pipeline",
      "workflow_step_name": "create_ingest_pipeline",
      "workflow_step_id": "create_ingest_pipeline"
    },
    {
      "resource_type": "index_name",
      "resource_id": "my-knn-index",
      "workflow_step_name": "create_index",
      "workflow_step_id": "create_index"
    }
  ]
}
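The fix for this stale-error behavior amounts to resetting the state document when a new provision run starts. A minimal Python sketch (the function is invented; field names mirror the state document shown above) of that reset:

```python
# Hypothetical sketch: clear stale fields when a new provision run starts,
# so a previous failure's "error" cannot leak into a later COMPLETED
# response like the one above.

def reset_for_provision(state_doc):
    updated = dict(state_doc)          # leave the original document untouched
    updated["error"] = None            # drop the stale error from the failed run
    updated["state"] = "PROVISIONING"
    updated["resources_created"] = []
    return updated
```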
dbwiddis commented 8 months ago

> Additional issue found: the error field looks like it is not cleared, even when deprovisioning and reprovisioning:

If you review and approve #635, that problem will go away :-)

ohltyler commented 8 months ago

> > Additional issue found: the error field looks like it is not cleared, even when deprovisioning and reprovisioning:
>
> If you review and approve #635, that problem will go away :-)

Ha! Perfect

amitgalitz commented 8 months ago

On a similar note, should we add more detail to the error message itself? In "org.opensearch.flowframework.exception.WorkflowStepException during step create_ingest_pipeline, restStatus: BAD_REQUEST" we use WorkflowStepException to indicate that the step itself caused the error, but outside of the logs we don't propagate the fact that the text_embedding processor isn't available in the cluster.

Today, if a user hits the create ingest pipeline API directly, they get more information without having to look at the logs:

{
    "error": {
        "root_cause": [
            {
                "type": "parse_exception",
                "reason": "No processor type exists with name [text_embeddinssg]",
                "processor_type": "text_embeddinssg"
            }
        ],
        "type": "parse_exception",
        "reason": "No processor type exists with name [text_embeddinssg]",
        "processor_type": "text_embeddinssg"
    },
    "status": 400
}
amitgalitz commented 7 months ago


@dbwiddis @ohltyler @owaiskazi19 @jackiehanyang @joshpalis Can I get some input on this? I feel like it's actually a big deal, since there are many other cases where users of the current API miss the exception that would otherwise be given to them. As another example, if we register a local model in ML without changing settings, we get this back:

{
    "error": {
        "root_cause": [
            {
                "type": "illegal_argument_exception",
                "reason": "No eligible node found to execute this request. It's best practice to provision ML nodes to serve your models. You can disable this setting to serve the model on your data node for development purposes by disabling the \"plugins.ml_commons.only_run_on_ml_node\" configuration using the _cluster/setting api"
            }
        ],
        "type": "illegal_argument_exception",
        "reason": "No eligible node found to execute this request. It's best practice to provision ML nodes to serve your models. You can disable this setting to serve the model on your data node for development purposes by disabling the \"plugins.ml_commons.only_run_on_ml_node\" configuration using the _cluster/setting api"
    },
    "status": 400
}

This is hidden in our error message.
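The propagation being asked for amounts to walking the exception chain and surfacing the deepest cause, the way OpenSearch's own error responses expose "root_cause.reason". A hedged Python sketch (exception class and function names are invented; the plugin would do this in Java):

```python
# Hypothetical sketch: surface the root cause instead of only the wrapper
# exception, mirroring the "root_cause.reason" field shown in the direct
# API responses above.

class WorkflowStepException(Exception):
    """Stand-in for the plugin's step-level wrapper exception."""

def user_facing_error(step_name, exc):
    # Walk to the deepest chained cause so the message includes the real
    # reason, e.g. "No processor type exists with name [...]".
    cause = exc
    while cause.__cause__ is not None:
        cause = cause.__cause__
    return f"Failed step {step_name}: {cause}"
```

With this, wrapping the parse error in a WorkflowStepException (via `raise ... from ...`) still yields a user-facing message containing "No processor type exists with name [text_embeddinssg]" rather than only the wrapper's class name and status.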

amitgalitz commented 7 months ago


Created a new issue for this: https://github.com/opensearch-project/flow-framework/issues/670

dbwiddis commented 7 months ago

This issue was completed in #642. The new issue raised in the reopen comment is being tracked in #670.