uber / cadence

Cadence is a distributed, scalable, durable, and highly available orchestration engine to execute asynchronous long-running business logic in a scalable and resilient way.
https://cadenceworkflow.io
MIT License

Elasticsearch query error: cadence internal error, msg: ListWorkflowExecutions failed. Error: elastic: Error 400 (Bad Request): all shards failed [type=search_phase_execution_exception] #3374

Closed avarma2053 closed 4 years ago

avarma2053 commented 4 years ago

Describe the bug Queries from Cadence Web and the Cadence CLI to Elasticsearch fail with "cadence internal error, msg: ListWorkflowExecutions failed. Error: elastic: Error 400 (Bad Request): all shards failed [type=search_phase_execution_exception]". After debugging further, I found that RunID is used for sorting. In my index RunID is a text field, and by default Elasticsearch does not allow a text field to be used for aggregation or sorting. https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html

Detailed elasticsearch error: {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"Fielddata is disabled on text fields by default. Set fielddata=true on [RunID] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"cadence-visibility-dev","node":"wLpGV-fdRFWaEz2yzPBsSw","reason":{"type":"illegal_argument_exception","reason":"Fielddata is disabled on text fields by default. Set fielddata=true on [RunID] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."}}],"caused_by":{"type":"illegal_argument_exception","reason":"Fielddata is disabled on text fields by default. Set fielddata=true on [RunID] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.","caused_by":{"type":"illegal_argument_exception","reason":"Fielddata is disabled on text fields by default. Set fielddata=true on [RunID] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."}}},"status":400}
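
The response above repeats the same root cause several times, so it can help to pull out just the field names it complains about. The following is a minimal sketch, not part of Cadence: the function name `fields_needing_keyword` is hypothetical, the sample payload is a trimmed copy of the error above, and the regular expression assumes the message wording shown in this issue.

```python
import json
import re
from typing import List

# Matches the hint Elasticsearch prints for text fields used in sorting,
# e.g. "Set fielddata=true on [RunID] in order to ...".
FIELDDATA_PATTERN = re.compile(r"Set fielddata=true on \[([^\]]+)\]")


def fields_needing_keyword(error_body: str) -> List[str]:
    """Return the field names named in fielddata errors, without duplicates."""
    payload = json.loads(error_body)
    fields: List[str] = []
    for cause in payload.get("error", {}).get("root_cause", []):
        match = FIELDDATA_PATTERN.search(cause.get("reason", ""))
        if match and match.group(1) not in fields:
            fields.append(match.group(1))
    return fields


# Trimmed copy of the error response from this issue.
SAMPLE = json.dumps({
    "error": {
        "root_cause": [{
            "type": "illegal_argument_exception",
            "reason": "Fielddata is disabled on text fields by default. "
                      "Set fielddata=true on [RunID] in order to load "
                      "fielddata in memory by uninverting the inverted index.",
        }],
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
    },
    "status": 400,
})

print(fields_needing_keyword(SAMPLE))
```

Any field reported this way is mapped as text in the index and is being used in a sort or aggregation.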

To Reproduce Is the issue reproducible?

Steps to reproduce the behavior: I'm using the Cadence 0.12.0 release, deployed with the Banzai Cloud Cadence Helm chart.

  1. Enable elasticsearch advanced visibility
  2. Add kafka configurations
  3. Run some workflows so that they are saved in Elasticsearch
  4. Open cadence Web on that domain and you will see this error: cadence internal error, msg: ListWorkflowExecutions failed. Error: elastic: Error 400 (Bad Request): all shards failed [type=search_phase_execution_exception].

Expected behavior A list of open/closed workflows with run ID, start time, and close time details

Screenshots NA

Additional context Cadence frontend server logs:
{"level":"info","ts":"2020-07-02T14:31:24.761Z","msg":"List open workflow with filter","service":"cadence-frontend","wf-domain-name":"test-domain","wf-list-filter-type":"WID","logging-call-at":"workflowHandler.go:2850"}
{"level":"error","ts":"2020-07-02T14:31:24.761Z","msg":"Internal service error","service":"cadence-frontend","error":"InternalServiceError{Message: ListOpenWorkflowExecutionsByWorkflowID failed. Error: elastic: Error 400 (Bad Request): all shards failed [type=search_phase_execution_exception]}","logging-call-at":"workflowHandler.go:3772","stacktrace":"github.com/uber/cadence/common/log/loggerimpl.(*loggerImpl).Error\n\t/cadence/common/log/loggerimpl/logger.go:134\ngithub.com/uber/cadence/service/frontend.(*WorkflowHandler).error\n\t/cadence/service/frontend/workflowHandler.go:3772\ngithub.com/uber/cadence/service/frontend.(*WorkflowHandler).ListOpenWorkflowExecutions\n\t/cadence/service/frontend/workflowHandler.go:2868\ngithub.com/uber/cadence/service/frontend.(*DCRedirectionHandlerImpl).ListOpenWorkflowExecutions.func2\n\t/cadence/service/frontend/dcRedirectionHandler.go:415\ngithub.com/uber/cadence/service/frontend.(*NoopRedirectionPolicy).WithDomainNameRedirect\n\t/cadence/service/frontend/dcRedirectionPolicy.go:112\ngithub.com/uber/cadence/service/frontend.(*DCRedirectionHandlerImpl).ListOpenWorkflowExecutions\n\t/cadence/service/frontend/dcRedirectionHandler.go:411\ngithub.com/uber/cadence/service/frontend.(*AccessControlledWorkflowHandler).ListOpenWorkflowExecutions\n\t/cadence/service/frontend/accessControlledHandler.go:346\ngithub.com/uber/cadence/.gen/go/cadence/workflowserviceserver.handler.ListOpenWorkflowExecutions\n\t/cadence/.gen/go/cadence/workflowserviceserver/server.go:932\ngo.uber.org/yarpc/encoding/thrift.thriftUnaryHandler.Handle\n\t/go/pkg/mod/go.uber.org/yarpc@v1.42.0/encoding/thrift/inbound.go:61\ngo.uber.org/yarpc/internal/observability.(*Middleware).Handle\n\t/go/pkg/mod/go.uber.org/yarpc@v1.42.0/internal/observability/middleware.go:141\ngo.uber.org/yarpc/api/middleware.unaryHandlerWithMiddleware.Handle\n\t/go/pkg/mod/go.uber.org/yarpc@v1.42.0/api/middleware/inbound.go:71\ngo.uber.org/yarpc/api/transport.InvokeUnaryHandler\n\t/go/pkg/mod/go.uber.org/yarpc@v1.42.0/api/transport/handler_invoker.go:70\ngo.uber.org/yarpc/transport/tchannel.handler.callHandler\n\t/go/pkg/mod/go.uber.org/yarpc@v1.42.0/transport/tchannel/handler.go:215\ngo.uber.org/yarpc/transport/tchannel.handler.handle\n\t/go/pkg/mod/go.uber.org/yarpc@v1.42.0/transport/tchannel/handler.go:118\ngo.uber.org/yarpc/transport/tchannel.handler.Handle\n\t/go/pkg/mod/go.uber.org/yarpc@v1.42.0/transport/tchannel/handler.go:107\ngithub.com/uber/tchannel-go.channelHandler.Handle\n\t/go/pkg/mod/github.com/uber/tchannel-go@v1.16.0/handlers.go:126\ngithub.com/uber/tchannel-go.(*Connection).dispatchInbound\n\t/go/pkg/mod/github.com/uber/tchannel-go@v1.16.0/inbound.go:203"}

vancexu commented 4 years ago

Hi @avarma2053, can you try ES v6? Cadence runs on ES 6.5.1 OSS, which is the version we have fully tested. We haven't had a chance to try the latest ES version, so we're unsure about any breaking changes there.

BTW, RunID is of type keyword instead of text.

avarma2053 commented 4 years ago

Hi @vancexu, I tried this with ES 6.8.8 and I get the same error. And yes, in my index RunID is a text field with a keyword sub-field. Here are my index details: { "cadence-visibility": { "aliases": {}, "mappings": { "_doc": { "properties": { "Attr": { "properties": { "BinaryChecksums": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } } } }, "CloseStatus": { "type": "long" }, "CloseTime": { "type": "long" }, "DomainID": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "ExecutionTime": { "type": "long" }, "HistoryLength": { "type": "long" }, "KafkaKey": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "RunID": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "StartTime": { "type": "long" }, "WorkflowID": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "WorkflowType": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } } } } }, "settings": { "index": { "creation_date": "1593629104948", "number_of_shards": "5", "number_of_replicas": "1", "uuid": "lVU0UmefRS2-Q9Nz5ajPTQ", "version": { "created": "6080899" }, "provided_name": "cadence-visibility" } } } }
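
A mapping like the one above can be checked mechanically for fields whose top-level type is text, since those are the ones that fail when sorted on directly. This is a rough sketch under stated assumptions: `text_fields` is a hypothetical helper, and `MAPPING` is a trimmed copy of the properties posted above rather than the full index.

```python
from typing import Any, Dict, List


def text_fields(properties: Dict[str, Any], prefix: str = "") -> List[str]:
    """Return dotted paths of all fields mapped with top-level type 'text'."""
    found: List[str] = []
    for name, spec in properties.items():
        path = prefix + name
        if spec.get("type") == "text":
            found.append(path)
        # Object fields (such as Attr) nest further properties.
        if "properties" in spec:
            found.extend(text_fields(spec["properties"], path + "."))
    return sorted(found)


# Trimmed copy of the mapping from the comment above.
KEYWORD_SUBFIELD = {"keyword": {"type": "keyword", "ignore_above": 256}}
MAPPING = {
    "Attr": {
        "properties": {
            "BinaryChecksums": {"type": "text", "fields": KEYWORD_SUBFIELD},
        }
    },
    "CloseTime": {"type": "long"},
    "RunID": {"type": "text", "fields": KEYWORD_SUBFIELD},
    "StartTime": {"type": "long"},
    "WorkflowID": {"type": "text", "fields": KEYWORD_SUBFIELD},
}

print(text_fields(MAPPING))
```

In this mapping the keyword version of each string field exists only as a sub-field (for example RunID.keyword), so a sort on plain RunID hits the text field and triggers the fielddata error.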

vancexu commented 4 years ago

@avarma2053 I think the use of "fields" (text with a keyword sub-field) in your index mapping might be causing this error. Can you try an index like:

{
  "cadence-visibility-dev" : {
    "aliases" : { },
    "mappings" : {
      "_doc" : {
        "dynamic" : "false",
        "properties" : {
          "Attr" : {
            "properties" : {
              "CadenceChangeVersion" : {
                "type" : "keyword"
              },
              "CustomBoolField" : {
                "type" : "boolean"
              },
              "CustomDatetimeField" : {
                "type" : "date"
              },
              "CustomDoubleField" : {
                "type" : "double"
              },
              "CustomIntField" : {
                "type" : "long"
              },
              "CustomKeywordField" : {
                "type" : "keyword"
              },
              "CustomStringField" : {
                "type" : "text"
              },
              "RolloutID" : {
                "type" : "keyword"
              }
            }
          },
          "CloseStatus" : {
            "type" : "integer"
          },
          "CloseTime" : {
            "type" : "long"
          },
          "DomainID" : {
            "type" : "keyword"
          },
          "ExecutionTime" : {
            "type" : "long"
          },
          "HistoryLength" : {
            "type" : "integer"
          },
          "KafkaKey" : {
            "type" : "keyword"
          },
          "RunID" : {
            "type" : "keyword"
          },
          "StartTime" : {
            "type" : "long"
          },
          "WorkflowID" : {
            "type" : "keyword"
          },
          "WorkflowType" : {
            "type" : "keyword"
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1563476326142",
        "number_of_shards" : "5",
        "number_of_replicas" : "0",
        "uuid" : "9W-WY7y1SymnvWjFriN6rg",
        "version" : {
          "created" : "6050199"
        },
        "provided_name" : "cadence-visibility-dev"
      }
    }
  }
}
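
The JSON above is the output of a GET on an existing index, so it includes values that Elasticsearch generates itself. Below is a hedged sketch of turning such output into a body for creating a new index. The helper name `create_index_body` is hypothetical, the sample is a trimmed copy of the index above, and the list of generated settings is an assumption based on the fields visible in this thread (creation_date, uuid, version, provided_name), which to my knowledge cannot be set explicitly on create.

```python
import json
from typing import Any, Dict

# Settings reported by GET that are generated by Elasticsearch (assumption:
# these are rejected if sent back in a create-index request).
GENERATED_SETTINGS = {"creation_date", "uuid", "version", "provided_name"}


def create_index_body(get_index_response: Dict[str, Any], index_name: str) -> Dict[str, Any]:
    """Keep mappings and user-settable settings; drop aliases and generated values."""
    index = get_index_response[index_name]
    settings = {
        key: value
        for key, value in index["settings"]["index"].items()
        if key not in GENERATED_SETTINGS
    }
    return {"mappings": index["mappings"], "settings": {"index": settings}}


# Trimmed copy of the suggested index above.
RESPONSE = {
    "cadence-visibility-dev": {
        "aliases": {},
        "mappings": {
            "_doc": {
                "dynamic": "false",
                "properties": {
                    "RunID": {"type": "keyword"},
                    "StartTime": {"type": "long"},
                },
            }
        },
        "settings": {
            "index": {
                "creation_date": "1563476326142",
                "number_of_shards": "5",
                "number_of_replicas": "0",
                "uuid": "9W-WY7y1SymnvWjFriN6rg",
                "version": {"created": "6050199"},
                "provided_name": "cadence-visibility-dev",
            }
        },
    }
}

body = create_index_body(RESPONSE, "cadence-visibility-dev")
print(json.dumps(body, indent=2))
```

Because the type of an existing field cannot be changed in place, a body like this would be used to create a fresh index before any workflows are written to it.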

avarma2053 commented 4 years ago

Interesting... I didn't create the index myself; either Cadence or an Elasticsearch feature did. Sure, I will test with the index you suggested. One more thing I didn't understand in the code: why is RunID used for sorting when retrieving workflows? RunID can be anything, such as a UUID.

vancexu commented 4 years ago

The ES search API with from + size doesn't support deep page retrieval (the default limit is 10k docs), so we combine it with the SearchAfter API to better support Cadence visibility search. SearchAfter requires a tiebreaker in the sort fields, and RunID was chosen because it is meaningful and unique. Even though RunID is a UUID, the performance impact should be limited because it is only a tiebreaker. (Performance tests show latency increasing from 200ms to 400ms when switching from from + size to SearchAfter with the tiebreaker added.)
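
To illustrate why the tiebreaker matters, here is a small in-memory simulation of paging that resumes strictly after the last hit's sort values. It does not call Elasticsearch or reproduce Cadence's actual query: the helper names `page` and `list_all`, the ascending sort on StartTime, and the sample documents are all assumptions made for the sketch.

```python
from typing import Dict, List, Optional, Tuple

Doc = Dict[str, object]


def page(docs: List[Doc], size: int, after: Optional[Tuple] = None,
         use_tiebreaker: bool = True) -> Tuple[List[Doc], Optional[Tuple]]:
    """Return one page of hits plus the sort values of the last hit."""
    def key(doc: Doc) -> Tuple:
        if use_tiebreaker:
            return (doc["StartTime"], doc["RunID"])
        return (doc["StartTime"],)

    ordered = sorted(docs, key=key)
    if after is not None:
        # Resume strictly after the previous page's last sort values.
        ordered = [doc for doc in ordered if key(doc) > after]
    hits = ordered[:size]
    return hits, (key(hits[-1]) if hits else None)


def list_all(docs: List[Doc], size: int, use_tiebreaker: bool) -> List[str]:
    """Page through every document and collect the RunIDs that were returned."""
    seen: List[str] = []
    after: Optional[Tuple] = None
    while True:
        hits, after = page(docs, size, after, use_tiebreaker)
        if not hits:
            return seen
        seen.extend(str(doc["RunID"]) for doc in hits)


# Three workflows share the same StartTime, which is where ties occur.
DOCS: List[Doc] = [
    {"RunID": "run-a", "StartTime": 100},
    {"RunID": "run-b", "StartTime": 100},
    {"RunID": "run-c", "StartTime": 100},
    {"RunID": "run-d", "StartTime": 200},
]

print(list_all(DOCS, size=2, use_tiebreaker=True))
print(list_all(DOCS, size=2, use_tiebreaker=False))
```

Without the tiebreaker, run-c is never returned: it has the same StartTime as the last hit of the first page, so resuming strictly after that value skips it. Sorting on a unique second field such as RunID gives every document a distinct position.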