terascope / teraslice

Scalable data processing pipelines in JavaScript
https://terascope.github.io/teraslice/
Apache License 2.0
50 stars 13 forks source link

recovery mapping collisions #3379

Closed jsnoble closed 11 months ago

jsnoble commented 1 year ago

I am coming across a race condition that an execution ran under recovery has a mapping collusion with its slicer_stats object and fails to be marked complete when its done. Currently this seems to only manifest with opensearch v2.

From the error it seems the value queuing_complete is instantiated with an empty string, but throws when a date is given at the end.

I checked the mapping for ex store and _slicer_stats is set to type object, but currently there are no schema values for all the keys

            workers_available: 0,
            workers_active: 0,
            workers_joined: 0,
            workers_reconnected: 0,
            workers_disconnected: 0,
            job_duration: 0,
            failed: 0,
            subslices: 0,
            queued: 0,
            slice_range_expansion: 0,
            processed: 0,
            slicers: 0,
            subslice_by_key: 0,
            started: '',
            queuing_complete: ''

These are the initial values

I think it would be wise to add the fields to the schema, just don't know if it will have any additional effects to anything old

The alternative would be to cull any empty values when they are saved to the execution context so it only updates when it has a real value

godber commented 1 year ago

Can you link to the test or tests are failing and the test runs (assuming they're in Github Actions) that have failed?

godber commented 1 year ago

I think this might be an example of the failure:

https://github.com/terascope/teraslice/actions/runs/5600622766/jobs/10243288784#step:9:150

godber commented 1 year ago

And this the failing test code:

https://github.com/terascope/teraslice/blob/3d0259c9d3f80d0211d3ab6a6ed1a13abf703f51/packages/teraslice/test/workers/execution-controller/execution-special-test-cases-spec.js#L568

godber commented 1 year ago

Comparing two active Opensearch state clusters I can see that the schema on the Opensearch 1 has the following for _slicer_stats

        "_slicer_stats": {
          "type": "object"
        },

And the Opensearch 2 cluster has:

        "_slicer_stats": {
          "properties": {
            "failed": {
              "type": "long"
            },
            "job_duration": {
              "type": "long"
            },
            "processed": {
              "type": "long"
            },
            "queued": {
              "type": "long"
            },
            "queuing_complete": {
              "type": "date"
            },
           ...
          }
        },

So it is as if Opensearch 2 infers types in nested objects wheras Opensearch 1 did not. I think you were on to this already Jared.

godber commented 1 year ago

Wouldn't you still have this issue if you added a mapping? Since the mapping you add would be type date, and the code is still (presumably) trying to write a queuing_complete = '' to the field, I would expect that it fails, right?

jsnoble commented 1 year ago

Part of the solution is to set that to undefined, as there is no good reason to instantiate that field to an empty string. I am trying to check if there are any ramifications to this.

godber commented 1 year ago

We have tested a typical recovery scenario on OpenSearch 2.8 and it works correctly.

godber commented 1 year ago

I don't really see a difference in behavior between OpenSearch 1.3.6 and OpenSearch 2.8.0. My test was as follows:

GET /

PUT /testindex1
{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 1
    }
  },
  "mappings": {
    "properties": {
      "age": {
        "type": "integer"
      },
      "patient": {
        "type": "object"
      }      
    }
  }
}

GET testindex1/_mapping

#{
# "testindex1" : {
#    "mappings" : {
#      "properties" : {
#        "age" : {
#          "type" : "integer"
#        },
#        "patient" : {
#          "type" : "object"
#        }
#      }
#    }
#  }
#}

PUT testindex1/_doc/1
{ 
  "age": 5,
  "office": "Mesa",
  "patient": { 
    "name" : "John Doe",
    "id" : "123456"
  }
}

GET testindex1/_mapping

# {
#   "testindex1" : {
#     "mappings" : {
#       "properties" : {
#         "age" : {
#           "type" : "integer"
#         },
#         "office" : {
#           "type" : "text",
#           "fields" : {
#             "keyword" : {
#               "type" : "keyword",
#               "ignore_above" : 256
#             }
#           }
#         },
#         "patient" : {
#           "properties" : {
#             "id" : {
#               "type" : "text",
#               "fields" : {
#                 "keyword" : {
#                   "type" : "keyword",
#                   "ignore_above" : 256
#                 }
#               }
#             },
#             "name" : {
#               "type" : "text",
#               "fields" : {
#                 "keyword" : {
#                   "type" : "keyword",
#                   "ignore_above" : 256
#                 }
#               }
#             }
#           }
#         }
#       }
#     }
#   }
# }

PUT testindex1/_doc/2
{ 
  "age": 50,
  "office": "Mesa",
  "patient": { 
    "name" : "Jane Doe",
    "id" : "123456",
    "birthdate": "1973-07-06"
  }
}

GET testindex1

PUT testindex1/_doc/3
{ 
  "age": 50,
  "office": "Mesa",
  "patient": { 
    "name" : "Julie Doe",
    "id" : "123456",
    "birthdate": ""
  }
}

GET testindex1

DELETE testindex1

In both cases when I try to insert record 3 OpenSearch returns an error:

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse field [patient.birthdate] of type [date] in document with id '3'. Preview of field's value: ''"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse field [patient.birthdate] of type [date] in document with id '3'. Preview of field's value: ''",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "cannot parse empty date"
    }
  },
  "status": 400
}
godber commented 1 year ago

Though inserting a birthdate of null does work:

PUT testindex1/_doc/3
{ 
  "age": 50,
  "office": "Mesa",
  "patient": { 
    "name" : "Julie Doe",
    "id" : "123456",
    "birthdate": null
  }
}

Which we would expect given the docs discussion of null handling:

https://opensearch.org/docs/2.8/field-types/supported-field-types/date/

godber commented 1 year ago

Oh, I've just realized the difference between my test above ... and the actual ex indices ... I forgot that we set "dynamic": "false" on our indices. So on an opensearc 1.3.6 state cluster the ex index is:

{
  "teraslice-1__ex": {
    "mappings": {
      "dynamic": "false",
      "properties": {
        "_context": {
          "type": "keyword"
        },
        "_created": {
          "type": "date"
        },
        "_has_errors": {
          "type": "keyword"
        },
        "_slicer_stats": {
          "type": "object"
        },
        "_status": {
          "type": "keyword"
        },
        "_updated": {
          "type": "date"
        },
        "active": {
          "type": "boolean"
        },
        "ex_id": {
          "type": "keyword"
        },
        "job_id": {
          "type": "keyword"
        },
        "metadata": {
          "type": "object",
          "enabled": false
        },
        "recovered_execution": {
          "type": "keyword"
        },
        "recovered_slice_type": {
          "type": "keyword"
        },
        "slicer_hostname": {
          "type": "keyword"
        },
        "slicer_port": {
          "type": "keyword"
        }
      }
    }
  }
}
godber commented 1 year ago

Ok, I've confirmed on Opensearch 1.3.6 that when the dynamic: false setting is set at the top level of the mapping it still applies to fields on nested objects. Of course in the scenario above, that allows testindex1/_doc/3 to be created with an empty string for a birthdate since there is no type stored at that point.

godber commented 1 year ago

I think the conclusion here is that the mapping for _slicer_stats (here https://github.com/terascope/teraslice/pull/3376/files#diff-7c4a98523fd62bcda7fdd1fe4080de6892ea62def582b4428f48e10752fbb7efR46-R90) should be removed if these values are not searched by anything (including the recovery process). That being said should we also exclude the:

"_slicer_stats": {
  "type": "object"
}

Mapping?

godber commented 1 year ago

Brien points out that at least if we declare _slicer_stats to be an object in the mapping but don't include it's subfields ... it may not index anything for a _slicer_stats but it will probably do type checking. Yeah, thats what it does, if you have type object then try and write an integer into the field you get the following error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "mapper_parsing_exception",
        "reason" : "object mapping for [patient] tried to parse field [patient] as object, but found a concrete value"
      }
    ],
    "type" : "mapper_parsing_exception",
    "reason" : "object mapping for [patient] tried to parse field [patient] as object, but found a concrete value"
  },
  "status" : 400
}
godber commented 11 months ago

I noted that this PR closes this issue:

https://github.com/terascope/teraslice/pull/3376#issuecomment-1644352038

closing.