[BUG] OpenSearch accepts broken surrogates with unpredictable results

Describe the bug OpenSearch accepts broken surrogates and leads to the following determined failures:

Snapshot creation fails with the exception highlighted in #6453.
Term match search on the field name with a broken surrogate, does not find any results if a document is created with a broken surrogate as a field mapping.
Indexing a broken surrogate pair with a replacement character leads to a red index with an exception stated in link

There is fair amount of discussion that has happened on the pr, where it is evident that having smile encoding with LENIENT_UTF_ENCODING to pass the snapshot creation is not right as it would end up changing the data after the restore. Hence, we need to fix this such that even before the data reaches the snapshot creation stage, it is fixed.

To Reproduce

Create an index by documenting a broken surrogate pair:

curl -X POST "localhost:9200/index-2/_doc/?pretty" -H 'Content-Type: application/json' -d'
{
"\uDDFD":12
}
'

Try to do a term search:

curl  -H 'Content-Type: application/json' localhost:9200/index-2/_search?pretty -d '{
                   "query" : {
                       "term" : {
                           "\uDDFD": 12
                       }
                   }
               }'

There is no result returned:

{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
  "value" : 0,
  "relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}

Try to take a snapshot of this index
Snapshot fails with the exception highlighted in https://github.com/opensearch-project/OpenSearch/issues/6453#issue-1596094403

Expected behavior The mappings or data having these broken characters should either not be permitted or be replaced at the time of ingestion.

The mappings or data having these broken characters should either not be permitted or be replaced at the time of ingestion.

I think we need to define the desired behavior here, and then we can work through any compatibility concerns. Lucene itself silently replaces a broken surrogate with U+FFFD, but for things like index mapping this probably does not make sense as demonstrated by the case of trying to index both U+D800 and U+FFFD and getting an error because U+D800 is silently changed to be a duplicate. My inclination is to fail with a mapper_parsing_exception any time a mapping contains a broken surrogate. @neetikasinghal Do you have an opinion on the right way to go here?

Regarding the "or data" part of the above statement, I'm less sure we should be so strict with the data. As long as OpenSearch itself doesn't need to parse or understand the content (like it does in the case of field mappings), then I think we can probably just let Lucene do what it does and not add any overhead to attempt to parse or validate separately. I would love to hear opinions from other folks about this though.

opensearch-project / OpenSearch

[BUG] OpenSearch accepts broken surrogates with unpredictable results #8541