Open neetikasinghal opened 1 year ago
The mappings or data having these broken characters should either not be permitted or be replaced at the time of ingestion.
I think we need to define the desired behavior here, and then we can work through any compatibility concerns. Lucene itself silently replaces a broken surrogate with U+FFFD, but for things like index mappings this probably does not make sense, as demonstrated by the case of trying to index both U+D800 and U+FFFD and getting an error because U+D800 is silently changed to be a duplicate. My inclination is to fail with a mapper_parsing_exception any time a mapping contains a broken surrogate. @neetikasinghal Do you have an opinion on the right way to go here?
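To make the collision concrete, here is a minimal sketch in Python (not OpenSearch code). The helper names are hypothetical, and the U+FFFD substitution only imitates the Lucene behavior described above. It assumes the text came from a decoder such as `json.loads`, which merges valid surrogate pairs into one code point, so any surrogate code point still present is a broken one.

```python
import json

REPLACEMENT = "\ufffd"  # U+FFFD, the Unicode replacement character


def is_surrogate(ch: str) -> bool:
    """True if the code point lies in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= ord(ch) <= 0xDFFF


def has_broken_surrogate(text: str) -> bool:
    """Assumes valid pairs were already merged, so any surrogate left is unpaired."""
    return any(is_surrogate(ch) for ch in text)


def replace_broken_surrogates(text: str) -> str:
    """Imitates the silent replacement described above (illustrative only)."""
    return "".join(REPLACEMENT if is_surrogate(ch) else ch for ch in text)


def validate_field_names(properties: dict) -> None:
    """Hypothetical strict check: reject a mapping instead of rewriting it."""
    for name in properties:
        if has_broken_surrogate(name):
            raise ValueError(
                f"mapper_parsing_exception: field name {name!r} "
                "contains a broken surrogate"
            )


# Two field names that are distinct when the mapping is parsed...
mapping = json.loads(
    '{"\\ud800": {"type": "keyword"}, "\\ufffd": {"type": "keyword"}}'
)
# ...but identical once the broken surrogate is silently replaced.
after_replacement = {replace_broken_surrogates(name) for name in mapping}
```

Here `mapping` has two keys while `after_replacement` has one, which is the duplicate described above. Calling `validate_field_names(mapping)` raises before any replacement can happen, which is roughly what failing with a mapper_parsing_exception would look like.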
Regarding the "or data" part of the above statement, I'm less sure we should be so strict with the data. As long as OpenSearch itself doesn't need to parse or understand the content (like it does in the case of field mappings), then I think we can probably just let Lucene do what it does and not add any overhead to attempt to parse or validate separately. I would love to hear opinions from other folks about this though.
Describe the bug
OpenSearch accepts broken surrogates, which leads to the following deterministic failures:
There has been a fair amount of discussion on the PR, where it became evident that using SMILE encoding with LENIENT_UTF_ENCODING to get snapshot creation to pass is not the right approach, because it would change the data after the restore. Hence, we need to fix this so that the broken characters are dealt with before the data ever reaches the snapshot creation stage.
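The reason a lenient encoder changes the data can be shown with a small Python sketch. It does not use SMILE or Jackson. `lenient_utf8_encode` is a hypothetical stand-in that models lenient encoding as substituting U+FFFD for a broken surrogate, which is an assumption based on the description above.

```python
def lenient_utf8_encode(text: str) -> bytes:
    """Hypothetical lenient encoder: swap broken surrogates for U+FFFD, then encode."""
    cleaned = "".join(
        "\ufffd" if 0xD800 <= ord(ch) <= 0xDFFF else ch for ch in text
    )
    return cleaned.encode("utf-8")


def strict_utf8_encode(text: str) -> bytes:
    """Strict encoder: raises UnicodeEncodeError on a broken surrogate."""
    return text.encode("utf-8")


original = "id-\ud800"  # value containing a lone high surrogate

# Simulated snapshot followed by restore through the lenient path.
restored = lenient_utf8_encode(original).decode("utf-8")
data_changed = restored != original  # the round trip is lossy

# The strict path refuses the value instead of altering it.
try:
    strict_utf8_encode(original)
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

In this sketch the lenient path succeeds but `restored` is no longer equal to `original`, while the strict path fails outright. Neither outcome is acceptable at snapshot time, which is why the fix needs to happen earlier.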
To Reproduce
There is no result returned:
Expected behavior
The mappings or data having these broken characters should either not be permitted or be replaced at the time of ingestion.