Haiperlink closed 7 years ago
Either the AWS documentation is incomplete or there is something else going on. The regular expression to remove invalid characters, /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/, does not match \u0007. It would probably be a good idea to find out what is actually going on first.
I would say it does. The ^ negates the character class, so the expression matches every character except the allowed ones (\u0009, \u000a, \u000d, ...). Everything it matches should be removed, which is also explained in the text:
Valid characters are the control characters tab (0009), carriage return (000D), and line feed (000A), [...]
Yes, you are right, I missed the ^.
Maybe a convenience method to automatically remove invalid characters: public void addFieldAutoRemoveInvalidChars(String name, String value)
Although the method name is an eyesore.
There is something missing from the regexp Amazon provides: the global flag g. So the complete regexp should be
JavaScript RegExp syntax /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/g
The difference is that without the g flag, the replacement is applied only to the first match of an invalid character, not to all of them.
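For the Java side, here is a minimal sketch of the same filter. The class name `InvalidCharStripper` and the method `strip` are hypothetical and not part of this library; the character class is the one from the AWS documentation. Java's `Matcher.replaceAll` already replaces every match, so no equivalent of the g flag is needed. Note that this range also removes characters above U+FFFF (for example emoji), which may or may not be what you want.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the library discussed in this thread.
public final class InvalidCharStripper {

    // Negated character class: matches anything that is NOT tab, line feed,
    // carriage return, U+0020..U+D7FF or U+E000..U+FFFD.
    // The escapes are double-backslashed so the regex engine, not the Java
    // compiler, interprets them.
    private static final Pattern INVALID = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD]");

    private InvalidCharStripper() {
    }

    // Returns the value with every invalid character removed.
    public static String strip(String value) {
        return INVALID.matcher(value).replaceAll("");
    }
}
```

A method such as the proposed addFieldAutoRemoveInvalidChars could call `InvalidCharStripper.strip(value)` before adding the field.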
One annoying thing about this from Amazon is that whilst warnings show which document was erroring, errors don't, so I can't identify the specific document that has the problem.
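One possible workaround, sketched under the assumption that you can inspect each document's field values before uploading: check them locally and log which document contains an invalid character. The class `InvalidCharFinder` and its method are hypothetical names used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for locating offending documents before upload.
public final class InvalidCharFinder {

    // Same negated character class as in the AWS documentation.
    private static final Pattern INVALID = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD]");

    private InvalidCharFinder() {
    }

    // Returns the index of the first invalid character, or -1 if there is none.
    public static int firstInvalidIndex(String value) {
        Matcher matcher = INVALID.matcher(value);
        return matcher.find() ? matcher.start() : -1;
    }
}
```

Running this over every field of a batch and logging the document ID whenever the result is not -1 should narrow down which document triggers the error.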
I've stumbled over an issue with UTF-8 characters which seem not to be allowed in CloudSearch. AWS returns an exception on such requests. My message contained the \u0007 character "BELL" (https://codepoints.net/U+0007?lang=en). As far as I understand, this is because these characters are not allowed in XML.
Amazon recommends the following:
http://docs.aws.amazon.com/cloudsearch/latest/developerguide/preparing-data.html
As AWS is already throwing an exception, I don't know whether you want to do anything about it or leave it to the user of the framework. I just wanted to raise the topic.