tahseen / amazon-cloudsearch-client-java

Amazon CloudSearch Client for Document Service and Searching

Error when using UTF-8 characters that are not allowed in XML #5

Closed Haiperlink closed 7 years ago

Haiperlink commented 9 years ago

I've stumbled over an issue with UTF-8 characters that seem not to be allowed in CloudSearch. AWS returns an exception on such requests. My message contained the \u0007 character "BELL" (https://codepoints.net/U+0007?lang=en). As far as I understand, this is because these characters are not allowed in XML.

aws.services.cloudsearchv2.AmazonCloudSearchRequestException: {
    "__type": "#DocumentServiceException",
    "adds": 0,
    "deletes": 0,
    "errors": [{"message": "[*Deprecated*: Use the outer message field] Validation error for field 'passenger_comments': Invalid codepoint 7"}],
    "message": "{ [\"Validation error for field 'passenger_comments': Invalid codepoint 7\"] }",
    "status": "error",
    "warnings": [{"message": "\"add\" has unknown attribute(s): ['lang', 'version'] (near operation with index 1; document_id 9716739)"}]
}
        at aws.services.cloudsearchv2.AmazonCloudSearchClient.updateDocumentRequest(AmazonCloudSearchClient.java:258) ~[amazon-cloudsearch-client-1.3.jar:na]
...

Amazon recommends the following:

Both JSON and XML batches can only contain UTF-8 characters that are valid in XML. Valid characters are the control characters tab (0009), carriage return (000D), and line feed (000A), and the legal characters of Unicode and ISO/IEC 10646. FFFE, FFFF, and the surrogate blocks D800–DBFF and DC00–DFFF are invalid and will cause errors. (For more information, see Extensible Markup Language (XML) 1.0 (Fifth Edition).) You can use the following regular expression to match invalid characters so you can remove them: /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/

http://docs.aws.amazon.com/cloudsearch/latest/developerguide/preparing-data.html

As AWS is already throwing an exception, I don't know whether you want to do anything about it or leave it to the user of the framework. I just wanted to raise the topic.

tahseen commented 9 years ago

Either the AWS documentation is incomplete or something else is going on. The regular expression to remove invalid characters, /[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/, does not match \u0007. It would probably be a good idea to find out what is actually going on first.

Haiperlink commented 9 years ago

I would say it does. The ^ negates the character class, so it matches every character except the allowed ones (\u0009, \u000a, \u000d, ...). Everything it matches should be removed, which is also explained in the text:

Valid characters are the control characters tab (0009), carriage return (000D), and line feed (000A), [...]

tahseen commented 9 years ago

Yes, you are right, I missed the ^.

Maybe a convenience method to automatically remove invalid characters: public void addFieldAutoRemoveInvalidChars(String name, String value)

Although the method name is an eyesore.
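
A minimal sketch of what the stripping part of such a convenience method could look like in Java, assuming the regular expression from the AWS documentation quoted above. The class `XmlSafeText` and the method `stripInvalidChars` are hypothetical names and not part of this client; an `addFieldAutoRemoveInvalidChars` would presumably run the value through something like this before adding the field.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of amazon-cloudsearch-client-java.
final class XmlSafeText {
    // Negated character class from the AWS docs: matches every character
    // that is NOT allowed in a batch. The backslashes are doubled so that
    // the regex engine, not the Java compiler, interprets the escapes
    // (a raw line-feed escape would break the string literal).
    private static final Pattern INVALID = Pattern.compile(
            "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD]");

    private XmlSafeText() {}

    // Returns the value with every invalid character removed; null stays null.
    static String stripInvalidChars(String value) {
        if (value == null) {
            return null;
        }
        return INVALID.matcher(value).replaceAll("");
    }
}
```

With this, a value containing the BELL character from the report above would lose only that character, while tabs, carriage returns and line feeds are kept.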

jonathanmv commented 8 years ago

There is something missing in the regexp Amazon provides: the global flag g. So the complete regexp (in JavaScript RegExp syntax) should be

/[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/g

The difference is that without the g flag, the replacement only happens for the first match, not for all invalid characters.
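
Java, which this client is written in, has no g flag; the same distinction shows up in which method is called. A small sketch using only standard `String` methods (the class name `ReplaceDemo` and its two methods are made up for illustration): `replaceFirst` behaves like the JavaScript regexp without g, `replaceAll` like the one with g.

```java
// Illustration only; ReplaceDemo is a made-up name.
final class ReplaceDemo {
    // Same character class as in the AWS docs, with doubled backslashes
    // so the regex engine handles the escapes.
    static final String INVALID =
            "[^\\u0009\\u000A\\u000D\\u0020-\\uD7FF\\uE000-\\uFFFD]";

    private ReplaceDemo() {}

    // Removes only the first invalid character (like JavaScript without g).
    static String removeFirst(String s) {
        return s.replaceFirst(INVALID, "");
    }

    // Removes every invalid character (like JavaScript with g).
    static String removeAll(String s) {
        return s.replaceAll(INVALID, "");
    }
}
```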

Dayjo commented 8 years ago

One annoying thing about this from Amazon is that whilst warnings show which document was erroring, errors don't, so I can't identify the specific document that has the problem.
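
One possible workaround is to check documents locally before uploading, so the offending document can be named. The sketch below assumes a batch is available as a map from document ID to field name/value pairs; that shape and the names `InvalidCharFinder`, `isAllowed` and `findInvalid` are assumptions for illustration, not this client's API. The allowed ranges are the ones from the AWS regular expression quoted earlier in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical pre-flight check, not part of amazon-cloudsearch-client-java.
final class InvalidCharFinder {
    private InvalidCharFinder() {}

    // True for the characters the AWS regular expression treats as valid.
    static boolean isAllowed(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    // Returns one message per affected field, naming the document ID,
    // the field and the first invalid code point found in it.
    static List<String> findInvalid(Map<String, Map<String, String>> batch) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> doc : batch.entrySet()) {
            for (Map.Entry<String, String> field : doc.getValue().entrySet()) {
                String value = field.getValue();
                if (value == null) {
                    continue;
                }
                for (int i = 0; i < value.length(); i++) {
                    char c = value.charAt(i);
                    if (!isAllowed(c)) {
                        problems.add("document " + doc.getKey()
                                + ", field '" + field.getKey()
                                + "': invalid codepoint " + (int) c);
                        break; // one report per field is enough
                    }
                }
            }
        }
        return problems;
    }
}
```

Running this over a batch before sending it gives a list that can be logged next to the AWS error, which itself only names the field.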