ntlcn / xmappr

Automatically exported from code.google.com/p/xmappr

UTF-8 Encoded files cause fromXML(stream) parser to fail. #32

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Build or use a project that calls the XMappr.fromXML function.
2. Identify the XML file in use.
3. Open the file in Notepad++ and change the encoding to UTF-8.
4. Do a file diff to see the special characters introduced by that encoding.
5. Run the test project again and get the error: org.xmappr.XmapprException: Error reading XML stream: ParseError at [row,col]:[1,1]
6. Change the encoding back to ANSI and repeat the test; it now runs without failure.

Original issue reported on code.google.com by d...@morris2morris.com on 19 Nov 2010 at 7:17

GoogleCodeExporter commented 8 years ago
Your editor produces a file with a UTF-8 BOM: 
http://www.w3.org/International/questions/qa-utf8-bom.en.php

Try deleting the first three bytes (the BOM is a single character, U+FEFF, stored in UTF-8 as the bytes EF BB BF).
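
As a rough illustration of that suggestion, the sketch below checks whether a byte array starts with the UTF-8 BOM (EF BB BF) and returns a copy without it. The class name `Utf8Bom` and its methods are hypothetical helpers made up for this example; they are not part of Xmappr.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Xmappr: detects and strips a leading UTF-8 BOM.
final class Utf8Bom {
    // U+FEFF encoded as UTF-8: one character, three bytes.
    private static final byte[] BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    private Utf8Bom() {}

    // Returns true if the data begins with the bytes EF BB BF.
    static boolean startsWithBom(byte[] data) {
        return data.length >= BOM.length
                && data[0] == BOM[0]
                && data[1] == BOM[1]
                && data[2] == BOM[2];
    }

    // Returns the data without a leading BOM; input without a BOM is returned unchanged.
    static byte[] strip(byte[] data) {
        return startsWithBom(data)
                ? Arrays.copyOfRange(data, BOM.length, data.length)
                : data;
    }
}
```

One way to use it would be to read the whole file into memory, call `Utf8Bom.strip`, and hand a `ByteArrayInputStream` over the result to the parser instead of the raw file stream.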

Original comment by pe...@knego.net on 19 Nov 2010 at 7:56

GoogleCodeExporter commented 8 years ago
This error is produced by a UTF-8 BOM at the beginning of the file: 
http://www.w3.org/International/questions/qa-utf8-bom.en.php

A BOM is used to define byte order on different transport streams. 

Xmappr is an XML parsing library and is not concerned with transport protocol 
issues.

Even the JVM itself is not concerned with the BOM: 
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058

Original comment by peter.kn...@gmail.com on 19 Nov 2010 at 8:09

GoogleCodeExporter commented 8 years ago
That's understandable, I guess. And yes, deleting the first three characters 
might work, but the easier fix is just to change the file's encoding to ANSI. 
However, that takes away from the ease with which the code in the examples 
works: people at least need to know that if their file uses the UTF-8 encoding, 
they will have problems with the provided example. It would be nice to see 
that 'gotcha' listed somewhere, if not as a bug, then at least to make users 
aware of it. 

Original comment by d...@morris2morris.com on 19 Nov 2010 at 8:54

GoogleCodeExporter commented 8 years ago
This is really a speciality of Microsoft software. Other systems don't do this.

And you are right - we can detect those three bytes and just discard them.
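
A minimal sketch of what such detection could look like on the caller's side, assuming the input arrives as a plain `InputStream`. The class `BomSkipper` and its method `skipUtf8Bom` are hypothetical names for this example and do not describe how Xmappr itself handles (or will handle) the BOM.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical wrapper, not part of Xmappr: drops a leading UTF-8 BOM from a stream.
final class BomSkipper {
    private BomSkipper() {}

    // Returns a stream positioned after the BOM if one is present,
    // or at the very first byte of the input if it is not.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // A single read may return fewer than 3 bytes, so loop until 3 bytes or EOF.
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean isBom = n == 3
                && head[0] == (byte) 0xEF
                && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF;
        // Not a BOM: push the bytes back so the parser sees them.
        if (!isBom && n > 0) {
            pushback.unread(head, 0, n);
        }
        return pushback;
    }
}
```

The returned stream can then be passed to the parser in place of the original one, so files saved with or without a BOM are treated the same way.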

Original comment by pe...@knego.net on 19 Nov 2010 at 9:44

GoogleCodeExporter commented 8 years ago
Also, the encoding in XML is defined by this header, not by the BOM:

<?xml version="1.0" encoding="utf-8"?>

Original comment by pe...@knego.net on 19 Nov 2010 at 9:49