What steps will reproduce the problem?
1. Create a text file encoded as UTF-16 little endian.
2. Open the file in a hex editor and remove the BOM (bytes FF FE). Yes, this
deliberately modifies the file to cause the problem, but I have been
encountering many UTF-16 encoded files without a BOM supplied by other
applications, and lacking a BOM does not make a file invalid.
3. Run Ude.Example, passing the path to this BOM-less UTF-16LE file.
4. When UniversalDetector is called, its first check is to look for a BOM.
5. Finding no BOM, the evaluation falls through to the deeper analysis, which
returns a result of encoding = ANSI 1252, which is wrong.
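For anyone who would rather not hand-edit bytes, steps 1 and 2 can be sketched as a small Python script. The function name and the output file name are placeholders I made up for illustration; the only thing relied on is that Python's "utf-16-le" codec writes no BOM.

```python
import codecs
import os
import tempfile


def write_bomless_utf16le(path, text):
    """Write text as UTF-16 little endian without a byte order mark.

    Hypothetical helper for reproducing the report, not part of Ude.
    """
    # Unlike the plain "utf-16" codec, "utf-16-le" does not prepend a BOM.
    data = text.encode("utf-16-le")
    # Guard against the text itself starting with U+FEFF.
    assert not data.startswith(codecs.BOM_UTF16_LE)
    with open(path, "wb") as f:
        f.write(data)
    return data


if __name__ == "__main__":
    # Placeholder file name; pass the resulting path to Ude.Example.
    path = os.path.join(tempfile.gettempdir(), "bomless_utf16le.txt")
    write_bomless_utf16le(path, "Hello, world\r\n" * 10)
    print(path)
```

The resulting file begins directly with the first character's two bytes (48 00 for "H") instead of FF FE.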
What is the expected output?
Expected output is encoding = "UTF-16"
What do you see instead?
"Charset: ASCII, confidence: 1"
What version of the product are you using? On what operating system?
Ude C# port with all current code changes applied
Windows 7 Ultimate SP1 64-bit
Please provide any additional information below.
Larger files (1000 KB+) lacking the BOM tend to produce the result "Charset:
windows-1252, confidence: 0.5"
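As a rough illustration of why such files should be recognizable, here is a minimal Python sketch of a NUL-byte position heuristic. This is not how Ude works and not a proposed patch; the function name, the 4096-byte sample size and both thresholds are my own assumptions, and it only helps for text that is mostly in the Latin range, where one byte of every 16-bit unit is 00.

```python
def guess_bomless_utf16(data, threshold=0.4, noise=0.1):
    """Guess "UTF-16LE" or "UTF-16BE" from NUL byte positions, else None.

    Hypothetical heuristic for illustration only; thresholds are guesses.
    """
    sample = data[:4096]
    pairs = len(sample) // 2
    if pairs == 0:
        return None
    end = pairs * 2
    # Count NUL bytes at even and odd offsets separately.
    even_nuls = sample[0:end:2].count(b"\x00")
    odd_nuls = sample[1:end:2].count(b"\x00")
    # Little endian puts the (zero) high byte second, big endian puts it first.
    if odd_nuls / pairs >= threshold and even_nuls / pairs < noise:
        return "UTF-16LE"
    if even_nuls / pairs >= threshold and odd_nuls / pairs < noise:
        return "UTF-16BE"
    return None


if __name__ == "__main__":
    print(guess_bomless_utf16("Hello, world".encode("utf-16-le")))
    print(guess_bomless_utf16(b"Hello, world"))
```

A check along these lines would have to run before the single-byte probers to make any difference, and it says nothing about UTF-16 text that is mostly outside the Latin range.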
Original issue reported on code.google.com by jeffb...@gmail.com on 17 Sep 2012 at 10:52