google / guava

Google core libraries for Java
Apache License 2.0

Files and Resources do not handle UTF-8 files with BOM #345

Open gissuebot opened 9 years ago

gissuebot commented 9 years ago

Original issue created by kai@google.com on 2010-04-08 at 07:59 PM


By the UTF-8 definition, UTF-8 files are allowed to have an optional leading BOM. This BOM is stupid and pointless, but many Windows apps seem to generate UTF-8 files with the BOM. Guava's classes Files and Resources do not handle UTF-8 files with a BOM. I'm not sure where this fix belongs, or whether it should even be fixed at all (since Windows is being stupid, and people are rightly sick and tired of working around Windows issues). BTW, I don't personally use Windows. I'm reporting this issue only because I maintain a library that uses Guava, and there are some Windows users of my library that are running into this issue.

gissuebot commented 9 years ago

Original comment posted by kevinb@google.com on 2010-04-09 at 07:05 PM


Good to know. Based on this, it will probably make sense for us to check for a byte-order mark and just advance past it. I assume JDK classes like Reader do this?
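A minimal sketch of "check for a byte-order mark and just advance past it", using only JDK classes. This is an illustration, not Guava code: the class name `BomSkippingReaders` and its methods are hypothetical, and it only handles the UTF-8 BOM (bytes EF BB BF).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Guava or the JDK.
final class BomSkippingReaders {
    private static final int BOM_LENGTH = 3; // UTF-8 BOM is EF BB BF

    // Wraps the stream in a UTF-8 Reader, consuming a leading BOM if one is present.
    static Reader openUtf8(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, BOM_LENGTH);
        byte[] head = new byte[BOM_LENGTH];
        int n = 0;
        // A single read() may return fewer bytes than requested, so loop.
        while (n < BOM_LENGTH) {
            int r = pushback.read(head, n, BOM_LENGTH - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == BOM_LENGTH
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            pushback.unread(head, 0, n); // not a BOM: give the bytes back
        }
        return new InputStreamReader(pushback, StandardCharsets.UTF_8);
    }

    // Convenience wrapper for in-memory data; rethrows IOException unchecked.
    static String decodeUtf8SkippingBom(byte[] bytes) {
        try (Reader reader = openUtf8(new ByteArrayInputStream(bytes))) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int r;
            while ((r = reader.read(buf)) != -1) {
                sb.append(buf, 0, r);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Checking at the byte level means the decision is made before any decoding happens, so files without a BOM are passed through unchanged.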

gissuebot commented 9 years ago

Original comment posted by fry@google.com on 2011-01-28 at 04:03 PM


(No comment entered for this change.)


Status: Accepted Labels: Type-Defect

gissuebot commented 9 years ago

Original comment posted by mail4danny on 2011-02-04 at 04:18 PM


No, the JDK just quietly ignores this ;-)

gissuebot commented 9 years ago

Original comment posted by finnw1 on 2011-02-04 at 09:01 PM


The JDK does detect (and strip) the BOM for some encodings, e.g.

Standard encodings:
UTF-16
UTF-32

Non-standard encodings (that are reported by Charset.availableCharsets()) on my system:
x-UTF-16LE-BOM
X-UTF-32BE-BOM
X-UTF-32LE-BOM

It's for those that are not expected to contain a BOM that the BOM is returned to the application.

It sounds as though what you want is a standard encoding based on UTF-8 that accepts a BOM, e.g. UTF-8-BOM

But (1) this is a feature request not a defect and
(2) it belongs in the JDK not Guava
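The difference described above can be observed with a small sketch. The class name `JdkBomBehavior` is made up for illustration; the only JDK calls used are `new String(byte[], Charset)` and the `StandardCharsets` constants.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing how two JDK charsets treat a leading BOM.
final class JdkBomBehavior {
    // UTF-8 decoding returns the BOM to the application as U+FEFF.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // UTF-16 decoding uses the BOM to pick the byte order and then drops it.
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] utf16 = {(byte) 0xFE, (byte) 0xFF, 0, 'h', 0, 'i'};

        String fromUtf8 = decodeUtf8(utf8);
        String fromUtf16 = decodeUtf16(utf16);

        System.out.println(fromUtf8.length());                        // 3
        System.out.println(Integer.toHexString(fromUtf8.charAt(0)));  // feff
        System.out.println(fromUtf16.length());                       // 2
    }
}
```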

gissuebot commented 9 years ago

Original comment posted by kevinb@google.com on 2011-07-13 at 06:18 PM


(No comment entered for this change.)


Status: Triaged

gissuebot commented 9 years ago

Original comment posted by fry@google.com on 2011-12-10 at 03:45 PM


(No comment entered for this change.)


Labels: Package-IO

gissuebot commented 9 years ago

Original comment posted by fry@google.com on 2012-02-16 at 07:17 PM


(No comment entered for this change.)


Status: Acknowledged

gissuebot commented 9 years ago

Original comment posted by kevinb@google.com on 2012-06-22 at 06:16 PM


(No comment entered for this change.)


Status: Research

gissuebot commented 9 years ago

Original comment posted by j...@durchholz.org on 2012-12-27 at 10:01 AM


1) The BOM is useful if the program needs a way to autodetect a text file's encoding. This is so even on Unixoid systems: the output from the file command says "UTF-8 Unicode (with BOM) text", so if it's a misfeature, it's not just one of Windows. Of course it's just a heuristic, but heuristics do have their place.

2) Arguing that something belongs in the JDK instead of in Guava ignores the very mission statement of Guava, which is essentially "let's do things right where the JDK dropped the ball". So in fact if the JDK does this wrong, Guava should do something about it.

3) Some programs want to see the BOM, others want to have the BOM skipped for them if it's present. Programs need a way to express that. Using different character sets would cover that.

4) I'm not sure what the semantics of an x-UTF-8-BOM charset would be when writing: Write a BOM or not? The path of least resistance would be to write the BOM with x-UTF-8-BOM and leave it unwritten with UTF-8, but that would punish the best approach (ignore BOM on input, don't write it on output) with the most complicated handling (different character sets for reading and writing).

gissuebot commented 9 years ago

Original comment posted by NikolayMetchev on 2014-01-07 at 11:57 AM


This was filed as a bug in the JDK. They decided not to fix it there for backward compatibility reasons: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058

Google Data API has a solution which could be moved to Guava: https://developers.google.com/gdata/javadoc/com/google/gdata/util/io/base/UnicodeReader?csw=1

garretwilson commented 8 years ago

Any progress on this? Won't Guava help us read a BOM?

jredfox commented 6 years ago

Have the input stream remove any char with the value 65279 (U+FEFF) at index 0. It's not pointless: Notepad uses it to easily determine which UTF encoding the file is in. To be honest, I think this is what file headers are made for. Why not just have a file header with the string "utf-x" in front? It only takes a couple of bytes. But I didn't make the UTF protocol.
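The workaround suggested above, applied to already-decoded text, might look like the following sketch. The class name `LeadingBom` and the method `strip` are hypothetical; the caller is assumed to have read the file contents into a String using UTF-8 by whatever means.

```java
// Hypothetical helper, not part of Guava: drops a leading U+FEFF from decoded text.
final class LeadingBom {
    static final char BOM = '\uFEFF'; // decimal 65279

    // Removes the BOM only when it is the very first char; later occurrences are kept.
    static String strip(String text) {
        if (!text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(strip("\uFEFFkey=value")); // key=value
        System.out.println(strip("key=value"));       // key=value
    }
}
```

Only index 0 is checked because U+FEFF elsewhere in the text is a zero-width no-break space, not a byte-order mark.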