Files and Resources do not handle UTF-8 files with BOM

GoogleCodeExporter commented 9 years ago

By the UTF-8 definition, UTF-8 files are allowed to have an optional leading 
BOM.  This BOM is stupid and pointless, but many Windows apps seem to 
generate UTF-8 files with the BOM.  Guava's classes Files and Resources do 
not handle UTF-8 files with a BOM.  I'm not sure where this fix belongs, or 
whether it should even be fixed at all (since Windows is being stupid, and 
people are rightly sick and tired of working around Windows issues).  BTW, I 
don't personally use Windows.  I'm reporting this issue only because I 
maintain a library that uses Guava, and there are some Windows users of my 
library that are running into this issue.

Original issue reported on code.google.com by k...@google.com on 8 Apr 2010 at 7:59

GoogleCodeExporter commented 9 years ago

Good to know. Based on this, it will probably make sense for us to check for a 
byte-order mark and just advance past it. I assume JDK classes like Reader 
already do this?
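
A minimal sketch of the "check for a byte-order mark and advance past it" idea, using only JDK classes. The class and method names (Utf8BomSkipper, skipBom, readUtf8) are made up for illustration and are not Guava API; readAllBytes() assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Guava or the JDK.
public class Utf8BomSkipper {
    // The UTF-8 encoding of U+FEFF.
    private static final byte[] BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Wraps {@code in} so that a leading UTF-8 BOM, if present, is consumed. */
    public static InputStream skipBom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, BOM.length);
        byte[] head = new byte[BOM.length];
        int n = 0;
        // read() may return fewer bytes than requested, so loop until full or EOF.
        while (n < head.length) {
            int r = pushback.read(head, n, head.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == BOM.length && Arrays.equals(head, BOM);
        if (!hasBom && n > 0) {
            // Not a BOM: put the bytes back so the caller sees them.
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    /** Reads the whole stream as UTF-8 text, ignoring a leading BOM. */
    public static String readUtf8(InputStream in) throws IOException {
        return new String(skipBom(in).readAllBytes(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(readUtf8(new ByteArrayInputStream(withBom)));
    }
}
```

Working on bytes before decoding means the check only ever looks at the first three bytes and leaves streams without a BOM untouched.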

Original comment by kevinb@google.com on 9 Apr 2010 at 7:05

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

Original comment by fry@google.com on 28 Jan 2011 at 4:03

Changed state: Accepted
Added labels: Type-Defect

GoogleCodeExporter commented 9 years ago

No, the JDK just quietly ignores this ;-)

Original comment by mail4da...@gmail.com on 4 Feb 2011 at 4:18

GoogleCodeExporter commented 9 years ago

The JDK does detect (and strip) the BOM for some encodings, e.g.
Standard encodings:
UTF-16
UTF-32
Non-standard encodings (that are reported by Charset.availableCharsets()) on my 
system:
x-UTF-16LE-BOM
X-UTF-32BE-BOM
X-UTF-32LE-BOM
For encodings that are not expected to contain a BOM, the BOM is passed 
through to the application.
It sounds as though what you want is a standard encoding based on UTF-8 that 
accepts a BOM, e.g.
UTF-8-BOM
But (1) this is a feature request, not a defect, and
    (2) it belongs in the JDK, not Guava
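
To illustrate the difference described above, here is a small sketch that decodes the same text with three standard charsets. The class name JdkBomBehaviour and its methods are made up for illustration. As far as I know, UTF-16 consumes the BOM, while UTF-16BE and UTF-8 hand it to the application as a leading U+FEFF character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; uses only standard JDK charsets.
public class JdkBomBehaviour {
    public static String decodeUtf16(byte[] bytes) {
        // UTF-16 uses the BOM to pick the byte order and does not return it.
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static String decodeUtf16Be(byte[] bytes) {
        // UTF-16BE has a fixed byte order, so the BOM is decoded as a character.
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    public static String decodeUtf8(byte[] bytes) {
        // UTF-8 also decodes the BOM as an ordinary U+FEFF character.
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf16 = {(byte) 0xFE, (byte) 0xFF, 0, 'h', 0, 'i'};
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(decodeUtf16(utf16).length());
        System.out.println(decodeUtf16Be(utf16).length());
        System.out.println(decodeUtf8(utf8).length());
    }
}
```

A length of 2 means only "hi" came back; a length of 3 means the BOM survived as the first character.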

Original comment by fin...@gmail.com on 4 Feb 2011 at 9:01

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 13 Jul 2011 at 6:18

Changed state: Triaged

GoogleCodeExporter commented 9 years ago

Original comment by fry@google.com on 10 Dec 2011 at 3:45

Added labels: Package-IO

GoogleCodeExporter commented 9 years ago

Original comment by fry@google.com on 16 Feb 2012 at 7:17

Changed state: Acknowledged

GoogleCodeExporter commented 9 years ago

Original comment by kevinb@google.com on 22 Jun 2012 at 6:16

Changed state: Research

GoogleCodeExporter commented 9 years ago

1) The BOM is useful if the program needs a way to autodetect a text file's 
encoding. This holds even on Unix-like systems: the output of the file command 
says "UTF-8 Unicode (with BOM) text", so if it's a misfeature, it's not just 
one of Windows. Of course it's just a heuristic, but heuristics do have their 
place.
2) Arguing that something belongs in the JDK instead of in Guava ignores 
the very mission statement of Guava, which is essentially "let's do things 
right where the JDK dropped the ball". So in fact if the JDK does this wrong, 
Guava should do something about it.
3) Some programs want to see the BOM, others want to have the BOM skipped for 
them if it's present. Programs need a way to express that. Using different 
character sets would cover that.
4) I'm not sure what the semantics of an x-UTF-8-BOM charset would be when 
writing: write a BOM or not? The path of least resistance would be to write 
the BOM with x-UTF-8-BOM and leave it unwritten with UTF-8, but that would 
punish the best approach (ignore BOM on input, don't write it on output) with 
the most complicated handling (different character sets for reading and 
writing).
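
As a rough sketch of points 3 and 4, a caller could state its preference with a flag instead of a separate charset. Everything here (BomPolicy, BomHandling, utf8Reader, readAll, encode) is a made-up name for illustration, not an existing Guava or JDK API. Writing needs no special case because the JDK's UTF-8 encoder does not emit a BOM.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing one way to let callers choose BOM handling.
public class BomPolicy {
    public enum BomHandling { KEEP, SKIP }

    /** Opens a UTF-8 reader that keeps or drops a leading U+FEFF as requested. */
    public static Reader utf8Reader(InputStream in, BomHandling handling) throws IOException {
        Reader decoded = new InputStreamReader(in, StandardCharsets.UTF_8);
        if (handling == BomHandling.KEEP) {
            return decoded;
        }
        PushbackReader reader = new PushbackReader(decoded, 1);
        int first = reader.read();
        if (first != -1 && first != 0xFEFF) {
            // First character is real content, so put it back.
            reader.unread(first);
        }
        return reader;
    }

    /** Drains a reader into a String. */
    public static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    /** Encodes text as plain UTF-8; the JDK encoder writes no BOM. */
    public static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        String skipped = readAll(utf8Reader(new ByteArrayInputStream(withBom), BomHandling.SKIP));
        String kept = readAll(utf8Reader(new ByteArrayInputStream(withBom), BomHandling.KEEP));
        System.out.println(skipped.length() + " " + kept.length() + " " + encode(skipped).length);
    }
}
```

With this shape the same charset is used for reading and writing, and only the reading side carries the choice.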

Original comment by j...@durchholz.org on 27 Dec 2012 at 10:01

GoogleCodeExporter commented 9 years ago

This was filed as a bug in the JDK. They decided not to fix it there for 
backward compatibility reasons:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
Google Data API has a solution which could be moved to Guava:
https://developers.google.com/gdata/javadoc/com/google/gdata/util/io/base/UnicodeReader?csw=1

Original comment by NikolayM...@gmail.com on 7 Jan 2014 at 11:57

GoogleCodeExporter commented 9 years ago

This issue has been migrated to GitHub.

It can be found at https://github.com/google/guava/issues/<id>

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:15

Added labels: MigratedToGitHub

GoogleCodeExporter commented 9 years ago

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:19

Changed state: Migrated

GoogleCodeExporter commented 9 years ago

Original comment by cgdecker@google.com on 3 Nov 2014 at 9:10

Added labels: Restrict-AddIssueComment-Commit

Files and Resources do not handle UTF-8 files with BOM #345