Closed GoogleCodeExporter closed 9 years ago
Good to know. Based on this, it will probably make sense for us to check for a
byte-order mark and just advance past it. Do JDK classes like Reader do this, I
assume?
Original comment by kevinb@google.com
on 9 Apr 2010 at 7:05
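A minimal sketch of the "check for a byte-order mark and just advance past it" idea, assuming the input has already been decoded into characters. The class name BomSkipping and the method skipBom are hypothetical, made up for illustration; they are not part of Guava or the JDK.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;

final class BomSkipping {
  // U+FEFF is what a decoded byte-order mark looks like at the character level.
  private static final int BOM = 0xFEFF;

  /**
   * Hypothetical helper: returns a reader positioned just after a leading
   * U+FEFF if one is present, and at the very start otherwise.
   */
  static Reader skipBom(Reader in) throws IOException {
    PushbackReader pushback = new PushbackReader(in, 1);
    int first = pushback.read();
    // Push the character back unless it was the BOM or end of stream.
    if (first != -1 && first != BOM) {
      pushback.unread(first);
    }
    return pushback;
  }
}
```

Wrapping a reader this way only drops a single leading U+FEFF; any later occurrence in the stream is passed through unchanged.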
[deleted comment]
Original comment by fry@google.com
on 28 Jan 2011 at 4:03
No, the JDK just quietly ignores this ;-)
Original comment by mail4da...@gmail.com
on 4 Feb 2011 at 4:18
The JDK does detect (and strip) the BOM for some encodings, e.g.
Standard encodings:
UTF-16
UTF-32
Non-standard encodings (that are reported by Charset.availableCharsets()) on my
system:
x-UTF-16LE-BOM
X-UTF-32BE-BOM
X-UTF-32LE-BOM
It's for those that are not expected to contain a BOM that the BOM is returned
to the application.
It sounds as though what you want is a standard encoding based on UTF-8 that
accepts a BOM, e.g.
UTF-8-BOM
But (1) this is a feature request not a defect and
(2) it belongs in the JDK not Guava
Original comment by fin...@gmail.com
on 4 Feb 2011 at 9:01
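To illustrate the difference described above, here is a small sketch that decodes the letter "A" preceded by a BOM, once as UTF-16 and once as UTF-8. The class name BomDecodingDemo and its helper methods are made up for this example; only String and StandardCharsets come from the JDK.

```java
import java.nio.charset.StandardCharsets;

final class BomDecodingDemo {
  // Decodes with the JDK's UTF-16 charset, which uses the BOM to pick a byte order.
  static String decodeUtf16(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_16);
  }

  // Decodes with the JDK's UTF-8 charset, which does not expect a BOM.
  static String decodeUtf8(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    byte[] utf16WithBom = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
    byte[] utf8WithBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x41};

    // The UTF-16 decoder consumes the BOM, so only "A" remains.
    System.out.println(decodeUtf16(utf16WithBom).length()); // prints 1
    // The UTF-8 decoder hands U+FEFF to the application, followed by "A".
    System.out.println(decodeUtf8(utf8WithBom).length()); // prints 2
  }
}
```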
Original comment by kevinb@google.com
on 13 Jul 2011 at 6:18
Original comment by fry@google.com
on 10 Dec 2011 at 3:45
Original comment by fry@google.com
on 16 Feb 2012 at 7:17
Original comment by kevinb@google.com
on 22 Jun 2012 at 6:16
1) The BOM is useful if the program needs a way to autodetect a text file's
encoding. This holds even on Unix-like systems: the output of the file command
says "UTF-8 Unicode (with BOM) text", so if it's a misfeature, it's not just
a Windows one. Of course it's just a heuristic, but heuristics do have their
place.
2) Arguing that something belongs in the JDK instead of in Guava ignores
the very mission statement of Guava, which is essentially "let's do things
right where the JDK dropped the ball". So in fact if the JDK does this wrong,
Guava should do something about it.
3) Some programs want to see the BOM, others want to have the BOM skipped for
them if it's present. Programs need a way to express that. Using different
character sets would cover that.
4) I'm not sure what the semantics of an x-UTF-8-BOM charset would be when
writing: write a BOM or not? The path of least resistance would be to write
the BOM with x-UTF-8-BOM and leave it unwritten with UTF-8, but that would
punish the best approach (ignore BOM on input, don't write it on output) with
the most complicated handling (different character sets for reading and
writing).
Original comment by j...@durchholz.org
on 27 Dec 2012 at 10:01
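As a rough illustration of point 1 (using the BOM as an encoding heuristic), the sketch below inspects the first bytes of a file and guesses a charset. BomSniffer and detect are hypothetical names, not an existing Guava or JDK API. The sketch deliberately ignores UTF-32, whose little-endian BOM (FF FE 00 00) begins with the same two bytes as the UTF-16LE one, so it is a heuristic under those stated assumptions only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

final class BomSniffer {
  /**
   * Hypothetical helper: guesses the charset from a leading BOM in the first
   * bytes of a file, or returns empty if no known BOM is present.
   * Assumption: UTF-32 input is out of scope for this sketch.
   */
  static Optional<Charset> detect(byte[] head) {
    if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
      return Optional.of(StandardCharsets.UTF_8);
    }
    if (startsWith(head, 0xFE, 0xFF)) {
      return Optional.of(StandardCharsets.UTF_16BE);
    }
    if (startsWith(head, 0xFF, 0xFE)) {
      return Optional.of(StandardCharsets.UTF_16LE);
    }
    return Optional.empty();
  }

  // Compares the leading bytes of data, as unsigned values, against prefix.
  private static boolean startsWith(byte[] data, int... prefix) {
    if (data.length < prefix.length) {
      return false;
    }
    for (int i = 0; i < prefix.length; i++) {
      if ((data[i] & 0xFF) != prefix[i]) {
        return false;
      }
    }
    return true;
  }
}
```

A caller that wants the BOM skipped could use the detected charset and drop the matching prefix bytes; a caller that wants to see the BOM could decode the bytes unchanged.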
This was filed as a bug in the JDK. They decided not to fix it there for
backward compatibility reasons:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4508058
Google Data API has a solution which could be moved to Guava:
https://developers.google.com/gdata/javadoc/com/google/gdata/util/io/base/UnicodeReader?csw=1
Original comment by NikolayM...@gmail.com
on 7 Jan 2014 at 11:57
This issue has been migrated to GitHub.
It can be found at https://github.com/google/guava/issues/<id>
Original comment by cgdecker@google.com
on 1 Nov 2014 at 4:15
Original comment by cgdecker@google.com
on 1 Nov 2014 at 4:19
Original comment by cgdecker@google.com
on 3 Nov 2014 at 9:10
Original issue reported on code.google.com by
k...@google.com
on 8 Apr 2010 at 7:59