Closed GoogleCodeExporter closed 9 years ago
Thanks for this report. If the encoding is indeed specified in the
CodedCharacterSet tag, then this sounds like a great one to fix.
Do you have a sample image that I could include in the test suite to reproduce
this bug and ultimately fix it?
Original comment by drewnoakes
on 30 Nov 2011 at 5:04
Actually, there are two issues that need to be addressed:
1. Images with CodedCharacterSet - the character set must be respected
2. Images without CodedCharacterSet (quite frequent) - in this case it must be
possible to specify a default character set to use (and not just use the system
property file.encoding)
Original comment by navaj072...@yahoo.de
on 1 Dec 2011 at 10:07
A sample image without CodedCharacterSet (the umlauts are written in UTF-8).
Original comment by navaj072...@yahoo.de
on 1 Dec 2011 at 10:09
A sample image with CodedCharacterSet was sent by personal email.
Original comment by navaj072...@yahoo.de
on 1 Dec 2011 at 10:18
A nice tool for editing IPTC (and other) meta data manually can be found at
http://www.photome.de/
By the way: The behavior of the String constructor using a byte[] is
unspecified when the given bytes are not valid in the given charset (see Java 6
API doc). The CharsetDecoder seems to be more robust in this case.
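To illustrate the difference, here is a minimal sketch of strict decoding with CharsetDecoder. The class and method names (StrictDecoding, decodeStrict) are made up for this example and are not part of metadata-extractor. With CodingErrorAction.REPORT the decoder throws on invalid input instead of silently substituting characters, so the caller can tell that the charset was wrong.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class StrictDecoding {

    /** Returns the decoded text, or null if the bytes are not valid in the given charset. */
    static String decodeStrict(byte[] bytes, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // The input is not valid in this charset.
            return null;
        }
    }

    public static void main(String[] args) {
        // 'h', a-umlaut, 'u' encoded in ISO-8859-1: not a valid UTF-8 sequence.
        byte[] latin1Bytes = {0x68, (byte) 0xE4, 0x75};
        // The same three characters encoded in UTF-8.
        byte[] utf8Bytes = {0x68, (byte) 0xC3, (byte) 0xA4, 0x75};

        System.out.println(decodeStrict(latin1Bytes, StandardCharsets.UTF_8));
        System.out.println(decodeStrict(utf8Bytes, StandardCharsets.UTF_8));
    }
}
```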
Original comment by navaj072...@yahoo.de
on 1 Dec 2011 at 10:58
Did you already have a chance to look into this?
Original comment by navaj072...@yahoo.de
on 8 Dec 2011 at 4:15
I arrived back from holiday yesterday and spent some time on this today.
When I run the image you attached to this issue through metadata-extractor on
my PC, I see this (presumably correct) output:
[Iptc] Caption/Abstract = Das Encoding dieser Metadaten ist nicht deklariert
und lässt sich nur schwer erkennen.
[Iptc] Keywords = häufig;üblich;Lösung;Spaß
Unfortunately I did not receive the image you mentioned via email. Please try
sending it again to metadata-extractor at drewnoakes,com.
I found a few other sample images on the ImageMagick forums
(http://www.imagemagick.org/discourse-server/viewtopic.php?f=3&t=16586) and ran
those through metadata-extractor. They seem to produce the correct strings as
well.
[Iptc] Coded Character Set = ?%G
[Iptc] Enveloped Record Version = ?
[Iptc] Application Record Version = 4
[Iptc] Headline = "voilà l'été"
My system's file.encoding property is set to "UTF-8", which is different to
your "windows-1252".
The 'Coded Character Set' value is encoded in the ISO 2022 format, which is a
seemingly complex standard. Ideally I'd pass these values to a Java API and
have them take care of it for records 2-6 & 8, as stated in the specification.
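As a rough sketch of the simplest case: the "?%G" shown above is presumably ESC % G with the unprintable escape byte rendered as "?", and in ISO 2022 that sequence selects UTF-8. The helper below recognises only that one sequence and reports everything else as unknown. The names (CodedCharacterSetSketch, charsetFor) are invented for this example, and a real implementation would need to handle more of the standard.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; handles a single ISO 2022 escape sequence only.
final class CodedCharacterSetSketch {

    private static final byte ESC = 0x1B;

    /** Returns the charset selected by the escape sequence, or null if it is not recognised. */
    static Charset charsetFor(byte[] codedCharacterSet) {
        if (codedCharacterSet != null
                && codedCharacterSet.length == 3
                && codedCharacterSet[0] == ESC
                && codedCharacterSet[1] == '%'
                && codedCharacterSet[2] == 'G') {
            // ESC % G selects UTF-8.
            return StandardCharsets.UTF_8;
        }
        // Unknown or unsupported: the caller has to fall back to some default.
        return null;
    }

    public static void main(String[] args) {
        byte[] escPercentG = {0x1B, '%', 'G'};
        System.out.println(charsetFor(escPercentG));
        System.out.println(charsetFor(new byte[0]));
    }
}
```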
I think ideally the internal representation of the tag data within the
Directory class would need to be changed, as currently the conversion from byte
to value takes place during the reading of the file. It seems more flexible to
store byte[]'s within Directory and convert when values are read. In this way,
the caller can specify encodings. This is particularly true for Exif, where in
many cases the encoding is not specified in the file at all, yet can vary.
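Roughly, the idea could look like the sketch below. RawTagStore is a made-up stand-in and not the library's Directory class; the tag number is arbitrary. The raw bytes are kept as read from the file, and decoding happens only when a value is requested with a charset chosen by the caller.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a directory that stores undecoded tag values.
final class RawTagStore {

    private final Map<Integer, byte[]> rawValues = new HashMap<Integer, byte[]>();

    /** Stores a defensive copy of the bytes exactly as they appear in the file. */
    void setBytes(int tagType, byte[] value) {
        rawValues.put(tagType, value.clone());
    }

    /** Decodes the stored bytes with the caller's charset; returns null if the tag is absent. */
    String getString(int tagType, Charset charset) {
        byte[] raw = rawValues.get(tagType);
        return raw == null ? null : new String(raw, charset);
    }

    public static void main(String[] args) {
        RawTagStore store = new RawTagStore();
        int someTag = 1; // arbitrary tag number for the example
        // 'L', o-umlaut encoded in UTF-8
        store.setBytes(someTag, new byte[] {0x4C, (byte) 0xC3, (byte) 0xB6});

        // The same stored bytes can be read with different charsets.
        System.out.println(store.getString(someTag, StandardCharsets.UTF_8));
        System.out.println(store.getString(someTag, StandardCharsets.ISO_8859_1));
    }
}
```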
This approach would represent quite a significant change to the internals of
the library and will require some careful thought, planning and testing. I
will try to size this up as soon as possible.
I'll also try to devise an effective shorthand route, even if it only satisfies
IPTC data for now. If you have any ideas in this space, I'd be very happy to
hear them.
Also I'd like to create unit tests for these cases. These tests should behave
the same way in all environments (which the current implementation wouldn't
satisfy since it relies upon the system property 'file.encoding'). This seems
like a good place to start. Any help you could offer regarding such tests
would be appreciated.
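One possible starting point, sketched without any test framework and with made-up names (EncodingFixtureCheck, decodesTo): keep the fixture bytes in the test itself and always decode with an explicit charset, so the outcome does not depend on 'file.encoding'. The bytes below are the keyword "Spaß" from the output above, once in UTF-8 and once in Latin-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical fixture check; never touches the platform default charset.
final class EncodingFixtureCheck {

    // 'S', 'p', 'a', sharp s (U+00DF) encoded in UTF-8
    static final byte[] SPASS_UTF8 = {0x53, 0x70, 0x61, (byte) 0xC3, (byte) 0x9F};
    // The same word encoded in ISO-8859-1
    static final byte[] SPASS_LATIN1 = {0x53, 0x70, 0x61, (byte) 0xDF};
    static final String EXPECTED = "Spa\u00df";

    /** True if the raw bytes decode to the expected text in the given charset. */
    static boolean decodesTo(byte[] raw, Charset charset, String expected) {
        return expected.equals(new String(raw, charset));
    }

    public static void main(String[] args) {
        System.out.println(decodesTo(SPASS_UTF8, StandardCharsets.UTF_8, EXPECTED));
        System.out.println(decodesTo(SPASS_LATIN1, StandardCharsets.ISO_8859_1, EXPECTED));
        // Decoding with the wrong charset must not produce the expected text.
        System.out.println(decodesTo(SPASS_LATIN1, StandardCharsets.UTF_8, EXPECTED));
    }
}
```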
Original comment by drewnoakes
on 2 Jan 2012 at 7:06
Here are two more images:
* encoding-undefined-iso.jpg: no CodedCharacterSet defined, but umlauts encoded
in Latin-1
* encoding-defined-utf8.jpg: CodedCharacterSet=UTF-8, umlauts encoded in UTF-8
Original comment by navaj072...@yahoo.de
on 13 Jan 2012 at 2:10
I can imagine that supporting the CodedCharacterSet will be quite some work.
But I think it would already be a big step forward if the code no longer
depended on file.encoding.
I think you should just ignore this system property and instead allow a default
encoding to be specified in the API. That way the user would be able to control
the output.
Original comment by navaj072...@yahoo.de
on 13 Jan 2012 at 2:15
Hi,
I have some code that tries to work around this by manipulating the bytes
directly, if you want to take a look.
My main problem seems to be getting 'correct' IPTC data in the first place: we
use XnViewMP to edit the metadata, and although we configure UTF-8 as the
encoding, it looks like the system locale is what is actually used to
encode/decode.
I also try to guess the encoding, but the result is not satisfactory at the
moment. If someone knows a better way, I'd like to hear it.
The source code:
https://github.com/Silverpeas/Silverpeas-Components/blob/master/gallery/gallery-ejb/src/main/java/com/silverpeas/gallery/image/DrewImageMetadataExtractor.java
Original comment by emmanuel...@gmail.com
on 6 Apr 2012 at 6:54
Original comment by drewnoakes
on 16 Oct 2012 at 5:42
Any news on this? Not being able to get to the original data makes this quite
unusable; the only success I had was by manually setting file.encoding to
iso8859-1 and then using getString(..., "UTF-8") -- which demonstrates that this
method does not really work that well.
Original comment by axel.doe...@gmail.com
on 5 Aug 2013 at 12:51
Has someone got a good tip for this?
Original comment by opty...@gmail.com
on 19 Sep 2013 at 4:28
Changing the system file.encoding could be very dangerous. In the end, to solve
my French character problem, I changed IptcReader line 168 to
str = reader.getString(offset, tagByteCount, "ISO-8859-1");
It is not the best solution (the file encoding should be detected first), but
it works for me in most cases.
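A crude form of detection, sketched here with invented names (CharsetFallback, decode) and independent of IptcReader, would be to try strict UTF-8 first and fall back to ISO-8859-1 only when the bytes are not valid UTF-8. This is a heuristic: Latin-1 text that happens to form valid UTF-8 would be misread, although that is unlikely for ordinary accented text.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: strict UTF-8 first, ISO-8859-1 as the fallback.
final class CharsetFallback {

    static String decode(byte[] raw) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8. ISO-8859-1 maps every byte to a character, so this cannot fail.
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        // e-acute, 't', e-acute encoded in ISO-8859-1
        byte[] latin1 = {(byte) 0xE9, 0x74, (byte) 0xE9};
        // The same text encoded in UTF-8
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA9, 0x74, (byte) 0xC3, (byte) 0xA9};

        // Both inputs should yield the same string.
        System.out.println(decode(latin1).equals(decode(utf8)));
    }
}
```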
Original comment by tengma.c...@gmail.com
on 8 Jan 2014 at 9:33
This issue has been migrated along with the project to GitHub:
https://github.com/drewnoakes/metadata-extractor/issues/12
Original comment by drewnoakes
on 19 Nov 2014 at 1:09
This issue has now been closed. The fix will be included in 2.7.0, which is
being released this weekend.
Original comment by drewnoakes
on 5 Dec 2014 at 11:02
Original issue reported on code.google.com by
navaj072...@yahoo.de
on 30 Nov 2011 at 3:48