IPTC CodedCharacterSet not supported

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Run the metadata-extractor on an image with the CodedCharacterSet tag set 
(e.g. to "UTF-8") and field values encoded in UTF-8.

What is the expected output? What do you see instead?
Special characters (such as German umlauts) are broken, because the values are 
decoded using the system file encoding, which is set to windows-1252.

What version of the product are you using? On what operating system?
2.5.0-RC2 on Windows 7

Please provide any additional information below.
The IPTC CodedCharacterSet tag can be used to specify the encoding of the IPTC 
tag values. But the setting is currently ignored.

Original issue reported on code.google.com by navaj072...@yahoo.de on 30 Nov 2011 at 3:48

GoogleCodeExporter commented 9 years ago

Thanks for this report.  If the encoding is indeed specified in the 
CodedCharacterSet tag, then this sounds like a great one to fix.

Do you have a sample image that I could include in the test suite to reproduce 
this bug and ultimately fix it?

Original comment by drewnoakes on 30 Nov 2011 at 5:04

Changed state: Accepted
Added labels: Priority-High
Removed labels: Priority-Medium

GoogleCodeExporter commented 9 years ago

Actually, there are two issues that need to be addressed:
1. Images with CodedCharacterSet - the character set must be respected
2. Images without CodedCharacterSet (quite frequent) - in this case it must be 
possible to specify a default character set to use (and not just use the system 
property file.encoding)

Original comment by navaj072...@yahoo.de on 1 Dec 2011 at 10:07

GoogleCodeExporter commented 9 years ago

A sample image without CodedCharacterSet (the umlauts are written in UTF-8).

Original comment by navaj072...@yahoo.de on 1 Dec 2011 at 10:09

Attachments:

unknown_encoding.jpg

GoogleCodeExporter commented 9 years ago

A sample image with CodedCharacterSet was sent by personal email.

Original comment by navaj072...@yahoo.de on 1 Dec 2011 at 10:18

GoogleCodeExporter commented 9 years ago

A nice tool for editing IPTC (and other) metadata manually can be found at
  http://www.photome.de/

By the way: The behavior of the String constructor using a byte[] is 
unspecified when the given bytes are not valid in the given charset (see Java 6 
API doc). The CharsetDecoder seems to be more robust in this case.
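
To illustrate the point about CharsetDecoder, here is a minimal sketch (not code from the library) that decodes strictly and rejects bytes that are not valid in the given charset. The class and method names are made up for this example, and it assumes Java 7 or later for StandardCharsets.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class StrictDecode {
    /** Decodes bytes as UTF-8, or returns null if they are not valid UTF-8. */
    static String decodeUtf8OrNull(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting characters.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // "Loesung" with an o-umlaut, written with an escape so the source stays ASCII.
        byte[] utf8 = "L\u00f6sung".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "L\u00f6sung".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decodeUtf8OrNull(utf8) != null);  // prints true
        System.out.println(decodeUtf8OrNull(latin1));        // prints null: not valid UTF-8
    }
}
```

A strict decoder like this could also serve as a cheap check for "is this UTF-8 at all?" before falling back to another charset.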

Original comment by navaj072...@yahoo.de on 1 Dec 2011 at 10:58

GoogleCodeExporter commented 9 years ago

Did you already have a chance to look into this?

Original comment by navaj072...@yahoo.de on 8 Dec 2011 at 4:15

GoogleCodeExporter commented 9 years ago

I arrived back from holiday yesterday and spent some time on this today.

When I run the image you attached to this issue through metadata-extractor on 
my PC, I see this (presumably correct) output:

[Iptc] Caption/Abstract = Das Encoding dieser Metadaten ist nicht deklariert 
und lässt sich nur schwer erkennen.
[Iptc] Keywords = häufig;üblich;Lösung;Spaß

Unfortunately I did not receive the image you mentioned via email.  Please try 
sending it again to metadata-extractor at drewnoakes,com.

I found a few other sample images on the ImageMagick forums 
(http://www.imagemagick.org/discourse-server/viewtopic.php?f=3&t=16586) and ran 
those through metadata-extractor.  They seem to produce the correct strings as 
well.

[Iptc] Coded Character Set = ?%G
[Iptc] Enveloped Record Version = ?
[Iptc] Application Record Version = 4
[Iptc] Headline = "voilà l'été"

My system's file.encoding property is set to "UTF-8", which is different to 
your "windows-1252".

The 'Coded Character Set' value is encoded in the ISO 2022 format, which is a 
seemingly complex standard.  Ideally I'd pass these values to a Java API and 
have them take care of it for records 2-6 & 8, as stated in the specification.
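
As a rough sketch of the simplest case: in ISO 2022 the escape sequence ESC % G (bytes 0x1B 0x25 0x47) selects UTF-8, and the "?%G" printed above is presumably that sequence with the ESC byte shown as "?". The code below recognises only this one sequence; the class and method names are hypothetical, and every other ISO 2022 sequence is simply reported as unknown.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of the library.
class CodedCharacterSetSketch {
    private static final byte ESC = 0x1B;

    /** Returns UTF-8 for ESC % G, otherwise null (sequence unknown to this sketch). */
    static Charset charsetFor(byte[] codedCharacterSet) {
        if (codedCharacterSet != null
                && codedCharacterSet.length == 3
                && codedCharacterSet[0] == ESC
                && codedCharacterSet[1] == '%'
                && codedCharacterSet[2] == 'G') {
            return StandardCharsets.UTF_8;
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetFor(new byte[] {0x1B, '%', 'G'})); // prints UTF-8
        System.out.println(charsetFor(new byte[] {0x1B, '(', 'B'})); // prints null
    }
}
```

A null result would leave the caller free to fall back to some other charset rather than guessing.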

I think ideally the internal representation of the tag data within the 
Directory class would need to be changed, as currently the conversion from byte 
to value takes place during the reading of the file.  It seems more flexible to 
store byte[]'s within Directory and convert when values are read.  In this way, 
the caller can specify encodings.  This is particularly true for Exif, where in 
many cases the encoding is not specified in the file at all, yet can vary.
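
A minimal sketch of that idea, assuming a simplified stand-in for Directory (the class RawTagStore and its methods are invented for this example and are not the library's API): the raw bytes are stored as read, and decoding happens only when a value is requested.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a directory that keeps raw tag bytes.
class RawTagStore {
    private final Map<Integer, byte[]> rawValues = new HashMap<>();

    /** Stores a defensive copy of the bytes exactly as they were read from the file. */
    void setBytes(int tagType, byte[] value) {
        rawValues.put(tagType, value.clone());
    }

    /** Decodes at read time using the charset the caller chooses. */
    String getString(int tagType, Charset charset) {
        byte[] raw = rawValues.get(tagType);
        return raw == null ? null : new String(raw, charset);
    }

    public static void main(String[] args) {
        int tag = 1; // arbitrary tag number for the example
        RawTagStore store = new RawTagStore();
        store.setBytes(tag, "h\u00e4ufig".getBytes(StandardCharsets.UTF_8));

        // The same stored bytes give different strings depending on the charset requested.
        String asUtf8 = store.getString(tag, StandardCharsets.UTF_8);
        String asLatin1 = store.getString(tag, StandardCharsets.ISO_8859_1);
        System.out.println(asUtf8.equals("h\u00e4ufig"));   // prints true
        System.out.println(asLatin1.equals("h\u00e4ufig")); // prints false
    }
}
```

Because nothing is decoded while the file is read, a wrong guess about the encoding never destroys the original data.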

This approach would represent quite a significant change to the internals of 
the library and will require some careful thought, planning and testing.  I 
will try to size this up as soon as possible.

I'll also try to devise an effective shorthand route, even if it only satisfies 
IPTC data for now.  If you have any ideas in this space, I'd be very happy to 
hear them.

Also I'd like to create unit tests for these cases.  These tests should behave 
the same way in all environments (which the current implementation wouldn't 
satisfy since it relies upon the system property 'file.encoding').  This seems 
like a good place to start.  Any help you could offer regarding such tests 
would be appreciated.

Original comment by drewnoakes on 2 Jan 2012 at 7:06

GoogleCodeExporter commented 9 years ago

Here are two more images:
* encoding-undefined-iso.jpg: no CodedCharacterSet defined, but umlauts encoded 
in Latin-1
* encoding-defined-utf8.jpg: CodedCharacterSet=UTF-8, umlauts encoded in UTF-8

Original comment by navaj072...@yahoo.de on 13 Jan 2012 at 2:10

GoogleCodeExporter commented 9 years ago

I can imagine that supporting the CodedCharacterSet will be quite some work. 
But I think it would already be a big step forward if the code no longer 
depended on file.encoding.
I think you should just ignore this system property, and instead allow a 
default encoding to be specified in the API. That way the user would be able 
to control the output.
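
Something along these lines is what I have in mind. This is only a sketch with made-up names, not a proposal for the exact API: the charset declared in the file wins if present, otherwise the caller's default is used, and file.encoding is never consulted.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class DefaultCharsetSketch {
    /**
     * Uses the charset declared in the file if there is one, otherwise the
     * caller's default. The platform default (file.encoding) plays no part.
     */
    static String decode(byte[] raw, Charset declared, Charset callerDefault) {
        Charset charset = (declared != null) ? declared : callerDefault;
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        byte[] latin1 = "\u00fcblich".getBytes(StandardCharsets.ISO_8859_1);
        // No charset declared in the file, so the caller's default applies.
        String value = decode(latin1, null, StandardCharsets.ISO_8859_1);
        System.out.println(value.equals("\u00fcblich")); // prints true
    }
}
```

The important detail is that new String(byte[]) without a charset argument is avoided, since that overload is the one that depends on the platform default.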

Original comment by navaj072...@yahoo.de on 13 Jan 2012 at 2:15

GoogleCodeExporter commented 9 years ago

Hi,
I have some code that tries to work around this by manipulating bytes, if you 
want to take a look.
My main problem seems to be getting 'correct' IPTC data: we are using 
XnViewMP to edit the metadata, but it looks like, although we select UTF-8 as 
the encoding, it is the system locale that is actually used to encode/decode.
I also try to guess the encoding, but the result is not satisfying at the 
moment; maybe someone knows a better way.
The source code:
https://github.com/Silverpeas/Silverpeas-Components/blob/master/gallery/gallery-ejb/src/main/java/com/silverpeas/gallery/image/DrewImageMetadataExtractor.java

Original comment by emmanuel...@gmail.com on 6 Apr 2012 at 6:54

GoogleCodeExporter commented 9 years ago

Original comment by drewnoakes on 16 Oct 2012 at 5:42

Added labels: Component-Framework, Motive-Correctness

GoogleCodeExporter commented 9 years ago

Any news on this? Not being able to get to the original data makes this quite 
unusable; the only success I had was by manually setting file.encoding to 
iso8859-1, and then using getString(..., "UTF-8") -- which demonstrates that 
this method does not really work that well.
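
For anyone wondering why that workaround works at all: ISO-8859-1 maps every byte value 0x00-0xFF to exactly one character, so a string decoded that way can be turned back into the original bytes and decoded again. A small sketch of the round trip (the names are my own and nothing here is library code):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class Latin1RoundTrip {
    /** Recovers the original bytes from a Latin-1 decoded string and decodes them again. */
    static String redecode(String decodedAsLatin1, Charset actual) {
        byte[] original = decodedAsLatin1.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, actual);
    }

    public static void main(String[] args) {
        // Bytes as they would appear in the file: "Spass" with a sharp s, in UTF-8.
        byte[] inFile = "Spa\u00df".getBytes(StandardCharsets.UTF_8);
        // What you get when the reader decodes them as ISO-8859-1.
        String misdecoded = new String(inFile, StandardCharsets.ISO_8859_1);
        System.out.println(redecode(misdecoded, StandardCharsets.UTF_8).equals("Spa\u00df")); // prints true
    }
}
```

The trick relies on the first decoding step being lossless, which is why it depends on the value of file.encoding.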

Original comment by axel.doe...@gmail.com on 5 Aug 2013 at 12:51

GoogleCodeExporter commented 9 years ago

Has someone got a good tip for this?

Original comment by opty...@gmail.com on 19 Sep 2013 at 4:28

GoogleCodeExporter commented 9 years ago

Changing the system file.encoding could be very dangerous. In the end, to 
solve my French character problem, I changed IptcReader line 168 to str = 
reader.getString(offset, tagByteCount, "ISO-8859-1"); although it is not the 
best solution (the file encoding should be detected first), it works for me in 
most cases.

Original comment by tengma.c...@gmail.com on 8 Jan 2014 at 9:33

GoogleCodeExporter commented 9 years ago

This issue has been migrated along with the project to GitHub:

https://github.com/drewnoakes/metadata-extractor/issues/12

Original comment by drewnoakes on 19 Nov 2014 at 1:09

GoogleCodeExporter commented 9 years ago

This issue has now been closed and the fix will be included in 2.7.0, 
releasing this weekend.

Original comment by drewnoakes on 5 Dec 2014 at 11:02

Changed state: Done
