haraldk / TwelveMonkeys

TwelveMonkeys ImageIO: Additional plug-ins and extensions for Java's ImageIO
https://haraldk.github.io/TwelveMonkeys/
BSD 3-Clause "New" or "Revised" License

Allow EXIFReader to use other encodings than ASCII/UTF-8 #365

Open SvenBunge opened 7 years ago

SvenBunge commented 7 years ago

I'm trying to read a bunch of old TIFF files written on an old Windows system. All ASCII headers are encoded in "windows-1252"/"Cp1252", but EXIFReader doesn't provide any way to override the default encoding, UTF-8.

As far as I can see the issue is located in line 315 (Version 3.3.2):

return StringUtil.decode(ascii, 0, len, "UTF-8"); 

We could add two constructors to EXIFReader:

// Charset used to decode ASCII fields
private final Charset charset;

public EXIFReader() {
    this(StandardCharsets.UTF_8);
}

public EXIFReader(Charset charset) {
    this.charset = charset;
}

WDYT?

haraldk commented 6 years ago

Hi Sven,

Strictly speaking, a TIFF ASCII field can only ever have one encoding, and that is ASCII (7 bit ASCII). The encoder should have used BYTE or UNDEFINED to store anything else. I chose UTF-8 in the parser code somewhat arbitrarily, as it will also gracefully decode ASCII.
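
To illustrate why UTF-8 is a safe superset here, the following minimal sketch decodes the same bytes as US-ASCII and as UTF-8 and compares the results. The class name, method name, and sample values are made up for illustration and are not part of TwelveMonkeys.

```java
import java.nio.charset.StandardCharsets;

final class AsciiSubsetDemo {
    // Returns true when the bytes decode to the same string under US-ASCII and UTF-8.
    // This holds for any input consisting only of 7-bit values (0x00..0x7F).
    static boolean decodesIdentically(byte[] bytes) {
        String asAscii = new String(bytes, StandardCharsets.US_ASCII);
        String asUtf8 = new String(bytes, StandardCharsets.UTF_8);
        return asAscii.equals(asUtf8);
    }

    public static void main(String[] args) {
        byte[] pureAscii = "Canon EOS 5D".getBytes(StandardCharsets.US_ASCII);
        System.out.println(decodesIdentically(pureAscii)); // true

        // 0xC3 0xA9 is the UTF-8 encoding of U+00E9; it is not valid 7-bit ASCII
        byte[] nonAscii = {(byte) 0xC3, (byte) 0xA9};
        System.out.println(decodesIdentically(nonAscii)); // false
    }
}
```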

But of course, as we live in the real world, this is what we have to deal with. I've been thinking about adding a constructor parameter that gives more control over how the IFDs are parsed (such as which nested IFDs to parse), so it probably makes sense to add a non-standard encoding setting there as well.

New functionality will be added to the TIFFReader only, as the EXIFReader is now deprecated. Something like:

public TIFFReader(Options options) {
    ...
} 

public final class TIFFOptions extends Options {
    Charset charset = StandardCharsets.UTF_8;
    Set<Integer> subIFDs = new HashSet<>(Arrays.asList(TIFF.TAG_EXIF_IFD, TIFF.TAG_INTEROP_IFD));

    // setters/getters etc
} 

Best regards,

-- Harald K

garretwilson commented 3 years ago

This ASCII vs UTF-8 thing in Exif gets really gnarly. (See my extensive musings and research in GUISE-148.) The official Exif 2.32 specification defines the "ASCII" type as:

An 8-bit byte containing one 7-bit ASCII code. The final byte is terminated with NULL.

But the Metadata Working Group published its Guidelines for Handling Image Metadata, which say:

Exif string values SHOULD be written as UTF-8. However, clients MAY write ASCII to allow broader interoperability.

I note also that Windows 10 seems to support UTF-8 in Exif in Windows Explorer properties.

But UTF-8 is really easy to distinguish from ISO-8859-1: non-ASCII text in an eight-bit encoding almost never happens to form valid UTF-8 byte sequences, so if the bytes parse correctly as UTF-8, it's probably UTF-8. I recommend that you attempt to parse as UTF-8, and then fall back to ISO-8859-1 if there were any parsing errors. This would be >99% equivalent to supporting Windows CP1252. This is the same approach used in Java 9+ properties files. Then you get the intended UTF-8 value when present, but ISO-8859-1 still works correctly. And if it's only ASCII, you get ASCII.
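
A minimal sketch of that fallback, using only standard `java.nio.charset` classes. `FallbackDecoder` and `decode` are hypothetical names for illustration, not an existing TwelveMonkeys API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class FallbackDecoder {
    // Hypothetical helper: tries strict UTF-8 first, then falls back to ISO-8859-1.
    static String decode(byte[] bytes) {
        // REPORT makes the decoder throw on invalid input instead of
        // silently substituting U+FFFD replacement characters.
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strictUtf8.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // ISO-8859-1 maps every byte value to a character, so this never fails.
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {0x42, (byte) 0xFC, 0x72, 0x6F};            // "Büro" in ISO-8859-1
        byte[] utf8 = {0x42, (byte) 0xC3, (byte) 0xBC, 0x72, 0x6F}; // "Büro" in UTF-8
        System.out.println(decode(latin1).equals(decode(utf8)));    // true
    }
}
```

Pure ASCII input takes the UTF-8 path and comes out unchanged, so the fallback only matters for bytes above 0x7F.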

haraldk commented 3 years ago

@garretwilson

I think the problem isn't that it can't be done, but from a library standpoint, why 8859-1? Why not one of the dozen or so other ISO-8859-defined encodings? And how do you distinguish between them?
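
To make the ambiguity concrete, here is a small sketch showing that the same byte is valid in several eight-bit encodings but means different things in each, so the bytes alone can't tell you which one was intended. The class and method names are illustrative only.

```java
import java.nio.charset.Charset;

final class AmbiguousEightBitDemo {
    // Decodes the bytes with the named charset; every byte value is
    // accepted by both encodings used below, so neither decode ever fails.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] single = {(byte) 0xE6};
        // U+00E6 (LATIN SMALL LETTER AE) in ISO-8859-1
        System.out.println(decode(single, "ISO-8859-1"));
        // U+0107 (LATIN SMALL LETTER C WITH ACUTE) in ISO-8859-2
        System.out.println(decode(single, "ISO-8859-2"));
    }
}
```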

Java property files were defined to be ISO-8859-1 from the start, so using it as a fallback there is an obvious choice, as that's the only encoding (except ASCII) it can be. TIFF/Exif has never used this encoding in any spec, only in strictly incorrect implementations.

I still think it's best to leave this to client code (which might know things like "imported files are all from a Windows computer using CP1252"). But perhaps the library could just expose the non-decoded bytes, in addition to or instead of the incorrectly decoded strings.
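
As a rough sketch of what that could look like on the client side: given the raw field bytes (how they would be obtained from the library is left open here, and all names below are hypothetical), the client decodes with the charset it knows. The second method shows why the raw bytes matter: once invalid bytes have been leniently decoded as UTF-8, the original values are gone.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ClientSideDecoding {
    // Assumption: the client knows its files were written on a windows-1252 system.
    private static final Charset KNOWN_CHARSET = Charset.forName("windows-1252");

    // Decodes raw, non-decoded field bytes with the client's known charset.
    static String decodeRaw(byte[] rawFieldBytes) {
        return new String(rawFieldBytes, KNOWN_CHARSET);
    }

    // Returns true if the bytes can be recovered from a lenient UTF-8 decode.
    // Invalid sequences become U+FFFD, so non-UTF-8 input does not survive.
    static boolean survivesUtf8RoundTrip(byte[] raw) {
        String lenient = new String(raw, StandardCharsets.UTF_8);
        return Arrays.equals(raw, lenient.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] raw = {0x43, 0x61, 0x66, (byte) 0xE9}; // "Café" in windows-1252
        System.out.println(decodeRaw(raw).equals("Caf\u00E9")); // true
        System.out.println(survivesUtf8RoundTrip(raw));         // false
    }
}
```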

-- Harald K

garretwilson commented 3 years ago

Java property files were defined to be ISO-8859-1 from the start, so using it as a fallback there is an obvious choice …

Ah, @haraldk, you make a good point. You're right, in this situation it is less likely that ISO-8859-1 would be the correct eight-bit encoding. Java properties files, as you noted, were originally defined to use ISO-8859-1.

I guess my bigger fear was that you might switch to only supporting ASCII, so I'm glad that's not the case. 😄

haraldk commented 3 years ago

No worries, I'll continue to use UTF-8. 😀

-- Harald K