Write IPTC-NAA metadata.

jasonwpalmer commented 9 years ago

If you open a TIFF image in Photoshop and then open the File Info dialog, you get a panel displaying the File Info associated with the TIFF image. Supposedly, all of this information is derived from the IPTC-NAA Photoshop IRB. If you click Raw Data, you see the XMP XML document. I am assuming this document is what makes up the IPTC-NAA Photoshop IRB.

Am I correct?

The goal for this request is to be able to read in a TIF source file and then supply some added metadata to be embedded within the IPTC-NAA block and then write a new TIF image with the added metadata.

Again, I would attach a TIF to open in Photoshop, but it still looks like GitHub only supports PNG, GIF, or JPG - crazy huh?

haraldk commented 9 years ago

Arrright,

Sooo... There's XMP involved. It might actually be easier to just use Adobe's XMP toolkit. It's actually (believe it or not) BSD licensed. At least have a look at it.

Anyway, the XMP data is in a separate IRB (see Image resource IDs and the com.twelvemonkeys.imageio.plugins.psd.PSDXMPData class). And, there's some interesting duplication of the TIFF/Exif and IPTC metadata in the XMP. I think it's using some kind of hash/digest to test for integrity (ie. if one is changed without the other). We have to keep these in sync I think. But still very doable.

Re: GitHub attachments, yes, agree. I've asked GitHub for this functionality, and it's on their list. But they don't see it as an important enough feature I guess. Feel free to ask for it, you too.

Anyway, just rename your .tif files _tif.jpg and the quite lame GitHub attachment filter is defeated. ;-)

Harald K

jasonwpalmer commented 9 years ago

Yes, I read the docs and downloaded Adobe's XMP Toolkit. My boss is hesitant to use an Adobe library because he thinks it might be a bit heavy. But I am going to take a look. Honestly - it would be nice to take a look and then add support for Writing XMP to 12 monkeys. I will supply some before and after TIFs so we can compare the metadata before and after adding the fields. And we can see if both the XMP and IPTC-NAA blocks change. A hash huh that checks that both fields are in sync? - How on earth did you figure that out? :)

jasonwpalmer commented 9 years ago

Maybe I misunderstood. You are saying that there is duplication between the EXIF and the XMP.

When you say TIFF/EXIF are you referring to Photoshop Image Resource 1028?

The docs say that Photoshop stopped using IPTC-NAA with CS5 (Image Resource ID#1028). I assumed at this point they switched exclusively to XMP (ID#1060) and no longer used IPTC-NAA. But you are suggesting that both are being updated and they need to be kept in sync.

Or are you suggesting another block (Somewhere else in the EXIF not IPTC-NAA ID#1028) needs to be kept in sync with the XMP block (ID#1060)?

jasonwpalmer commented 9 years ago

OK - so from what I am seeing (And I am using Photoshop SAAS so I have the latest) both #1028 and #1060 are updated by entering info in the File Info dialog in Photoshop.

I think I understand what you are saying and I am seeing it happen.

Very interesting - and stupid really. Maybe somebody should mention to those Adobe guys that they shouldn't duplicate data - ever. :) (I also just noticed how lightweight the Adobe XMP library is :))

haraldk commented 9 years ago

Can't say how good or "heavy" the Adobe XMP implementation really is, however the spec is quite heavy, so implementing it completely will take some time... But we might be able to at least implement what is needed for your use case, and then build on it from there.

The above was just from the top of my head (with my daughter pulling my arm and asking 1097 times if I was ready yet while typing...), so the details may not be 100% correct, but at least it should be in principle.

The reason for the data duplication is mainly backwards compatibility, I think. If you look at the PSD format, there's a lot of data duplication. However, you can still open files from new versions in older versions of PS, and it will use the older data. This is also why we have to keep fields in sync. But yes, in many cases revolution is easier (or cheaper) than evolution. ;-)

Do you have some sample before/after files for me? If you don't like the renaming strategy, feel free to use normal email, or share via DropBox.

Harald K

jasonwpalmer commented 9 years ago

Here are a before and after adding keywords...

I will have to find a few more. Tuesday I can get 5 recent before and afters.

marketing-keywords_tif marketing-src_tif

haraldk commented 9 years ago

All right!

Here's what I've found so far: The "marketing-keywords.tif" is slightly mangled (seems it didn't survive the byte order change, but that is likely Adobe's fault), did you use icafe TIFFTweaker by any chance? ;-)

Also, there's quite a bit of duplication.

TIFF/Exif:
This is the container structure of the file (fields here were changed in "marketing-keywords.tif").
Contains the following interesting tags:

305/Software: Adobe Photoshop CC (Macintosh) (ASCII)
306/DateTime: 2013:12:05 19:02:45 (ASCII)
33432/Copyright: Copyright 2009 by itemMaster.com  All Rights Reserved (ASCII)
...

700/XMP
    The XMP data contains some interesting fields, duplicated from the above, 
    plus most fields from the IPTC data. These fields seem unchanged between the files.

    dc:rights: Copyright 2009 by itemMaster.com  All Rights Reserved (String)**
    photoshop:City: Skokie (String)*
    photoshop:Country: United States of America (String)*
    photoshop:Instructions: This image may not be copied, altered, reproduced by any means, or transferred without written permission of itemmaster.com (String)*
    photoshop:State: IL 60077 (String)*
    xap:CreateDate: 2013-12-04T08:35:17-06:00 (String)***
    xap:CreatorTool: Adobe Photoshop CC (Macintosh) (String)†
    xap:ModifyDate: 2013-12-05T19:02:45-06:00 (String)†
    ...

33723/IPTC
    The IPTC fields seem unchanged between the files.

    2:40/Instructions: This image may not be copied, altered, reproduced by any means, or transferred without written permission of itemmaster.com (String)
    2:62/DigitalCreationDate: 20131204 (String)
    2:63/DigitalCreationTime: 083517-0600 (String)
    2:90/City: Skokie (String)
    2:95/StateProvince: IL 60077 (String)
    2:101/Country: United States of America (String)
    2:116/CopyrightNotice: Copyright 2009 by itemMaster.com  All Rights Reserved (String)

34377/Adobe
    PSD resources. Nothing of interest, so far.

37724/ImageSourceData
    This block is HUGE (2407912 bytes, 1/3 of the file), 
    and contains PSD data (8BIM) in TIFF container's byte order... 
    I haven't been able to parse it yet, as it is different from the 34377 structure, 
    but looking at it in a hex viewer, it seems to contain some interesting info.

*) Mapped from same field in IPTC
**) Mapped from same field in both IPTC and TIFF/Exif
***) Mapped from the two date/time fields in IPTC
†) Mapped from the corresponding field in TIFF/Exif

Oh... And the mapping between IPTC and XMP seems to be standardized as well..
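To make the date mapping in the listing above concrete (the two IPTC date/time fields combining into xap:CreateDate), here is a minimal Java sketch using only java.time. The class and method names are made up for illustration and are not part of TwelveMonkeys, and it assumes the IPTC time always carries a ±HHMM offset, as it does in the sample file.

```java
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.OffsetTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of the TwelveMonkeys API.
final class IptcDateSketch {
    // IPTC 2:63 style time: HHmmss followed by a +HHMM / -HHMM offset (assumed always present).
    private static final DateTimeFormatter IPTC_TIME = DateTimeFormatter.ofPattern("HHmmssxx");

    /** Combines an IPTC 2:62 date (yyyyMMdd) and 2:63 time into an ISO 8601 date-time string. */
    static String toXmpDate(String iptcDate, String iptcTime) {
        LocalDate date = LocalDate.parse(iptcDate, DateTimeFormatter.BASIC_ISO_DATE);
        OffsetTime time = OffsetTime.parse(iptcTime, IPTC_TIME);
        OffsetDateTime combined = date.atTime(time);
        return combined.format(DateTimeFormatter.ISO_OFFSET_DATE_TIME);
    }
}
```

With the values from the sample, toXmpDate("20131204", "083517-0600") gives 2013-12-04T08:35:17-06:00, which matches the xap:CreateDate shown in the XMP block.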

Regards,

Harald K

jasonwpalmer commented 9 years ago

Apologies.

I had a very recent TIFF marketing image after I added 2 "Keyword" and a "Source" attribute in the File Info dialog in Photoshop, but I couldn't find the before TIFF. So I used an older before and after that I was unsure of - and quite possibly I ran TIFFTweaker on it :). In fact, I am almost positive I did something to it because Photoshop now refuses to open it.

I appreciate you spending the time to check them out - so I apologize for not giving you a better sample.

I have attached a new TIFF - Photograph was recently taken by an itemMaster Photographer and then subsequently edited in Photoshop CS5 by an Editor at itemMaster. Then I opened up the edited TIFF and added KeyWords and Source via the File Info dialog, however now I am using Photoshop 2015 - but this is the software ultimately adding the 2 File Info fields.

Keywords: 12 Monkeys, HaraldK Source: HaraldK

Hopefully this is a better starting point. I am surprised to see that we are inserting...

Maybe I should lay off the Adobe guys...

mktg-after_tif mktg-b4_tif

haraldk commented 9 years ago

Okay, good stuff!

I can find the source and keywords in both IPTC and XMP (IPTC source is photoshop:source in XMP, IPTC keywords is dc:subject, both as expected).

Also the TIFF DateTime tag is reflected in the XMP as xap:ModifyDate.

BTW: Here's a better document, describing the IPTC to XMP mapping.

So, to update we need to update both structures. But that should be very doable.
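As a rough illustration of keeping the two structures aligned, the sketch below holds the IPTC to XMP pairs seen in this thread in a simple lookup table. Everything here is hypothetical naming, not TwelveMonkeys API. The dataset numbers 2:25 (Keywords) and 2:115 (Source) are taken from the IPTC IIM spec rather than from the dumps above, and multi-valued fields such as keywords (a bag in XMP) are flattened to a single string for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical lookup table; only covers the fields discussed in this issue.
final class IptcXmpMappingSketch {
    static final Map<String, String> IPTC_TO_XMP = new LinkedHashMap<>();

    static {
        IPTC_TO_XMP.put("2:25", "dc:subject");              // Keywords
        IPTC_TO_XMP.put("2:40", "photoshop:Instructions");  // Instructions
        IPTC_TO_XMP.put("2:90", "photoshop:City");          // City
        IPTC_TO_XMP.put("2:95", "photoshop:State");         // StateProvince
        IPTC_TO_XMP.put("2:101", "photoshop:Country");      // Country
        IPTC_TO_XMP.put("2:115", "photoshop:Source");       // Source
        IPTC_TO_XMP.put("2:116", "dc:rights");              // CopyrightNotice
    }

    /** Returns the XMP view of the given IPTC fields; unmapped IPTC fields are skipped. */
    static Map<String, String> toXmp(Map<String, String> iptcFields) {
        Map<String, String> xmp = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : iptcFields.entrySet()) {
            String xmpName = IPTC_TO_XMP.get(field.getKey());
            if (xmpName != null) {
                xmp.put(xmpName, field.getValue());
            }
        }
        return xmp;
    }
}
```

Driving both writers from one table like this means a change to, say, Source is written to IPTC and XMP from the same value, which is the "update both structures" idea in a nutshell.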

PS: Would be nice to have an image with updated copyright, to see whether TIFF, IPTC and XMP were all modified as I expect.

Harald K

jasonwpalmer commented 9 years ago

Great. I skimmed over the spec yesterday. Trying to understand how the xmp is serialized and the xmp data structures and such - the idea of padding might get a bit difficult, but I think it will be fun and a challenge to implement.

I need to track down where the copyright is coming from. It looks like it came from Photoshop so it may very well be that the editor that added the Clipping Path is using a script or macro - or however they do it in Photoshop to add the Copyright. I will speak to the Department head on Tuesday and see if they have all editors use a common script or what exactly is going on with that.

Obviously, this has little to do with this request - but now I need to check this out because it's wrong and shouldn't be happening.

Also - you'll notice that I think out loud sometimes. And I understand that we have quite a Time-Zone gap between us (I'm in Chicago, IL Central-Standard-Time). Most of the time - I'll answer my own questions if you give me a few hours. So please don't feel obligated to answer every time I post something - I know how busy I am and I'm guessing it's at least the same for you. However, that being said - I'll completely obsess over a problem until I can find the answer. With a little guidance - we can use that to our advantage here. :)

jasonwpalmer commented 9 years ago

OK - Copyright should be correct now. It was my fault again.

To Review:
1. Photograph taken recently in itemMaster studio.
2. Lightroom is used to convert from camera raw to TIFF.
3. Photoshop CS6 used to edit the image (add a Clipping Path).
4. Then I opened the edited TIFF with Photoshop 2015 and added the same Keywords and Source as last time.

tues-2_tif tues-2-after_tif

jasonwpalmer commented 9 years ago

I just realized that writing a new image might not work for my use-case. I realized that there is a bunch of metadata that I will lose - including the Clipping Path (unacceptable). So can we do this by updating the metadata in place? I know you said it could be done, but I also understand that this adds complexity and challenges with offsets, etc. What do you think?

haraldk commented 9 years ago

We'll have to make sure we pass any metadata along to the new image, so we don't lose any information. So I don't think it will matter much if we write a new image or modify an existing. Might be easier though, to update it in place. At least the XMP can be, because it's usually padded with a lot of white space at the end that we can simply overwrite. I've also been thinking about appending new TIFF fields at the end of the file if needed, to avoid rewriting the file (just modifying some pointers). I think it should be possible.

In any case, here's a list of tasks that need to be done (I might open separate issues for these, and keep this issue as a "parent" issue):

Read all IPTC data (currently, only application records are read), so we can write updated versions back without losing information.
Write the IPTC data structure back.
Write XMP data (or just implement something using the Adobe toolkit).
Create API that allows us to modify TIFF, IPTC and XMP in one go.

I'll start on the IPTC part. :-)

Harald K

jasonwpalmer commented 9 years ago

That sounds great. I'll see if I can learn more about the XMP Toolkit so I can make myself useful. I'm sure we could serialize and write the XMP, but I am not against using something that works and is free to use.

So do you want to introduce a dependency on Adobe for the XMP stuff? - or would we just use the toolkit in the API that is created?

Not quite sure how you see this fitting into the 12 Monkeys library overall, seems like you could go a few ways with it.

Also, a while back I needed to resize a TIFF and resize the Clipping Path (I did it with a bit of a hack using apache's ExifRewriter). Anyway - I believe if your goal is to make the TIF metadata completely writable - the algorithm should come in handy and save you some time - it definitely works. Somehow I thought the Path would be relative and wouldn't need to change if the dimensions changed - that is not correct. Even though the Path points are expressed relative to width and height, they still need to be recalculated if the image is resized. You probably could have told me that, but I had to learn the hard way. Let me know if you think this would be useful.

jasonwpalmer commented 9 years ago

Alright - I think I can/should write the XMPWriter. Unless I am grossly underestimating what needs to be done.

I need to build up a DOM representation (I'm thinking in memory because it shouldn't get too large - unless you think otherwise.) So write models for the different XMP datatypes that can independently build their own DOM representations. Then merge them and transform them into a UTF-8 String DOM document that can be written to the 700 block. Start with the common Namespaces, but make it easy to add more properties as we go. Then we/you :) can figure out how to sync them later. So I guess I'm saying that I would like to try to write the XMP XML stuff using standard APIs. Forget Adobe XMP Toolkit. Do you think that is worthwhile?

haraldk commented 9 years ago

Good stuff!

I'm taking the kids to the grandparents for the weekend, so I won't be able to do anything more before Monday. Have done some necessary changes to IPTC reading, started work on IPTC write support.

It should be possible to create XML directly from the XMPDirectory/XMPEntry instances, without first building a DOM. But do it the way you think is best. We can always optimize and modify things later, should it not work optimally.

I think you should build a single DOM however, and serialize that to a byte[] (or perhaps directly to a ImageOutputStream), using UTF-8. The com.twelvemonkeys.xml.XMLSerializer class can be used to write the DOM.
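For reference, here is a minimal sketch of the "single DOM, serialized to a byte[] as UTF-8" idea. It deliberately uses the JDK's own javax.xml.transform serializer instead of com.twelvemonkeys.xml.XMLSerializer, since that class's exact signatures aren't shown in this thread. The class name and the single photoshop:Source property are illustrative assumptions only.

```java
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch using JDK XML APIs only (not the TwelveMonkeys serializer).
final class XmpDomSketch {
    static final String NS_X = "adobe:ns:meta/";
    static final String NS_RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String NS_PHOTOSHOP = "http://ns.adobe.com/photoshop/1.0/";

    /** Builds x:xmpmeta > rdf:RDF > rdf:Description with a single photoshop:Source child. */
    static Document buildDocument(String source) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();

            Element meta = doc.createElementNS(NS_X, "x:xmpmeta");
            Element rdf = doc.createElementNS(NS_RDF, "rdf:RDF");
            Element description = doc.createElementNS(NS_RDF, "rdf:Description");
            description.setAttributeNS(NS_RDF, "rdf:about", "");

            Element sourceElement = doc.createElementNS(NS_PHOTOSHOP, "photoshop:Source");
            sourceElement.setTextContent(source);

            description.appendChild(sourceElement);
            rdf.appendChild(description);
            meta.appendChild(rdf);
            doc.appendChild(meta);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build XMP DOM", e);
        }
    }

    /** Serializes the DOM to UTF-8 bytes without an XML declaration. */
    static byte[] serialize(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize XMP DOM", e);
        }
    }
}
```

The resulting byte[] can then be written to whatever stream the TIFF writer uses, and its length is also the number the writer needs for the 700 field.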

One thing I thought about long time ago, was to actually keep the DOM representation used when reading in the XMPDirectory, and mutate on that, rather than keeping my own objects (the XMPEntrys). At least worth looking into. That way serializing it would be super easy.

Harald K

jasonwpalmer commented 9 years ago

OK - so yeah, I think I was definitely overthinking this. I was thinking that we would have to validate all XML and make sure that it adheres to the data structures defined in the XMP spec, but after reading your response - I think this is incorrect (thinking in terms of an API) - we should accept any XML that the client code gives us. It'll simply be a chunk of XML in XMPEntry. The DOM API will handle making sure the XML is well-formed and all, but as far as validating everything and inspecting all the properties to guarantee they adhere to the spec - this is where I was going overboard, I think anyway. :) Of course there will be code to check proper namespaced property names, but I won't go much further with validating the XML itself.

If the client passes bad xml the DOM will throw an exception and the client will need to pass better parameters - not our problem.

If the client passes xml that is well-formed, but the xml itself doesn't adhere to the spec - then we are in the same boat right? The client again needs to pass better parameters - not our problem.

That being said - yes, I think I have a pretty good handle on how you wrote XMPReader.

You already figured Entrys would be XML Nodes.
You have a CompoundDirectory that can hold multiple RDF Fragments. This makes sense because it looks like it is possible to encounter multiple Fragments in a TIFF stream.

Question... XMPScanner: Isn't it possible to skip through the TIFF stream missing the image data based on the IFDs and offsets that you encounter? I ask because it looks like XMPScanner reads every byte? Do we have to do that?

Other than that - I'm working it out. I'll keep posting as I make progress so you can intervene if you think I'm off course. Have fun with the kids this weekend. I'm actually taking all the kids up to their grandparents next weekend.

Jason

jasonwpalmer commented 9 years ago

So I think my approach is wrong.

We should take in the parameters, but as a part of the parsing process - we ONLY build what we know to be a VALID XMP Fragment. Anything less gives us an XMPWriter that might be better named XMPGarbageWriter. So instead of using the XML passed in - we inspect it for supported XMP properties and build a conforming XMP Fragment.

I might be missing the obvious, but it doesn't appear that I can use XMLSerializer as is. I would need access to an OutputStream and ImageOutputStream won't help much. Or so it seems anyway. You can probably tell me what I'm missing.

So this is what I am working on so far... XMPWriter

1. Adds a few constant XMP properties such as: xmp:ModifyDate, xmp:MetadataDate, xmpMM:InstanceID, xmpMM:DocumentID, xmpMM:OriginalDocumentID, x:xmpmeta, and optionally an xmp:CreateDate if one is not passed in as metadata.
2. Parses incoming XMPEntry XML String fragments. I'll inspect the fragment and if I find what I believe to be a supported XMP property - then I'll add it to the rdf:Description Node. If I don't recognize the property - I won't allow it through to the finished DOM.
3. Will not allow client code to override any of the defaults (1.) and will remove them silently if found (and in conflict).
4. Write the Header (<?xpacket begin=)
5. Write the Serialized DOM
6. Write 2K to 4K of \u0000 As Padding.
7. Write the Packet Ending (<?xpacket end='w'?>)

I think this is better than allowing the client to pass garbage :)
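The header / serialized DOM / padding / packet ending steps listed above could look roughly like the sketch below. It is only a sketch under assumptions: the class name is invented, and it pads with spaces (plus a newline about every 100 bytes) rather than \u0000, because the XMP spec asks for XML-compatible whitespace as padding. The id value in the header is the fixed string the XMP packet wrapper uses.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical packet wrapper, not the XMPWriter discussed in this thread.
final class XmpPacketSketch {
    static final String HEADER = "<?xpacket begin=\"\uFEFF\" id=\"W5M0MpCehiHzreSzNTczkc9d\"?>\n";
    static final String TRAILER = "<?xpacket end=\"w\"?>";

    /** Wraps already-serialized XMP (UTF-8) in an xpacket with the given amount of padding. */
    static byte[] wrap(byte[] serializedXmp, int paddingBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        byte[] header = HEADER.getBytes(StandardCharsets.UTF_8);
        out.write(header, 0, header.length);

        out.write(serializedXmp, 0, serializedXmp.length);
        out.write('\n');

        // Whitespace padding, with a newline roughly every 100 bytes.
        for (int i = 1; i <= paddingBytes; i++) {
            out.write(i % 100 == 0 ? '\n' : ' ');
        }

        byte[] trailer = TRAILER.getBytes(StandardCharsets.UTF_8);
        out.write(trailer, 0, trailer.length);
        return out.toByteArray();
    }
}
```

Something like wrap(xmlBytes, 2048) would give the 2K of padding mentioned above; end="w" marks the packet as writable so a later tool may edit it in place.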

Also thinking about a MetadataXMPSync class that can handle the different directory types from TIFFImageWriter and make sure the properties are in sync as required by the spec. What do you think?

jasonwpalmer commented 9 years ago

I'm thinking too hard.

I am going to expect one of the following:

Either a single XMPEntry with a single rdf:Description element, optionally wrapped in a single x:xmpmeta element, to support shorthand. The Entry will be keyed to inform the XMPWriter that it has shorthand attributes.
OR multiple XMPEntries, each with a single XML fragment that can be parsed as a child Element of rdf:Description, using the recommended prefixes as stated in the docs.
OR no XMP is passed.

THEN I will parse as such and build an in-memory representation of the incoming XML document. I will then build an outgoing XML document complete with rdf:Description Element and defaults using the recommended prefixes - then I'll traverse the incoming XML if it has been found and merge it with the outgoing rdf:Description Element that I have built in memory. I will do some inspection to make sure the found shorthand or disparate chunks of XML are in compliance with the spec.
THEN I'll wrap it as per the spec and add padding.
THEN I'll write it out.

I think this will give us a decent base to start from. And as always, and my favorite part, we can improve it as needed from there.

The problem with my approach above is that it doesn't allow for the client code to pass the XML in a naturally hierarchical fashion - instead, they need to either pass 1 chunk with Short-Hand attributes or separate chunks with a restricted prefix. It works, but it is very limiting for the client. I think this can be overcome by improving it to include the ability to pass a single XMPEntry with an entire XML Document embedded. This way the client can add their own XML with their own namespacing - including any custom or extended XMP XML - but round one won't support this.

This is where I am at (today anyway).

jasonwpalmer commented 9 years ago

I don't know if this is a bug or not, but I noticed in EXIFEntry you have: case EXIF.TAG_DIGITAL_ZOOM_RATIO: return "DigitalZoomRation"; Is it supposed to be returning DigitalZoomRation and not DigitalZoomRatio?

haraldk commented 9 years ago

Hi Jason,

Sorry for late reply. I'll try to answer (at least some of) your questions. :-)

You are absolutely right, it should be "DigitalZoomRatio" (fixed, will push later, along with some other minor stuff). Good catch!
The XMPScanner isn't of use in a TIFF file, as we know exactly where it starts after parsing the TIFF structure (the XMP tag points to it). It's just there to (possibly) allow inline editing of XMP, without having to know the container format.
In my head, the API clients should not have to deal with XML, nodes or attributes. It should only deal with XMPEntries (the XMPReader/XMPWriter should hide all XML parsing/DOM, or perhaps leave the DOM in the XMPDirectory/XMPEntry). This might be what you are thinking too, but it's not completely clear to me at the moment. :-)
About padding: If we edit the file "in place" (and the xpacket is "w", for writable, which I think it always is for TIFF) we should probably not add padding (but just overwrite the content, and ~~hope~~ make sure we don't write outside the padding area, I think that is what it is for). If we write a new file, we can add a default amount of padding.
About validation: Yes, we want some kind of validation. But also not too strict, as we're unlikely to implement the entire spec and in a future compatible way... So, I'd say leave it to a minimum for now. Perhaps look into validating against the XMP XML schema in the future.
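To illustrate the "in place" padding point, here is a small sketch of fitting new XMP into the byte range the old packet occupied: the new content goes at the start, the packet ending stays at the very end, and whatever is left in between is filled with spaces. The class and method names are assumptions for the example; the caller is assumed to know the original packet length from the TIFF field.

```java
import java.util.Arrays;

// Hypothetical helper for overwriting an XMP packet without moving anything else in the file.
final class XmpInPlaceSketch {
    /**
     * Returns exactly originalLength bytes: the new packet body, space padding, then the trailer.
     * Throws if the new content does not fit, so we never write outside the original area.
     */
    static byte[] fitToOriginalSize(byte[] newBody, byte[] trailer, int originalLength) {
        int padding = originalLength - newBody.length - trailer.length;
        if (padding < 0) {
            throw new IllegalArgumentException(
                    "New XMP needs " + (-padding) + " more bytes than the original packet has");
        }

        byte[] result = new byte[originalLength];
        Arrays.fill(result, (byte) ' ');
        System.arraycopy(newBody, 0, result, 0, newBody.length);
        System.arraycopy(trailer, 0, result, originalLength - trailer.length, trailer.length);
        return result;
    }
}
```

If the exception is thrown, the fallback would be the other path described above: write a new file (or relocate the field) with a fresh default amount of padding.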

Regards,

Harald K

jasonwpalmer commented 9 years ago

Wow - I didn't realize that you are already completely parsing and reading the XMP in XMPReader. I should have started there.

I don't see why I would have to change anything. Or why I would change anything in terms of what/how to parse XMPEntry, XMPDirectory, or RDFDescription.

I wasn't paying close enough attention when you said the building blocks are all in place. I don't plan on changing a thing. :) (While I did want to rename RDFDescription to Resource - because that is really what it seems to represent.)

I was able to parse the XMPDirectory returned from XMPReader.read(stream) and build the correct DOM. I believe that is the hardest part - so now I am just going to go over it and add some minor validation - like making sure dates are dates, etc. - and finish it up over the next day or two.

Harald - your library kicks ass. :) It is brilliantly simple.

haraldk commented 9 years ago

Hehe.. ;-)

Okay, sorry, I thought you knew about those classes... But, yes, I think that really is most of what you need. It's basically the writing part that is missing. Also, I think the XMPDirectory is immutable at the moment. We might need a separate class for mutable directories.

Good stuff anyway! Looking forward to see the result!

Harald K

haraldk commented 9 years ago

Okay,

Just pushed a few changes. You want to sync as soon as possible. Sorry. ;-)

Important bits: An abstract MetadataWriter class for your XMPWriter to extend. And a companion abstract test case, that you should extend (not doing much at the moment, but still useful).

Also, a crude version of IPTC writing is in there, along with lots of minor changes and fixes.

Harald K

jasonwpalmer commented 9 years ago

No worries. You did tell me.

I suppose at some point you could have said, "Hey knucklehead, stop trying to re-invent the wheel." :)

But I am thrilled that I spent a little time getting to know the metadata library, and the TIFF plugin.

I wrote a Servlet to replace ImageMagick (thank you for the Listener by the way - it allowed us to remove the libs under tomcat and properly register our ImageIO SPI classes.) itemMaster constantly has the need to edit/add/remove metadata and up until now, we were using ExifRewriter. Now I can remove Apache Commons Imaging from my TIFFWriter and use the metadata library along with your TIFF plugin.

Yes, Good Stuff!

jasonwpalmer commented 9 years ago

Harald, I am having a hard time figuring out where/how to record data size. It seems there is no existing facility to do this (with XML considering we are not storing a byte[] or simple type that can be counted like ExifWriter does to computeDataSize). So should I simply have the reader add an entry to the Directory recording size and then just ignore that entry when writing? It just starts to introduce problems because ExifWriter needs to be informed how big the 700 block will be and it cannot do that the way it normally calculates size (referring to ExifWriter) because XMPEntrys will only have raw values that are not directly inserted and therefore can't be calculated beforehand. What do you think? Do I just add an entry with size in XMPReader?

haraldk commented 9 years ago

The simplest thing that could possibly work (that I can think of just now), would be to serialize the entire XMPDirectory (along with any subdirectories) using the XMPWriter to a byte[] (or a ByteArrayOutputStream), then get the size from that (and throw the serialized result away).

We could later add a cache of the result or something, but as long as the serializing is stable (several invocations of XMPWriter.write(...) will return same result), this should be safe.

I think we'll need to handle this as a special case in EXIFWriter.computeDataSize(Directory) (or getCount(Entry)).

The EXIFWriter needs to know the exact number of bytes up front, to minimize back and forth seeking in the stream...
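In code, the "serialize, measure, throw away" idea might look like the sketch below. The StreamWriter interface is a stand-in invented for this example (the real XMPWriter/EXIFWriter signatures aren't shown here); the only point is that the size comes from a trial write into a ByteArrayOutputStream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical sketch; StreamWriter stands in for whatever actually serializes the directory.
final class SerializedSizeSketch {
    interface StreamWriter {
        void writeTo(OutputStream out) throws IOException;
    }

    /** Serializes into a temporary buffer and returns the byte count; the bytes are discarded. */
    static int computeSize(StreamWriter writer) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            writer.writeTo(buffer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }
}
```

This only gives the right answer if the writer is deterministic, which is the stability requirement mentioned above. If the extra buffer ever becomes a concern, a counting OutputStream that discards its input would return the same number without holding the bytes.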

Do you think this could work?

PS: I guess we'll get the same problem with the IPTC block...

Harald K

jasonwpalmer commented 9 years ago

Yes - I was afraid you wouldn't like the extra overhead, but yes that should work.

OK, great.

Just something to keep in mind - this eliminates the padding idea you had with editing existing - with this model, we would never know the original size and would have to simply write it out and add padding as necessary for the next reader. But I don't see why that is a problem - as long as we are writing a new TIFF.

Anyway - just thought I'd mention that part (because you might still be seeing it differently). And as always - I'm not asking you to sign a contract here :), I know it can later change as it matures. And we figure out the best way to do it. But I'm going to go with that, especially seeing that it really doesn't require any code changes to what I have written so far - good deal.

Thanks.

haraldk commented 9 years ago

Great! :-)

We probably need to support the scenario of updating XMP in-line as well, some time in the future, but until then I think this is ok (we might keep the original XMP block size as a private field on the root XMPDirectory for that use, but I don't see much need for it right now).

Harald K

jasonwpalmer commented 9 years ago

Something worth noting (thinking out loud that is...)

My version of XMPWriter utilizes 2 in-memory Maps based on the respective specs to deal with rdf:parseType="Resource" and list items (rdf:li). This allows us to keep the implementation simple for the client because they won't have to worry about providing such metadata, but it is limiting in the fact that we must ignore items not found in our in-memory Map. This means any new specs or changes to existing specs would require changes to the source code to support such elements. And I've added a few namespaces to deal with pretty much anything Photoshop will throw at us as it stands.

I believe it is fine because this way it leaves the client code simple. But I want to mention that this is overcome by building the aforementioned in-memory Maps based on a slightly more complex model for XMPReader. We would need to add the extra metadata for the Maps, but then if any client code wants to add a property that requires the Map metadata - the client will be forced to provide it. It wouldn't be too difficult, probably instead of...


Map<String, Object> mapval = (Map<String, Object>)entry.getValue();
//rdf:Alt
List

haraldk / TwelveMonkeys

Write IPTC-NAA metadata. #136