guardian / grid

The Guardian’s image management system
https://www.theguardian.com/info/developer-blog/2015/aug/12/open-sourcing-grid-image-service
Apache License 2.0

Add support for IPTC Subject Codes #4024

Open paperboyo opened 1 year ago

paperboyo commented 1 year ago

IPTC Subject Code is a newer replacement for the old IPTC IIM Category and Supplemental Category fields. Currently, Grid only understands those ancient IIM fields, and only when they are stored in the IIM part of the metadata. Grid should be made to:

  1. understand their XMP versions (xmp.photoshop:Category and xmp.photoshop:SupplementalCategory respectively) and read them here
  2. understand the newer (but still legacy) IPTC Subject Codes and map those to the existing list of Grid Subjects. Mappings are available here.
  3. We could also spend some time looking at whether the list of Grid subjects could be made more useful, and whether the metadata on imagery from suppliers would allow for a more useful list. I don’t think Subjects should provide any extensive ontology. They are far more useful as a quick, short list to exclude/include big unwanted chunks of the corpus. The fact that IPTC’s vocabularies have grown more and more extensive, combined with the fact that newer schemas enjoy less and less support, seems to back up this view. But I can see one useful fix: separating fashion and catwalk from Arts (if possible via metadata sent to us).
  4. The current IPTC recommendation is to use the even more extensive CV-Term About Image. But among ~53 million images, the only one having this property is… an IPTC test image, so we have another decade before we need to worry about that, I guess 😜.
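
To make item 2 a bit more concrete, here is a rough sketch of how Subject Codes could be collapsed into a short list of subjects. It is written in Python for brevity and is not Grid code. The function name `subjects_from_codes`, the mapping table and the lowercase subject names are all hypothetical, and the table covers only a few of the top-level codes. It assumes a code arrives either as a bare 8-digit string or embedded in a colon-separated IIM-style value, and that only the first two digits (the top-level subject) matter for a short list.

```python
import re

# Illustrative subset only: top-level IPTC Subject Code prefix -> subject name.
# The subject names on the right are assumptions, not Grid's confirmed list.
TOP_LEVEL_TO_SUBJECT = {
    "01": "arts",      # arts, culture and entertainment
    "02": "crime",     # crime, law and justice
    "03": "disaster",  # disaster and accident
    "04": "finance",   # economy, business and finance
    "15": "sport",     # sport
    "17": "weather",   # weather
}

# An 8-digit code that is not part of a longer run of digits.
SUBJECT_CODE = re.compile(r"(?<!\d)(\d{8})(?!\d)")


def subjects_from_codes(raw_values):
    """Return the de-duplicated subjects for a list of raw Subject Code values.

    Values without an 8-digit code, or with an unmapped top-level prefix,
    are skipped. Order of first appearance is preserved.
    """
    subjects = []
    for raw in raw_values:
        match = SUBJECT_CODE.search(raw)
        if match is None:
            continue
        subject = TOP_LEVEL_TO_SUBJECT.get(match.group(1)[:2])
        if subject is not None and subject not in subjects:
            subjects.append(subject)
    return subjects


print(subjects_from_codes(["01000000", "IPTC:15000000:sport", "not a code"]))
```

Collapsing on the two-digit prefix keeps the list short, which fits item 3, but it also means a finer split such as fashion versus the rest of Arts would need extra entries keyed on the full code.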

honorcb commented 1 year ago

You should be looking at https://iptc.org/standards/media-topics/, the replacement for the Subject Codes. The work to replace the Subject Codes was started by a small group of members from BBC Scotland, AP and PA in about 2003.

paperboyo commented 1 year ago

IIUC, Media Topics are supposed to be the newest version of Categories/Subject Codes. They use controlled-vocabulary NewsCodes. But those (for Media Topics) are written into CV-Term About Image, right? And none of the XMP fields that make up the CV-Term structure is present in any image in our corpus.

Am I wrong and are they saved into some other XMP field? Have you seen them anywhere in the wild?

In any case, I still think that for those to be useful, we need a short and manageable list. Sadly, not even this is supplied by some of our biggest suppliers…