gotson / komga

Media server for comics/mangas/BDs/magazines/eBooks with API, OPDS and Kobo Sync support
https://komga.org
MIT License
4k stars 236 forks

[Feature Request] Support for PDF Metadata #277

Closed: hot22shot closed this issue 3 years ago

hot22shot commented 4 years ago

Describe the solution you'd like

As was done for the ComicInfo and EPUB file formats, I'd like to know whether you plan to support PDF metadata extraction.

Regards

gotson commented 4 years ago

Even though I suppose PDFs would have metadata, does it happen in real life for comics?

Would you be able to share what kind of metadata you're referring to, and which Komga field you would expect it to be mapped to?

garbled1 commented 4 years ago

So personally, I happen to use Komga for two things: 1) reading some comics, and 2) easy access to my vast collection of PDF hardware manuals. But if I look at one:

[venv] [garbled@polaris:/manuals/hardware/SuperMicro]$ pdfinfo SC847.pdf 
Title:          SC847-4U-Chassis - Super Micro Computer, Inc.
Subject:        SC847-4U-Chassis - Super Micro Computer, Inc.
Author:         Super Micro Computer, Inc.
Creator:        Adobe InDesign CS5.5 (7.5)
Producer:       Adobe PDF Library 9.9
CreationDate:   Fri Nov  8 12:38:19 2013 MST
ModDate:        Fri Jun 17 01:51:10 2016 MST
Tagged:         yes
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          188
Encrypted:      no
Page size:      388.8 x 594 pts
Page rot:       0
File size:      40988083 bytes
Optimized:      no
PDF version:    1.4

I feel like that is some great data to pull in. At least the date, title, subject, author lines?
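
For anyone wanting to survey their own files the same way, here is a minimal sketch of turning `pdfinfo` output like the above into a dictionary. It is written in Python rather than Komga's own language, and `parse_pdfinfo` is a hypothetical helper, not part of Komga or poppler. It assumes one `Key: value` pair per line and skips lines without a colon, so values wrapped across several lines would be truncated.

```python
def parse_pdfinfo(text: str) -> dict[str, str]:
    """Parse the text output of `pdfinfo` into a {field: value} dict."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        # Split on the first colon only, so times like 12:38:19 stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # no colon: not a "Key: value" line
        fields[key.strip()] = value.strip()
    return fields


sample = """\
Title:          SC847-4U-Chassis - Super Micro Computer, Inc.
Author:         Super Micro Computer, Inc.
CreationDate:   Fri Nov  8 12:38:19 2013 MST
Pages:          188
"""

info = parse_pdfinfo(sample)
print(info["Title"])
```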

gotson commented 4 years ago

It's a bit difficult to decide which field would go where with a sample of 1.

Do you know where this information is coming from? Are those pdfs coming from a vendor, or is the data filled by someone else before sharing it?

Is the content of the fields consistent between all your PDFs?

I would surmise that those metadata could also be filled with garbage automatically generated by conversion software for instance, and an auto import feature would produce less than ideal results.

I'm not sure the date is of any interest, it looks like the file creation date, not the release date.

Author might be of interest, but what about creator and producer?

Title and subject contain the same data, so probably subject is superfluous in that particular case.

Basically what I would need before doing anything in that direction is to get a better understanding of :

  1. What is the definition of the pdf metadata fields in general, ie what they've been designed to contain according to the spec of the pdf format
  2. Is that data usually accurate according to the spec, or is it usually complete garbage

garbled1 commented 4 years ago

Well, let me look at a bunch of files then, I have a pretty random collection of stuff, some hand-made..

From the docs of the program pdfinfo:

DESCRIPTION
       Pdfinfo prints the contents of  the  'Info'  dictionary  (plus  some
       other  useful  information)  from  a  Portable Document Format (PDF)
       file.

       The 'Info' dictionary contains the following values:

              title
              subject
              keywords
              author
              creator
              producer
              creation date
              modification date

So I've just sat here digging through my hundreds of pdfs, and here is what I can report...

  1. A few simply don't have various fields, but when they do have them, they are generally useful.
  2. At least for hardware manuals, the author field is kinda all over the place, but I feel that has more to do with what I'm reading than with the spec. I feel like it's generally useful though.
  3. The creator field is bunk. It's pretty much the software used to make the PDF.
  4. The title/subject fields are not always the same, and provide some useful differentiation.
  5. The creation date or mod date would be super useful. For example, some of these are hardware specs, so knowing what revision I'm looking at would be great: "Oh, this is the 1998 spec, I need the later rev". The dates seem to more or less correspond to the spec dates inside the manuals.

Some examples:


Title:          2486Dxx
Subject:        
Keywords:       
Author:         Jen Mathiasen
Creator:        Acrobat PDFMaker 10.0 for Word
Producer:       Acrobat Distiller 10.0.0 (Windows)
CreationDate:   Fri Jan 27 18:27:37 2012 MST

This looks like trash to you, but I know what a 2486Dxx is, so that's pretty great to have.


Title:          Microsoft Word - 20060901_Roomba Service Manual.doc
Author:         tgiesecke
Creator:        PScript5.dll Version 5.2.2
Producer:       Acrobat Distiller 7.0.5 (Windows)
CreationDate:   Tue Sep  5 08:10:22 2006 MST

Maybe not the best title... but.. eh..


Pegasos_2b5.pdf
Title:          Pegasos II
Subject:        microATX dual PowerPC mainboard
Author:         bplan GmbH
Creator:        Design Explorer
Producer:       Acrobat PDFWriter 4.05 für Windows NT
CreationDate:   Thu Mar 18 08:59:02 2004 MST
ModDate:        Mon Jan  9 08:30:46 2006 MST

The author isn't wonderful there, but the subject/title are useful


PM_1.9_Install_Admin_User Guide.pdf
Title:          untitled
Creator:        FrameMaker 7.1
Producer:       Acrobat Distiller 7.0.5 (Windows)
CreationDate:   Fri Aug 18 02:33:28 2006 MST
ModDate:        Fri Aug 18 10:54:47 2006 MST

Ok that one is annoying.. but it's really a corner case.

GT724WGR-Wireless-DSL-Modem-User-Manual_1s.pdf
Title:          The Actiontec GT724WGR combines a DSL modem, wireless networking, and DSL router in one box.
Subject:        User Manual
Keywords:       dsl modems,dsl,gt724wgr,adsl router modems,adsl,adsl2,adsl2+, broadband, dsl routers,sharing a broadband connection,adsl modems,broadband modems,router modems,wireless networking,home network
Author:         Actiontec Electronics
Creator:        Adobe InDesign 2.0.2
Producer:       Mac OS X 10.5.3 Quartz PDFContext
CreationDate:   Sun Aug 10 10:00:17 2008 MST
ModDate:        Sun Aug 10 10:04:21 2008 MST

Ok, wow, they went all out.

ipmi-second-gen-interface-spec-v2-rev1-1.pdf
Title:          Intelligent Platform Management Interface Specification Second Generation v2.0
Subject:        
Author:         Intel Corporation
Creator:        Microsoft® Word 2013
Producer:       Microsoft® Word 2013; modified using iText® 5.1.2-SNAPSHOT ©2000-2011 1T3XT BVBA
CreationDate:   Thu Oct 10 14:24:32 2013 MST
ModDate:        Mon May 27 16:25:10 2019 MST

Intel seems to be really consistent with putting the author in..

2441Vdev-052013-en.pdf
Title:          Microsoft Word - 2441TH Thermostat Dev Notes 20130523.doc
Author:         jlockyer

This might look non-ideal, but wow, not having to remember that a 2441 is the thermostat unit would save me a bunch of time looking in each one..
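
The noisy titles in these examples follow a few recognizable patterns, so an importer could tidy them before use. The following is a minimal sketch in Python, based only on the samples shown here. `clean_title` and `PLACEHOLDER_TITLES` are hypothetical names, and the rules (strip a `Microsoft Word - ` prefix, strip a trailing `.doc`/`.docx`, treat `untitled` as missing) are assumptions, not anything Komga implements.

```python
import re
from typing import Optional

# Titles treated as "no title at all" (assumed list, taken from the samples above).
PLACEHOLDER_TITLES = {"", "untitled"}


def clean_title(raw: str) -> Optional[str]:
    """Return a tidied title, or None if the title carries no information."""
    title = raw.strip()
    # Word's PDF export prepends the application name to the file name.
    title = re.sub(r"^Microsoft Word - ", "", title)
    # Drop a trailing Word file extension left over from the source file name.
    title = re.sub(r"\.docx?$", "", title, flags=re.IGNORECASE)
    if title.lower() in PLACEHOLDER_TITLES:
        return None
    return title


print(clean_title("Microsoft Word - 20060901_Roomba Service Manual.doc"))
print(clean_title("untitled"))
```

Titles that match none of the patterns, such as `2486Dxx`, pass through unchanged.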

Overview of my 500 manuals: So overall, there are a few with stupid titles, or silly author names. I would say after looking at my collection (which covers a really wide swath of manuals from processor/chipset specs to an oven manual, and some home-made stuff):

95% of them have useful titles. Author is kinda random. When it's set nicely, it's super useful; when not, it's just unimportant. Like IBM puts IBM as author on a bunch, or Intel. But some places are less useful and the author is just "bob". In the ones that have both author and subject, 80% of the time it's different, and usefully so. I only found 1 with keywords. Creator/Producer is 100% software used to make the file. On most of them, the dates are at least semi-useful. They seem to correspond correctly to the data in the manuals.

Overall, I'd say that importing this data would vastly improve my library. In the ones where it has the data, it almost always improves the quality of it.
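
Percentages like these can be reproduced on any collection with a small tally. This is a minimal Python sketch, assuming each file's Info fields are already available as a dictionary keyed by the `pdfinfo` field names. `field_coverage` is a hypothetical helper, and it only measures whether a field is non-empty, not whether its content is useful.

```python
from collections import Counter


def field_coverage(infos, fields=("Title", "Subject", "Keywords", "Author")):
    """Return, per field, the fraction of files where that field is non-empty."""
    counts = Counter()
    for info in infos:
        for field in fields:
            if info.get(field, "").strip():
                counts[field] += 1
    total = len(infos)
    return {field: (counts[field] / total if total else 0.0) for field in fields}


# Two of the samples from this thread, reduced to the fields of interest.
infos = [
    {"Title": "Pegasos II", "Subject": "microATX dual PowerPC mainboard", "Author": "bplan GmbH"},
    {"Title": "2486Dxx", "Subject": "", "Keywords": "", "Author": "Jen Mathiasen"},
]
print(field_coverage(infos))
```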

It looks like the Info dictionary is defined in the PDF spec itself. A quick Google search found that it's an optional section at the end of pre-2.0 spec PDF files, and it is documented in the Adobe PDF reference. https://www.adobe.com/devnet/pdf/pdf_reference.html

gotson commented 4 years ago

Thanks for that analysis on your files, it is really helpful.

I found some information about the different fields here. Creator and Producer are indeed programs, not people. We can safely discard them.

I ran a similar analysis on my files, which are not as tidy as yours. I found out a few things:

For example, on a magazine published on the 30th of April 2020, I have this:

CreationDate:   Fri Dec 31 00:00:00 1999
ModDate:        Wed Apr 29 10:11:01 2020
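
Dates in this `pdfinfo` form can be compared programmatically. The following is a minimal Python sketch, with `parse_pdfinfo_date` as a hypothetical helper. It assumes English day and month names (the default C locale) and drops any trailing timezone abbreviation such as `MST`, because `strptime`'s `%Z` only recognizes a handful of zone names.

```python
from datetime import datetime


def parse_pdfinfo_date(value: str) -> datetime:
    """Parse a date as printed by pdfinfo, ignoring any timezone suffix."""
    parts = value.split()
    # Keep weekday, month, day, time and year; drop a trailing zone name if present.
    return datetime.strptime(" ".join(parts[:5]), "%a %b %d %H:%M:%S %Y")


created = parse_pdfinfo_date("Fri Dec 31 00:00:00 1999")
modified = parse_pdfinfo_date("Wed Apr 29 10:11:01 2020")

# A creation date decades before the modification date suggests a placeholder value.
print(modified.year - created.year)
```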

How would you see the usage of Subject when it is different from Title? Add it as a Genre in Komga?

garbled1 commented 4 years ago

So looking at mine, I see a few different types.. I almost feel like it should just be placed in the summary when available:

Subject:        PC87332
Subject:        http://www.datasheetarchive.com
Subject:         PC87309
Subject:        Tsi107
Subject:         PC87307,  PC97307
Subject:        Tsi107
Subject:        DATASHEET SEARCH, DATABOOK, COMPONENT, FREE DOWNLOAD SITE
Subject:        Tsi106
Subject:        PC87306
Subject:        PC16550D
Subject:        http://www.datasheetarchive.com
Subject:        PC87310
Subject:        PC87317, PC97317
Subject:        This document provides complete functional descriptions, electrical specifications, and physical characteristics for the LSI53C875 and LSI875E PCI to Ultra SCSI I/O processors. The LSI53C875/75E enables the connection of Ultra SCSI drives to a host system through a PCI bus. Ultra SCSI is an extension of the SCSI-3 specification that supports transfers of up to 20 Mbytes/s. 
Subject:        This document describes porting Montavista’s Hardhat™ Linux from a sandpoint 2 platform with the MPC8240PMC, the MPC755PMC, or the MPC7400PMC to the PowerPC™ MPC7450/MPC7451. It explains how to set up the development environment and how to compile, load, and run the resultant Hardhat Linux on the sandpoint MPC7450 platform. 
Subject:        PowerXpress Product
Subject:        Programming the MVME2400
Subject:        Programming information for the MCP750 Series of Single Board Computers.
Subject:        SC847-4U-Chassis - Super Micro Computer, Inc.
Subject:        MNL-1083 - 5500 - Motherboard - Super Micro Computer, Inc.
Subject:        MNL-1144 - 3420 - Motherboard - Super Micro Computer, Inc.
Subject:        User Manual
Subject:        SCREEN - Metered Rack Power Distribution Unit
Subject:        User's Guide
Subject:        Reference Guide
Subject:        Setup Guide
Subject:        Reference Guide5
Subject:        Setup Guide
Subject:        Reference Guide

I think that's a pretty good representation of mine. Most of those are chipset names, though a few have really nice descriptions to be honest. A few say user/setup/reference guide, but those seem to not be in the majority. A bunch have the release revision of the document/chip.

Only a very few had keywords, I guess tags for those if they actually appear?

gotson commented 4 years ago

The problem with Subject is indeed that there is no specification for the field. Some of yours look like a genre, some like a summary.

Keywords will go to tags, split by comma.
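
The mapping discussed so far can be sketched as follows. This is a minimal Python illustration, not Komga's implementation: `split_keywords` and `map_info` are hypothetical helpers, and the target field names (`title`, `authors`, `tags`) are assumptions for the example. Creator and Producer are discarded, and Subject is left unmapped because the thread has not settled where it should go.

```python
def split_keywords(raw: str) -> list[str]:
    """Split a Keywords value on commas, trimming blanks and dropping duplicates."""
    tags: list[str] = []
    for part in raw.split(","):
        tag = part.strip()
        if tag and tag not in tags:
            tags.append(tag)
    return tags


def map_info(info: dict[str, str]) -> dict:
    """Map PDF Info fields to a small metadata dict (hypothetical target names)."""
    return {
        "title": info.get("Title") or None,
        "authors": [info["Author"]] if info.get("Author") else [],
        "tags": split_keywords(info.get("Keywords", "")),
        # Creator and Producer are intentionally ignored: they name software, not people.
    }


print(split_keywords("dsl modems,dsl, broadband, dsl routers,,dsl"))
print(map_info({"Title": "Pegasos II", "Author": "bplan GmbH", "Creator": "Design Explorer"}))
```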

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.