jimmejardine / qiqqa-open-source

The open-sourced version of the award-winning Qiqqa research management tool for Windows
GNU General Public License v3.0

support MHT/MHTML/CHM next to PDF: ability to store, read, index, search web pages & other publications #329

Open GerHobbelt opened 3 years ago

GerHobbelt commented 3 years ago

Why MHTML? (⩶ MHT)

A lot of material can be kept in PDF very nicely. Printed papers, presentations, etc.

While I don't want to "succumb" to supporting all kinds of formats available out there (and thus replicate Apache Tika + Solr server installations: that's where you're heading functionality-wise, no matter what software you use to code this in), I've felt for a long time that HTML should be supported as an equal-rights citizen next to PDF. Anything useful can be converted to HTML or PDF, and HTML is the easier target for the non-paged documents out there: publications / whitepapers which are published solely as web pages, such as blog articles, Wikipedia entries, and so on. There's no clear page break in there anywhere, which makes it bothersome to convert them to PDF without losing some of the look and feel in the process: any page number you end up with is artificial.

HTML by itself, however, does not carry the styling (that's what CSS is for, and most pages use CSS files next to inline CSS), while images included in an HTML page are almost always separate files themselves. If we want to store such an "HTML page" as seen by the human user with minimal effort and maximum fidelity, our browsers offer a "bundler format": MHTML (a.k.a. .mht / MHT in some browsers), which is a simple archive of the required CSS and image files, bundled with the HTML file in such a way that we can view that web page off-line at any time.
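To make the "simple archive" bit concrete: an MHTML file is a MIME `multipart/related` message, so a stock MIME parser can already split it into its HTML, CSS and image parts. Below is a minimal sketch in Python using only the standard library. It is an illustration, not Qiqqa code: the function name `list_mhtml_parts`, the `SAMPLE` archive and the example.org URLs are all made up for the demo, and real browser-generated MHTML files may need more care (character sets, nested multiparts, odd headers).

```python
import email
from email import policy


def list_mhtml_parts(raw: bytes):
    """Return (content_type, content_location, payload_bytes) per archived part.

    Hypothetical helper: treats the MHTML file as a plain MIME message.
    """
    msg = email.message_from_bytes(raw, policy=policy.default)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # skip the multipart/related container itself
        location = part.get("Content-Location")
        parts.append((
            part.get_content_type(),
            str(location) if location is not None else None,
            part.get_payload(decode=True),  # undoes base64 / quoted-printable
        ))
    return parts


# Tiny hand-written archive (assumption: stands in for a browser-saved .mht file).
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; type="text/html"; '
    b'boundary="----demo-boundary"\r\n'
    b"\r\n"
    b"------demo-boundary\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"Content-Location: https://example.org/page.html\r\n"
    b"\r\n"
    b"<html><body><h1>Hello</h1></body></html>\r\n"
    b"------demo-boundary\r\n"
    b"Content-Type: text/css\r\n"
    b"Content-Transfer-Encoding: 7bit\r\n"
    b"Content-Location: https://example.org/style.css\r\n"
    b"\r\n"
    b"h1 { color: navy; }\r\n"
    b"------demo-boundary--\r\n"
)

if __name__ == "__main__":
    for content_type, location, payload in list_mhtml_parts(SAMPLE):
        print(content_type, location, len(payload))
```

For indexing and search, the `text/html` part would be the one to feed to a text extractor, while the remaining parts only matter for rendering.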

Just one example of a "man page"-like document which is not paged, i.e. not page/print-formatted:

https://en.cppreference.com/w/cpp/atomic/memory_order

which would be stored (and indexed, and searched, and annotated) in its original continuous-flow format: HTML --> MHTML -- for we want to keep the styling, etc. intact and have the entire entity copied locally, so we achieve/archive a fully independent copy in our library!

Why CHM?

😊 😅 While that one is not absolutely necessary, I do have a heap of documentation in that format, which is basically a bundle of HTML pages all kept together.

Granted, we're balancing on the edge (or have we already fallen off on the wrong side here?) with this one, as there's no strong argument for it other than that some stuff comes in sets of HTML pages. MHTML will not help us there, so this is the closest thing to "MHTML for multiple related pages", like publications which were done in "installments" (web-published ebooks, tutorials in multiple installments, ...). Here the criterion for using CHM instead of PDF is again: it wasn't paged or nicely pageable material to start with, so it would be nice to keep the original flow and styling intact. PDF isn't exactly suitable here, as the HTML pages (the individual installments of the article/publication) have variable length, which does not survive direct-to-PDF conversion without getting those less-than-optimal page breaks and senseless resulting page numbers.

Why not PowerPoint, Excel, etc?

Presentations (apart from their animations) map to paged media very nicely, hence PDF would be a perfect target.

If the animations (dynamic content) are important, then a conversion to HTML would be more suitable. Remember, I'd very much like to stay away from spending my time supporting a zillion input formats, as that would not only entail running or producing something like Apache Tika, but also a powerful viewer & annotation editor which must support all these file formats: seems like a tough job for a single camel, right? 😉

Excel spreadsheets and other 'data lists' (or should we call them "data tables" or "data cubes"?) are important, but they do strain the "document library" paradigm of Qiqqa, i.e. how far do we need to stretch to explain this one as yet another document format in which we expect to search content -- like we would expect to do with our documents? Meanwhile, Excel sheets, etc. can be transformed ("snapshot?") to HTML pages, so we can have Excel sheets in our document store that way. For the true aficionados, a multi-tabbed Excel spreadsheet (i.e. a "3D spreadsheet") would then map to a multi-page HTML set, which is me selling that CHM format to myself for this purpose as well: see how useful it can be? 🤡 😇


MHTML + CHM covers everything with the need for one renderer only

Since we need/have a full-fledged HTML renderer on board for other purposes already (Web Sniffer, Browser, later also for: UI), anything that's basically HTML+CSS+IMAGES is "supported natively" by that renderer and thus expected to be "minimal effort".

MHTML is good for non-paged single publications.

CHM (or something like that) serves well for "multiple page" publications which have "variable page length" per "page", e.g. multiple-installment publications on a blog/website.

GerHobbelt commented 3 years ago

Note: might be smarter to use EPUB format instead of CHM: too bad for CHM but EPUB smells more like the future, despite it having been restricted in what it can and will handle in terms of CSS, etc.

EPUB 3.2 basically reads as HTML+CSS+JS bundled in ZIP with some extra metadata files and has wider support on computing platforms than does CHM.
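As a rough illustration of "HTML+CSS+JS bundled in ZIP with some extra metadata files": an EPUB container carries a `mimetype` entry plus a `META-INF/container.xml` that points at the package document, which in turn lists the content files. The Python sketch below uses only the standard library and builds its own toy archive. The helper names (`build_sample_epub`, `find_package_document`) and the `OEBPS/...` paths are assumptions for the demo, and the toy archive is deliberately incomplete, so don't expect it to pass an EPUB validator.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by META-INF/container.xml in the EPUB container format.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}

CONTAINER_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<container version="1.0" '
    'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">\n'
    "  <rootfiles>\n"
    '    <rootfile full-path="OEBPS/content.opf" '
    'media-type="application/oebps-package+xml"/>\n'
    "  </rootfiles>\n"
    "</container>\n"
)


def build_sample_epub() -> bytes:
    """Build a toy, EPUB-shaped ZIP in memory (hypothetical demo data only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # The mimetype entry goes first and is stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("OEBPS/content.opf", "<package/>")  # placeholder only
        zf.writestr("OEBPS/part1.xhtml", "<html><body>Part 1</body></html>")
    return buf.getvalue()


def find_package_document(epub_bytes: bytes) -> str:
    """Return the archive path of the package document named in container.xml."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        if zf.read("mimetype").strip() != b"application/epub+zip":
            raise ValueError("not an EPUB container")
        root = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        if rootfile is None:
            raise ValueError("container.xml lists no rootfile")
        return rootfile.attrib["full-path"]


if __name__ == "__main__":
    print(find_package_document(build_sample_epub()))
```

The point of the sketch is only that the container layer is ordinary ZIP + XML; the hard parts hinted at in this thread (rendering fidelity, annotations) sit above it.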

Do NOT expect an easy ride here, though, IFF you want to have a format that's actually readable by third party tools and other platforms:

Oh, and annotations are something nobody seems to bother with in EPUB / CHM land... 🦈