This particular issue is For Developers Only.

Please file a fresh, new issue if you see something you want to request as a feature, report as a bug, or simply talk about.
Copy&paste the bit of text that's relevant, if you want.
The notorious "Mother Of All PRs" for Qiqqa. (Folks who've worked with me before will know the feeling. 🤯)
Why?
Because I don't want to swamp the issue tracker with the stuff I note, think about, or otherwise need to remind myself about at some later point in time, by which time my brain has very probably given up attempting to track and manage it all.
a.k.a. "Notes To Self"
I've been pondering dropping this stuff in the kanban projects, but that hasn't worked out before: this is about a lot of details that do matter, yet are clutter for everyone trying to get a handle on the overall state of affairs (like me). So, after a long time considering and trying other means -- failures all -- the current idea is to bundle all these devils in a single issue, with perhaps some check boxes, and then use the github EDIT feature rather than COMMENTing each time: once an item in this list is done, it can be DELETED as far as I'm concerned. The git commit log serves well for keeping track of what happened and what was done; it's the 'gotta do / check this' buggers that need a place to go, and I want to keep it all in a single website, i.e. github.
Observed Crappiness?
[ ] BibTeX editor. OMG. 🤯 I know it's hacky (hey, I did that back in '19, I know), but the parser is... and the (re)formatter... and the editor modes are 🤡 -- find time and effort to do a proper editor one day, please. Doesn't need to be smart, just flexible. Tolerant, say.
[x] ~antique `pdfdraw.exe` still locks up on some particular 'evil PDFs'~ --> filed as #305, as it happens almost every day now with my test repo 😭
[ ] ditto for the SynFusion-based annotation (etc.) metadata extraction logic: here we end up with a runaway memory consumption problem in .NET, resulting in out-of-memory a little later on.
--> work is done in the MuPDF repo (`mutool multipurp`): the `multipurp` tool was created to extract a metric ton of PDF metadata, including outlines, annotations and attachments, and dump all that to JSON, so we can easily go through this stuff picking up what we want/need at that particular mo.
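For the Qiqqa side, a minimal sketch of shelling out to the tool and picking up its JSON dump; the exact `multipurp` command line and output shape are assumptions here, not the tool's documented interface:

```csharp
// Minimal sketch: run the (hypothetical) `mutool multipurp` invocation and
// parse whatever JSON it emits. CLI flags and JSON shape are assumptions.
using System.Diagnostics;
using Newtonsoft.Json.Linq;

static class MultipurpClient
{
    public static JObject DumpMetadata(string mutoolPath, string pdfPath)
    {
        var psi = new ProcessStartInfo
        {
            FileName = mutoolPath,
            Arguments = $"multipurp \"{pdfPath}\"",   // hypothetical invocation
            RedirectStandardOutput = true,
            UseShellExecute = false,
            CreateNoWindow = true,
        };
        using (var proc = Process.Start(psi))
        {
            string json = proc.StandardOutput.ReadToEnd();
            proc.WaitForExit();
            return JObject.Parse(json);   // pick annotations/outlines/etc. from here
        }
    }
}
```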
BibTeX parsing:
[ ] check the Unicode translation code: it works, but a little too well, as at least `\LaTeX` and `\kern` TeX macros get their leading chars converted to Unicode, and that's plain wrong. (See the sketch after this list.)
[ ] Go through the test set and vet the results in the sollwert files: there's very probably a couple of issues lurking in there still.
[ ] BibTeX (re)formatting? Yech, TABs! 🤢
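To illustrate the `\LaTeX` / `\kern` item above: a minimal sketch of the boundary bug and the obvious fix, with an illustrative macro table (this is not the actual Qiqqa translation code):

```csharp
// A naive "replace \L with Ł" pass will also mangle \LaTeX into ŁaTeX, and
// \k (ogonek) will eat the start of \kern. Matching the *complete* control
// word (letters NOT followed by another letter) avoids that.
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TexToUnicode
{
    static readonly Dictionary<string, string> Map = new Dictionary<string, string>
    {
        { "L", "\u0141" },   // \L  -> Ł
        { "ss", "\u00DF" },  // \ss -> ß
        // ... the real table is much larger
    };

    public static string Translate(string bibtex)
    {
        // A control word is backslash + letters; the lookahead ensures \L
        // does not fire inside \LaTeX, nor \k inside \kern.
        return Regex.Replace(bibtex, @"\\([a-zA-Z]+)(?![a-zA-Z])", m =>
        {
            string name = m.Groups[1].Value;
            return Map.TryGetValue(name, out var repl) ? repl : m.Value;
        });
    }
}
```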
[ ] Qiqqa Search: searching for `fingerprint:HASH` (e.g. `fingerprint:20359B18C8D6AC93F836962526FDC306118486C`) doesn't work. Would be a handy debug/diag tool, while the help screen says fingerprint is recognized as a field. Well... (Also does not work in global search. Obviously.)
Crazy Ideas To Try?
[ ] how about letting the user edit it in another tool, then re-import the spreadsheet and bulk-update the metadata that way?
[ ] I loathe WPF (OK, we already knew that, tell me something new, will ya?); ditch the crappy table in the Sync Manager dialog and replace it with a ReoGrid instance, i.e. make it act like a spreadsheet too. ReoGrid says they can do readonly on cells. Yay! 🥳
[ ] use a Nuke-like node editor interface for the OCR process. Find a way to have it communicate with mupdf+leptonica+tesseract.
As different PDFs require different processing, should we store the graph per PDF?
[ ] Better to link PDFs to an "OCR template" of some kind: that way the user automatically keeps them "grouped" accordingly.
[ ] Ditto for text extraction!
Remember your own bloody OnSemi (ex-Fairchild) datasheets with obnoxious cover sheets!
Remember those Taiwanese Uni PDF papers you have, which have their own flavour of more-or-less useful cover sheets to strip off / reduce.
Oh! Oh! The cover sheets of that scientific PDF aggregation site you got them from. Suppressed the name, but you know which one I mean! Almost white sheet, images of the authors and a bit of crap. Making it hard to recognize the papers as duplicates when you get them from elsewhere.
[ ] Big Binary Chunks coming from background processes (node, mupdf, tesseract, solr, whatever): named pipes for bigger loads, or is that overkill and do we push everything through a socket anyway? (copying binary blocks vs. memory-mapped I/O for large data chunks)
[x] Given what I read about it from others, plus personal experience in the past, `mmap()` is superb, but for this cross-platform stuff we'd better stick to named pipes or localhost TCP loopback: the latter being the most generic, while pipes would be great too -- but named pipes on Windows, at least, are visible outside the machine, thus posing a security risk unless I do something smart about it: https://docs.microsoft.com/en-us/windows/win32/ipc/named-pipes (see the bit there about `NT AUTHORITY\NETWORK`).
[ ] memory-mapped I/O only works really well when you know the required block size up front, so an alternative might be sending handles to memory-mapped areas to share among processes, instead of copying data like mad through pipe or TCP/IP (short-circuited) IPC. Still, that's more work to code and make robust, so I guess we'll stick with the pipes and local-loopback TCP IPC instead, as that's standard fare and also maps very well onto the idea of running the important stuff in server apps on the local box (civetweb, solr, ...)
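For the Windows named-pipe risk flagged above, a minimal sketch of locking a pipe down to local users only, following the approach on the linked MSDN page; the exact rights granted are a judgment call, not gospel:

```csharp
// Minimal sketch (Windows / .NET Framework): create a named pipe that
// explicitly denies NT AUTHORITY\NETWORK, so remote machines cannot connect
// even though named pipes are reachable over SMB by default.
using System.IO.Pipes;
using System.Security.AccessControl;
using System.Security.Principal;

static class LocalOnlyPipe
{
    public static NamedPipeServerStream Create(string name)
    {
        var security = new PipeSecurity();
        // Deny all network logons outright...
        security.AddAccessRule(new PipeAccessRule(
            new SecurityIdentifier(WellKnownSidType.NetworkSid, null),
            PipeAccessRights.FullControl, AccessControlType.Deny));
        // ...and allow the current (local) user.
        security.AddAccessRule(new PipeAccessRule(
            WindowsIdentity.GetCurrent().User,
            PipeAccessRights.ReadWrite, AccessControlType.Allow));

        return new NamedPipeServerStream(name, PipeDirection.InOut, 1,
            PipeTransmissionMode.Byte, PipeOptions.Asynchronous,
            0x10000, 0x10000, security);
    }
}
```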
[x] currently the C# code shuffles and copies the image data multiple times at a horrible performance cost: C# isn't exactly beautiful for image processing (or rather: I'm not versed well enough in the language+libs to make this crap run fast, more probably).
[ ] "next stage" would therefor be either checking out the ways pros do image processing in C#, or we move that crap to a C++/C process and deal with the binary blobs traveling across the barrier every time. ๐ข ๐ข
[ ] tesseract is integrated with mupdf by Artifex 🥳 -- now open that bugger up and permit scriptable/configurable image processing inside mupdf-as-you-have-it, so we can optimize OCR. Drive this with the Nuke-style node editor mentioned above.
I mention Nuke because that's what my brain comes up with for this Rorschach Blot, but it's more like Autodesk 3dsmax's Material Editor and their ilk: intermediate or final results shown as a visual thumbnail in each node, to help track what happens where:
[ ] consider adding OpenCV lib to the mupdf tool, next to leptonica (which is already the default part of the pdf-to-tesseract path in there), so we have additional image filters / processing for the OCR workflow. --> prepwork for facilitating pre- and postprocessing of PDF image and text data: https://github.com/GerHobbelt/owemdjee
[ ] add a 'webserver' mode to mupdf (mutool + mudraw: both for metadata extraction, text extraction, OCR processing (and consequent text extraction) and PDF-page-to-image rendering) so we don't pay the (costly?!) tool startup time for every page or bit of (meta)data we're requesting. (A client-side sketch follows a few items below.)
[ ] be prepared to kill the bugger or expect the bastard to lock up or crash due to nasty PDF inputs once every couple of documents: feed the entire evil library through it, and then everything else you can grab off the Net.
Reminder: the SynFusion libs b0rked out with a HUGE memory leakage just today, and that was only because Qiqqa was doing a bit of annotations extraction via that one. We got rid of SORAX, but SynFusion is on its way out too. 😢
[x] --> `mutool multipurp` is a new tool in the mupdf palette, derived from `mutool info` and the `mupdf` GUI app. `multipurp` dumps all available metadata for the PDF in JSON format. This includes attachments, annotations, etc.
[ ] see if I should revive my old mongoose clone for this, or grab another light, embeddable C/C++ web server that can do JSON and a whole lot more.
[ ] what's it called now? Or was it the original with MIT license that got renamed? Don't recall precisely. There was a license change back then with some brouhaha, but since Qiqqa and mupdf are (A)GPL, it doesn't matter no more, right?!
[ ] I'm not looking at nginx and its ilk: way too much overkill, and I've worked with them before: I want an embeddable webserver, which can run on `localhost` for as long as Qiqqa is alive: started and stopped by Qiqqa, preferably.
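A minimal sketch of what the Qiqqa side of such a 'webserver mode' could look like; the endpoint, port and parameters are invented for illustration, as no such mupdf service exists yet:

```csharp
// Minimal sketch of a client for a hypothetical local mupdf render service;
// the /render URL scheme and its parameters are assumptions, not an
// existing mutool interface.
using System;
using System.Net.Http;
using System.Threading.Tasks;

static class MuPdfService
{
    static readonly HttpClient Client = new HttpClient
    {
        BaseAddress = new Uri("http://localhost:29876/"),  // hypothetical port
        Timeout = TimeSpan.FromSeconds(30),
    };

    // Render one page to PNG without paying tool startup cost per request.
    public static Task<byte[]> RenderPageAsync(string docHash, int pageNo, int dpi)
    {
        return Client.GetByteArrayAsync($"render?doc={docHash}&page={pageNo}&dpi={dpi}");
    }
}
```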
[ ] Analyze Qiqqa PDF page render behaviour (we're doing that now) and see what we can do to improve performance there with some sort of LIFO / timestamped cache, where we can age off old slots. No more 3-images-per-PDF static crap.
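A minimal sketch of such a timestamped cache, assuming PNG page renders keyed by (document, page, dpi); the slot count and eviction policy are placeholders:

```csharp
// Timestamped page-image cache with aging, replacing the fixed
// "3 images per PDF" scheme. Rendering inside the lock keeps the sketch short.
using System;
using System.Collections.Generic;
using System.Linq;

sealed class PageImageCache
{
    sealed class Slot { public byte[] Png; public DateTime LastUsed; }

    const int MaxSlots = 256;   // placeholder budget
    readonly Dictionary<(string doc, int page, int dpi), Slot> slots
        = new Dictionary<(string, int, int), Slot>();
    readonly object gate = new object();

    public byte[] GetOrAdd((string doc, int page, int dpi) key, Func<byte[]> render)
    {
        lock (gate)
        {
            if (slots.TryGetValue(key, out var hit))
            {
                hit.LastUsed = DateTime.UtcNow;
                return hit.Png;
            }
            if (slots.Count >= MaxSlots)
            {
                // Age off the stalest slot.
                var oldest = slots.OrderBy(kv => kv.Value.LastUsed).First().Key;
                slots.Remove(oldest);
            }
            var png = render();
            slots[key] = new Slot { Png = png, LastUsed = DateTime.UtcNow };
            return png;
        }
    }
}
```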
[ ] Please wrack your brain for a smart solution for the many 'get it as we need them' situations in the code, where we SafeThreadPool the work, but ideally should (see the sketch after this list):
[ ] respond with a fake answer immediately, so the UI gets rendered pronto. (Think the grey boxes on page load you see happenin' with modern websites, which perform async content loading)
[ ] don't care about effin' WPF INotifyWhatWasIt crap and code bloat, "required" for UI updating. Come up with something leaner and preferably faster / as fast once the 'async' data finally arrives.
[ ] extra bonus points when you go through the code and make sure the final result is checked against the current state of affairs in the UI to ensure the data is not already outdated, because the user has moved on and another panel or PDF is currently in view, while the work is about a PDF that's already closed again.
[ ] extra extra bonus points if you can come up with a solid scheme where all this work is abortable without a lot of hassle. It happens in a zillion places and I'd rather not clutter the whole codebase with CancellationToken processing, if I can get away with something leaner, preferably much leaner.
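A minimal sketch of the pattern the items above describe: fake answer first, real answer later, stale results dropped. The delegate names are stand-ins, not existing Qiqqa APIs:

```csharp
// "Grey box now, real content later" with a staleness check before applying
// the late result to the UI.
using System;
using System.Threading.Tasks;
using System.Windows.Threading;   // WPF Dispatcher

static class AsyncFill
{
    public static void Fill<T>(
        T placeholder,               // cheap fake answer, shown immediately
        Func<T> produce,             // the slow work (runs off the UI thread)
        Func<bool> stillWanted,      // e.g. "is this PDF still the one on screen?"
        Action<T> apply,             // UI update
        Dispatcher dispatcher)
    {
        apply(placeholder);          // UI renders pronto, grey-box style

        Task.Run(() =>
        {
            T result = produce();
            dispatcher.BeginInvoke((Action)(() =>
            {
                // Check against the *current* UI state: the user may have
                // moved on to another panel or PDF in the meantime.
                if (stillWanted())
                    apply(result);
                // else: silently drop the outdated result.
            }));
        });
    }
}
```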
[ ] Make that bugger an external lib or use the Zotero scripts via NodeJS, maybe?
[ ] Idea: any BibTeX or whole-DB metadata records (JSON) which do not parse should NOT BE LOST: dump these in a special field called 'b0rked' or something, so we can add a UI or tooling process to post-process those. Particularly important when processing damaged or VERY old Qiqqa libraries, where some stuff seems to go wrong, but we cannot find out what exactly goes haywire.
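A minimal sketch of that fallback, with `ParseBibTeX` as a stand-in for the real parser entry point:

```csharp
// Never lose unparsable metadata: on parse failure, stash the raw text under
// a 'b0rked' key instead of dropping the record, so a later repair pass / UI
// can deal with it.
using System;
using System.Collections.Generic;

static class MetadataImport
{
    public static Dictionary<string, string> ImportRecord(string rawBibTeX)
    {
        try
        {
            return ParseBibTeX(rawBibTeX);
        }
        catch (Exception ex)
        {
            // Keep the evidence for post-processing.
            return new Dictionary<string, string>
            {
                ["b0rked"] = rawBibTeX,
                ["b0rked_reason"] = ex.Message,
            };
        }
    }

    static Dictionary<string, string> ParseBibTeX(string raw)
        => throw new NotImplementedException("stand-in for the real parser");
}
```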
NodeJS / async JavaScript
[x] Ran into this today; keep in mind when moving to electron or doing other JS-based async work: `process.exit()` will immediately terminate any pending promises!
brandonmpetty/Doxa: A Local Adaptive Thresholding framework for image binarization written in C++, with JS and Python bindings. Implementing: Otsu, Bernsen, Niblack, Sauvola, Wolf, Gatos, NICK, Su, T.R. Singh, WAN, ISauvola, Bataineh, Chan and Shafait.
glefundes/Multimethod-Binarization: Efficient implementation of local thresholding image binarization in python for use in multimethod binarization OCR
[x] my old work on the nearly 10-year-old mongoose (a.k.a. civetweb) has been revived and still works, including the 'book sample server app' with drag&drop GUI.
[ ] Had forgotten civetweb came with embedded Lua for scripting. So be it.
[ ] Can serve me well as a test server around mupdf & friends.
[ ] Do a web page for the releases as there'll be several flavours all the time (production, beta, raw test)
[ ] Serialization:
to disk (persistence)
[ ] JSON: UTF8Json for speed? Or regular JSON?
to database (persistence in database records)
[ ] JSON: UTF8Json for speed? Or regular JSON?
Keep in mind that we're considering NOT having everything in memory, i.e. querying the database on demand! That would benefit from faster data I/O. Nevertheless, for diagnostic purposes it might be best to stay with a human-readable format such as JSON. Otherwise, see below for binary protocols (FlatBuffers, etc.). A minimal sketch of the UTF8Json vs. regular JSON routes follows right below.
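A minimal sketch of the two routes, assuming the neuecc Utf8Json package on the fast side and Json.NET as the 'regular' one:

```csharp
// Trade-off in code: Utf8Json writes UTF-8 bytes directly (fast, no
// intermediate string), while Json.NET gives the familiar human-friendly
// string path. Both produce plain JSON on disk either way.
using System.IO;
using Newtonsoft.Json;                              // "regular JSON"
using Utf8JsonSerializer = Utf8Json.JsonSerializer; // the speed option

static class Persist
{
    public static void SaveFast<T>(string path, T record)
    {
        byte[] utf8 = Utf8JsonSerializer.Serialize(record);   // no string detour
        File.WriteAllBytes(path, utf8);
    }

    public static void SaveReadable<T>(string path, T record)
    {
        string json = JsonConvert.SerializeObject(record, Formatting.Indented);
        File.WriteAllText(path, json);              // easy to eyeball / diff
    }
}
```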
between processes (NOT PERSISTED.)
This concerns processes where we have reasonable/full control of both ends: PDF processor, frontend (+ business logic layer? or do we have that one separate? If we use Chromely or electron, you're transferring data as messages between the backend layer (C#/node) and the browser/UI frontend anyway, so there is another interface comms layer, whichever way you turn it)
Processes where we DO NOT have full control (or don't want to patch one side to gain control) will generally speak JSON (or XML?): SOLR / ES
[ ] Google FlatBuffers (flatcc for C)
[-] ~msgpack~ (nah...)
[ ] FBE (as a faster flavor/deriv of FlatBuffers)
[ ] Ceras (just to keep in mind for slightly different purposes... Not really a candidate here.)
[ ] Bebop - while I'm an old fan of Cowboy Bebop too, this one's currently lacking a C/C++ target. If that can be tweaked from the JS target, this might be the coolest one yet, though certainly not as mainstream as FlatBuffers. Has C# and JS/TS targets, so that's covering the other 66% for me.
electron or NW.js (here's some reasons why I would ride that one rather than electron for something like Qiqqa -- shoot! browser crash & links lost! 😭 Anyway, older: http://my2iu.blogspot.com/2017/06/nwjs-vs-electron.html and google stats: https://trends.google.com/trends/explore?q=nwjs,Electron%20js,Chromely,nw.js) or Chromely?
[ ] electron (so mainstream. And yet... Why haven't I switched already? --> Because somewhere inside my brain something is going NO, but I'm not entirely sure why. Except probably the reflection on Brooks' Second System Syndrome that this would evoke at max power setting, perrr-haps? 🤡)
[ ] NW.js (better integration of the UI frontend into the backend layer, where a thin layer of Qiqqa business logic would reside. Promised faster UI updates that way, as it takes out one hefty comms interface.)
[ ] Chromely (no nodejs, backend layer is C#. Which is fine and might help me move, as the overall "business glue" logic wouldn't need to be rewritten from scratch. That might save a bundle.)
[ ] considering the new identifier hash: I have two PDFs colliding on SHA1 in the evil PDF corpus; since Qiqqa uses a 'stripped' version of SHA1 (bug since start of life), more collisions are probable. Need the same keying system for other files too, BTW.
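A minimal sketch of a replacement fingerprint that avoids the stripping bug; the SHA-256 choice here is illustrative, not a decision:

```csharp
// Full-length, non-stripped content fingerprint: hash the file bytes and
// hex-encode without dropping anything. Works for any file type, not just PDFs.
using System;
using System.IO;
using System.Security.Cryptography;

static class Fingerprint
{
    public static string Compute(string path)
    {
        using (var sha = SHA256.Create())
        using (var stream = File.OpenRead(path))
        {
            byte[] digest = sha.ComputeHash(stream);
            return BitConverter.ToString(digest).Replace("-", "");  // full-length hex
        }
    }
}
```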
[ ] `scandir()` tek for the Watch Directories: that's an optimized glob which supports `.gitignore`, etc. in there!