Tools suite (umbrella issue)

cole-miller commented 3 years ago

DjVuLibre comes with a bunch of command-line tools that perform encoding/decoding, dump document structure in human-readable or XML formats, extract chunks, etc. Replicating these will be a great way to exercise the APIs and get something useful out the door.

cole-miller commented 3 years ago

None of these require more than sndjvu_codec::bzz:

[ ] BZZ encoding/decoding tool [blocked on #4]
[ ] DjVu dump tool (human-readable and JSON output) [blocked on #8]
- DjVuLibre does XML output. Do I want to implement that? And if so, should it match the DjVuLibre format or be designed de novo?
[ ] a tool for indexing/slicing/concatenating multipage DjVu documents
[ ] ~~a tool for extracting text from DjVu documents~~ generic tool for extracting the contents of a specific chunk in a document

Will also need some example documents to test-drive these…

cole-miller commented 3 years ago

I've been thinking about the design of a sndjvu-extract tool. Something like

$ sndjvu-extract -S SELECTOR <input.djvu

where SELECTOR specifies a single chunk using a syntax like file_id/chunk_id#index. With some additional options to deal with decoding, etc.
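
For illustration, here is a minimal sketch of how a selector of the form file_id/chunk_id#index might be parsed. Nothing here is taken from the sndjvu code: the `Selector` type, the `parse_selector` function, and the error variants are hypothetical names. The sketch also assumes that `#index` is optional and defaults to 0, and that chunk IDs are four ASCII characters as in IFF containers.

```rust
/// Hypothetical parsed form of a selector such as `file_id/chunk_id#index`.
#[derive(Debug, PartialEq, Eq)]
struct Selector {
    file_id: String,
    chunk_id: String,
    index: usize,
}

/// Hypothetical error type for malformed selectors.
#[derive(Debug, PartialEq, Eq)]
enum SelectorError {
    MissingSlash,
    EmptyFileId,
    BadChunkId,
    BadIndex,
}

/// Parse `file_id/chunk_id#index`.
/// Assumption: a missing `#index` suffix means index 0.
fn parse_selector(s: &str) -> Result<Selector, SelectorError> {
    // Everything before the first '/' names the component file.
    let (file_id, rest) = s.split_once('/').ok_or(SelectorError::MissingSlash)?;
    if file_id.is_empty() {
        return Err(SelectorError::EmptyFileId);
    }

    // The optional '#' suffix picks the n-th chunk with that ID.
    let (chunk_id, index) = match rest.split_once('#') {
        Some((id, idx)) => (
            id,
            idx.parse::<usize>().map_err(|_| SelectorError::BadIndex)?,
        ),
        None => (rest, 0),
    };

    // Assumption: chunk IDs are exactly four ASCII characters.
    if chunk_id.len() != 4 || !chunk_id.is_ascii() {
        return Err(SelectorError::BadChunkId);
    }

    Ok(Selector {
        file_id: file_id.to_string(),
        chunk_id: chunk_id.to_string(),
        index,
    })
}
```

Under these assumptions, `parse_selector("p0001.djvu/TXTz#1")` would select the second TXTz chunk of the component file p0001.djvu. Whether the index should be zero-based, and whether file IDs may themselves contain a slash, are open design questions that this sketch does not settle.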

cole-miller commented 2 years ago

There's actually some code in the sndjvu_toolkit crate now, hooray! The idea is to have one binary that does argv[0] dispatch to determine what tool to run. (Eventually you'll be able to compile a binary with only the subset of tools you care about, controlled by features.) Tools that have at least a little code are sndjvu-bzz and sndjvu-dump. I'd like the second of these to support plain (like djvudump), XML (like djvuxml), and JSON output, and S-expressions would be nice too :).
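
A rough sketch of what argv[0] dispatch could look like, using only the standard library. The `Tool` enum and `tool_from_argv0` function are hypothetical and are not the actual sndjvu_toolkit code; only the tool names sndjvu-bzz and sndjvu-dump come from the comment above.

```rust
use std::path::Path;

/// Hypothetical list of tools bundled into the single binary.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Tool {
    Bzz,
    Dump,
}

/// Pick a tool based on the name the binary was invoked under.
/// `file_stem` drops any leading directories and a trailing extension
/// such as `.exe`, so both `/usr/bin/sndjvu-dump` and `sndjvu-dump.exe` match.
fn tool_from_argv0(argv0: &str) -> Option<Tool> {
    let name = Path::new(argv0).file_stem()?.to_str()?;
    match name {
        // With per-tool Cargo features, each arm could carry an attribute
        // like #[cfg(feature = "bzz")]. That feature name is an assumption.
        "sndjvu-bzz" => Some(Tool::Bzz),
        "sndjvu-dump" => Some(Tool::Dump),
        _ => None,
    }
}
```

In `main`, the first item of `std::env::args()` would be passed to `tool_from_argv0`, and each tool would then be installed as a symlink or hard link to the one binary. A `None` result could fall back to treating the first real argument as the tool name, though that fallback is a design choice not discussed in this thread.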

I wrote a working prototype of sndjvu-dump ("plain" output only) that printed its output line by line while Visiting the document. This has a couple of nice properties: if there's a parse error you still see all the preceding lines of output, and you can re-use the same BZZ output buffer for almost all the decoding (except you need a separate buffer for the DIRM stuff). But for the more "structured" output it seems clear that we need to parse the document completely into a proper data structure (sndjvu::simple_document::Document) and then walk that, instead. Maybe the original, eager sndjvu-dump will come back as a separate tool -- could be useful.
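
To make the "parse fully, then walk" approach concrete, here is a small sketch in which two output formats are rendered from the same in-memory tree. The `Node` type is a hypothetical stand-in and makes no claim about the real shape of sndjvu::simple_document::Document. The plain format shown is only loosely modelled on indented dump output and is not djvudump's exact format.

```rust
/// Hypothetical stand-in for a fully parsed chunk tree.
struct Node {
    id: String,
    len: usize,
    children: Vec<Node>,
}

/// Indented, human-readable rendering: one chunk per line.
fn render_plain(node: &Node, depth: usize, out: &mut String) {
    out.push_str(&"  ".repeat(depth));
    out.push_str(&format!("{} [{}]\n", node.id, node.len));
    for child in &node.children {
        render_plain(child, depth + 1, out);
    }
}

/// S-expression rendering of the same tree: `(ID LEN child...)`.
fn render_sexpr(node: &Node, out: &mut String) {
    out.push_str(&format!("({} {}", node.id, node.len));
    for child in &node.children {
        out.push(' ');
        render_sexpr(child, out);
    }
    // The closing paren can only be written after all children are
    // rendered, which is easy when the whole tree is already in memory.
    out.push(')');
}
```

Because both renderers take the same `&Node`, adding JSON or XML output would mean adding one more function of this shape rather than another parsing pass. The cost is the one noted above: a parse error partway through the document yields no output at all, whereas the eager line-by-line prototype would still have printed everything before the error.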

