Books Preview - Githubissues

holtzermann17 commented 11 years ago

Here is an example of a preview for the book project.

Summary

Lifecycle stage: 1 --proto--[ 2 ] ->- 3 --evolving-- 4 ->- 5 --complete-- 6 ->- 7 --mature-- 8

Sources: (a) Retrodigitization, (b) importing content from other CC-By-SA or more liberally licensed sources, (c) re-using internal content.

In more detail:

(a) Library of Congress + US Copyright office + Archive.org + Infty + some manual labour (we've successfully run this workflow with our Calculus book).
(b) Enhancing material from websites such as Stackexchange and Wikipedia with cataloging, links using NNexus, and the like.
(c) Reusing existing material from the PlanetMath encyclopedia in collections (e.g. exploiting NNexus or other automated tools to assist with content assembly).

We have done a lot of background research on this. In order to get things moving and progress to Stage 3, we need some hands-on-the-keyboard time (mathematics background helpful throughout, programming experience helpful for b and c). By default, this will evolve slowly as we assemble new courses and make further experiments with NNexus. However, an influx of volunteer time (or funding) could make things progress more rapidly.

MOCK-UP/DEMO

Put in some artistic impression of what the books section might look like.

DETAILED DESCRIPTION

The purpose of the book project is to make mathematical books in the public domain accessible to the general public in the form of a collaborative digital library. To accomplish this goal, we plan to design and build a system composed of three interoperating components.

The first subsystem is a retrodigitization toolchain. When complete, this system will allow one to start with a physical book on a library shelf, scan it into a computer, then subject the result to a series of processing steps which result in a TeX representation of the book's contents. While the software for this already exists and has been tested, there is room for improvement; by introducing image preprocessing, clustering, and postprocessing, one should be able to significantly improve the accuracy of the process. Given that proofreading and correcting errors is a labor-intensive process, the labor saved by improving the OCR process justifies the effort.

Since, even with these improvements, this process is not 100% accurate, we need the next component, which is an editorial workflow. Based upon the CBPP approach which has been in use for the last decade to produce the PM encyclopaedia and inspired by predecessors such as the St. Pachomius Library and Project Gutenberg's Distributed Proofreaders, this system will coordinate the proofreading of mathematical works by members of the PM community. To participate, a member would start at a page which lists the various works which have been processed but not yet proofread. Upon picking a work, the member would be assigned a page. To work on the page, there would be a webpage which displays the original text, the computer output from the OCR suite, and the rendering of that output. The proofreader's job is to ensure that the rendered output agrees with the original text and, if not, to edit the output as appropriate. Once this is done, an editor will double-check the result and, once all pages have been satisfactorily edited, the system will collect the results and collate them into a hypertext edition.
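The page lifecycle described above can be sketched as a small state machine. This is only an illustration under stated assumptions: the state names, the `Work` and `Page` classes, and the rule that a rejected page returns to the unproofed pool are all hypothetical and are not taken from the Noosphere or Planetary code.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class PageState(Enum):
    UNPROOFED = "unproofed"  # OCR output exists, nobody has checked it yet
    ASSIGNED = "assigned"    # a member is currently proofreading the page
    PROOFED = "proofed"      # the proofreader has submitted corrections
    APPROVED = "approved"    # an editor has double-checked the page


@dataclass
class Page:
    number: int
    tex: str  # TeX for this page, initially the raw OCR output
    state: PageState = PageState.UNPROOFED
    proofreader: Optional[str] = None


@dataclass
class Work:
    title: str
    pages: List[Page] = field(default_factory=list)

    def assign_next_page(self, member: str) -> Optional[Page]:
        """Hand the first unproofed page to a member, or None if none remain."""
        for page in self.pages:
            if page.state is PageState.UNPROOFED:
                page.state = PageState.ASSIGNED
                page.proofreader = member
                return page
        return None

    def submit(self, page: Page, corrected_tex: str) -> None:
        """Record the proofreader's corrected TeX for an assigned page."""
        if page.state is not PageState.ASSIGNED:
            raise ValueError("page is not assigned to a proofreader")
        page.tex = corrected_tex
        page.state = PageState.PROOFED

    def review(self, page: Page, accepted: bool) -> None:
        """Editor's double-check: approve, or send back to the unproofed pool."""
        if page.state is not PageState.PROOFED:
            raise ValueError("page has not been proofread yet")
        if accepted:
            page.state = PageState.APPROVED
        else:
            page.state = PageState.UNPROOFED
            page.proofreader = None

    def collate(self) -> str:
        """Join all pages in order once every page has been approved."""
        if any(p.state is not PageState.APPROVED for p in self.pages):
            raise ValueError("not all pages have been approved")
        ordered = sorted(self.pages, key=lambda p: p.number)
        return "\n".join(p.tex for p in ordered)
```

Keeping the editor's check as a separate transition means `collate` can refuse to build a hypertext edition until every page has passed both the proofreader and the editor.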

The third and final component is a reading room which makes the results available to the reading public. To locate books, there will be a catalogue, search facility, and recommender. Once one has located a book, one can read it in several forms. The primary form is hypertext enhanced with links to the encyclopaedia, cross links to other books, notes, reviews, problem solutions, and the like. There will also be files of the book available for downloading and viewing on an e-book reader or printing out. In line with the philosophy of library as a social space, there will be plenty of opportunities for readers to interact with the text and each other by making notes, reviewing books, and participating in discussions.

In addition to these three components, there will also be an area for supporting the project and the PlanetMath organization by sponsoring books and purchasing hard copies.

ROADMAP

Install OCR program and process a first book.
Conduct preliminary research on OCR techniques.
Collect suite of samples for OCR evaluation.
Examine effects of preprocessing strategies.
Write utility to extract graphic images of individual characters from scans according to XML OCR output.
Compare effects of different feature vectors, metrics, averaging techniques, and clustering algorithms.
Determine statistical distributions of features and metrics and develop statistical models of identification.
Study how to feed output of clustering and averaging back into training.
Study distributions on lines and techniques for isolating characters and combining fragments of characters.
Study postprocessing techniques.
Study how to convert the "visual" TeX markup produced by Infty to more "semantic" TeX markup.
Develop techniques for automatically extracting structure and metadata for books.
Research techniques for combining symbols into equations such as, say, hierarchical clustering.
Figure out how to combine the various programs and techniques into a toolchain so as to maximize correctness.
Improvise proofreading of first few books using the existing facility for editing encyclopaedia entries.
Compare different strategies for presenting text to be proofread and highlighting questionable identifications.
Revise 2005 specification from Noosphere to Planetary.
Implement proper proofreading facility.
Implement facility for keeping track of books and editorial workflow.
Implement facility for outputting completed books.
Test and document facilities for proofreading books.
Collect and write converters to produce versions of books in various file formats.
Present the first few books using collections facility.
Enter in math books from Project Gutenberg.
Incorporate books into indexing and search.
Study and compare algorithms for recommending books.
Implement reading room.
Test and document the reading room.
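As a rough illustration of the roadmap items on feature vectors, metrics, averaging, and clustering, the sketch below groups character bitmaps by similarity. Everything in it is an assumption made for the example: glyphs are given as rows of `#` and `.` characters, the features are ink densities on a coarse grid, the metric is Euclidean distance, and the grouping is simple leader (threshold) clustering with running-mean centroids. It is not the method used by Infty or chosen by the project, only one candidate among those the roadmap proposes to compare.

```python
import math
from typing import List, Sequence

Vector = List[float]


def zone_features(bitmap: Sequence[str], zones: int = 2) -> Vector:
    """Ink density in each cell of a zones x zones grid ('#' marks ink)."""
    height, width = len(bitmap), len(bitmap[0])
    ink = [0] * (zones * zones)
    area = [0] * (zones * zones)
    for y, row in enumerate(bitmap):
        for x, pixel in enumerate(row):
            cell = (y * zones // height) * zones + (x * zones // width)
            area[cell] += 1
            if pixel == "#":
                ink[cell] += 1
    return [i / a if a else 0.0 for i, a in zip(ink, area)]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def leader_cluster(vectors: List[Vector], threshold: float) -> List[int]:
    """Assign each vector to the nearest centroid within threshold,
    otherwise start a new cluster. Centroids are running means."""
    centroids: List[Vector] = []
    sizes: List[int] = []
    labels: List[int] = []
    for v in vectors:
        best, best_dist = -1, threshold
        for i, c in enumerate(centroids):
            d = euclidean(v, c)
            if d <= best_dist:
                best, best_dist = i, d
        if best < 0:
            centroids.append(list(v))
            sizes.append(1)
            best = len(centroids) - 1
        else:
            sizes[best] += 1
            n = sizes[best]
            # Incremental update of the mean ("averaging" step).
            centroids[best] = [c + (x - c) / n for c, x in zip(centroids[best], v)]
        labels.append(best)
    return labels


# Three toy 4x4 glyphs: a vertical stroke, a noisy copy of it, and a horizontal stroke.
vertical = ["#...", "#...", "#...", "#..."]
vertical_noisy = ["#...", "#...", "#...", "##.."]
horizontal = ["####", "....", "....", "...."]

features = [zone_features(g) for g in (vertical, vertical_noisy, horizontal)]
labels = leader_cluster(features, threshold=0.3)
```

Here the two vertical strokes end up in one cluster and the horizontal stroke in another. Swapping in a different feature extractor or metric only requires replacing `zone_features` or `euclidean`, which is the kind of comparison the roadmap calls for.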

HOW TO HELP

If you're a philanthropist, your donations will help move the research and development process along:

$1000 will purchase an InftyOCR license.
$2000 will purchase a high-end computer for OCR and related processing.
$5000 will pay for an OCR research assistant.
$10000 will pay to implement the books section on PM.

If you're a Drupal dude, you can help implement the proofreading facilities and reading room.

If you're a script kiddie, you can help us build our toolchain.

If you're into statistics, you can help us with identifying characters by clustering.

If you're a proofreader, you can help us prepare the first few texts.

ACKNOWLEDGEMENTS

Thank people who have helped with the initial steps in the roadmap.

holtzermann17 commented 11 years ago

My first comment is that the detailed roadmap is great! This preview seems ready to be converted into its own set of issues in an issue tracker, with work commencing whenever we're ready for that. Indeed, some indication of the progress made along the roadmap would help contributors get motivated about getting involved.

This is also quite reminiscent of the Seed Projects that we've written about in the Free Technology Guild project, see this page. The FTG seed projects use a slightly different but almost analogous template. Again, I think the fact that you've already broken the roadmap down to detailed do-able steps is a big advantage, and I'd suggest that other Seed Projects use this as a model.

To conclude: I would see the PM Previews series as being parallel to the FTG's incubator function. If we can make the other high-level summaries I advanced in #34 into similarly-detailed outlines, I think we'll have a very nice map for ourselves and any others who would like to join.

Thanks very much for contributing the model seed project @rspuzio!

holtzermann17 commented 11 years ago

Some issues directly related to books: https://github.com/KWARC/planetary/issues/340, https://github.com/KWARC/planetary/issues/332, https://github.com/KWARC/planetary/issues/336, https://github.com/KWARC/planetary/issues/341

One issue more related to collections: https://github.com/KWARC/planetary/issues/216

General improvements related to Git integration and a build system would probably be useful here: https://github.com/KWARC/planetary/issues/68, https://github.com/KWARC/planetary/issues/67

Then there are a bunch of OCR- and proofreading-related issues that we need to outline (some of that may also be relevant to Planetary, but other bits should go elsewhere).

holtzermann17 / planetmath-docs

Books Preview #37
