poke1024 / origami

A suite of batches and tools for OCR tasks.

modularity #20

Open bertsky opened 2 years ago

bertsky commented 2 years ago

Dear @poke1024, may I enquire about your thoughts on how best to achieve/approach any or all of the following with Origami:

  1. packaging. Currently, all modules have to be loaded from the source directory by the interpreter in order to run them. With a distutils/setuptools packaging, they could go into the site/venv, and even be shipped as wheels. The greatest difficulty is of course with dependencies (unless using conda).
  2. in-memory processing. The current batch pipeline uses zip files, which is slow and makes scaling and parallelization harder. Is there a (simple) way to replace file paths with in-memory objects (PIL.Image, numpy.ndarray label masks or polygons for segmentation, etc.)?
  3. modularization. IIUC, to get exportable results right now you have to run the batch pipeline from beginning to end (i.e. each step in batch.detect. up to batch.detect.compose, and the batch modules contain lots of non-trivial steps outside of core.). But what if I want to combine Origami with a different, external cropper/deskewer/binarizer/textline-segmenter/recognizer?

As you know, I want to build an OCR-D wrapper for Origami – for which (to do it right) I probably need most of the above. But I certainly don't want to require effort on your side, or to risk having to rewrite everything every now and then as your code base evolves.

My first (unfinished) attempt was too ambitious for sure: https://github.com/bertsky/ocrd_origami/blob/master/ocrd_origami/segment.py

(I'll rethink how I can live without (2) but still achieve some of (3) – your advice would be much appreciated.)

poke1024 commented 2 years ago

The code base of Origami is very static at this point; I don't see any changes in the foreseeable future. Having an OCR-D implementation would make the OCR-D version the de facto version of Origami.

The first attempt in segment.py takes exactly the right approach, IMHO. Instead of wrapping Origami's Processor and Origami's own artifacts logic, it seems saner to do what you have done, i.e. focus on each processor in turn and build a new processor logic around it.

To answer (2): the file-based approach is not simple to replace. The easiest and best option is the one you have chosen: start at a processor class and rewrap the items that come into the process method as arguments.
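The "rewrap at the process method" idea could be sketched roughly as follows. `InMemoryAdapter` and the processor's `process` signature are hypothetical stand-ins, not Origami's real API; the point is only that in-memory objects replace the zip-file artifacts at the call boundary:

```python
# Hedged sketch: hand a (hypothetical) Origami processor in-memory objects
# directly, instead of letting it read/write zip-file artifacts. All class
# and method names here are invented for illustration.
import numpy as np
from PIL import Image

class InMemoryAdapter:
    """Wraps a processor so it accepts a PIL image and returns plain
    Python objects, bypassing artifact (de)serialization."""

    def __init__(self, processor):
        self._processor = processor

    def __call__(self, image: Image.Image) -> dict:
        # convert to the grayscale array form the wrapped step is assumed
        # to expect
        page = np.asarray(image.convert("L"))
        regions = self._processor.process(page)
        # return polygons as plain lists of (x, y) tuples so callers need
        # no Origami-specific types
        return {"regions": [[tuple(p) for p in r] for r in regions]}

# usage with a stub standing in for a real segmentation step:
class _StubProcessor:
    def process(self, page):
        h, w = page.shape
        # one region covering the full page
        return [np.array([(0, 0), (w, 0), (w, h), (0, h)])]

adapter = InMemoryAdapter(_StubProcessor())
result = adapter(Image.new("L", (100, 50)))
```

The adapter boundary is the only place that knows about both worlds, so an OCR-D wrapper (or any other caller) would never touch the file layout.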

There is no simple answer to (3). There have to be wrappers for importing and exporting Origami's internal formats (mostly JSON) to some external format. The most difficult aspect is probably that stages later in the pipeline need various inputs from earlier stages.
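Such an import/export wrapper might look like the sketch below. The input schema shown is invented for illustration and does not reproduce Origami's actual JSON layout; the shape of the problem is just "read the internal record, emit a neutral structure an external pipeline can consume":

```python
# Sketch of an export wrapper: read an Origami-style JSON segmentation
# record (schema invented for illustration) and flatten it into a neutral
# list of labeled polygons.
import json

def export_regions(origami_json: str) -> list:
    """Return [{"label": ..., "points": [(x, y), ...]}, ...]."""
    data = json.loads(origami_json)
    out = []
    for region in data.get("regions", []):
        out.append({
            "label": region.get("type", "unknown"),
            "points": [tuple(p) for p in region["polygon"]],
        })
    return out

# usage with a fabricated record:
sample = json.dumps({
    "regions": [
        {"type": "text", "polygon": [[0, 0], [10, 0], [10, 5], [0, 5]]}
    ]
})
exported = export_regions(sample)
```

The harder part mentioned above (later stages needing inputs from earlier stages) means each wrapper would also have to thread those intermediate records through, rather than converting a single stage's output in isolation.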

I think we should set up a video call to go over the main issues.