openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab
https://gitlab.gnome.org/World/OpenPaperwork/pyocr

Expose load_system_dawg for non-textual output #61

Open awiebe opened 7 years ago

awiebe commented 7 years ago

--load_system_dawg 0 would be helpful as an argument to image_to_string, perhaps passed in an options dictionary. Feel free to call it something that makes it language-agnostic.

jflesch commented 7 years ago

You can simply create a builder object yourself. Have a look at https://github.com/jflesch/pyocr/blob/master/src/pyocr/tesseract.py#L57 for an example. Basically, you just need to inherit from BaseBuilder, define tess_conf = ["--load_system_dawg", "0"] and file_ext = ['the_file_extension_that_tesseract_will_use'], and implement the methods read_file() and write_file().

If you implement such a builder, feel free to send a pull request to include it in src/pyocr/tesseract.py.
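
For readers following along, here is a minimal sketch of the kind of builder described above. It is not pyocr's own code. The class name is hypothetical, and the attribute names `file_extensions` and `tesseract_configs` are an assumption about how pyocr's built-in builders spell what the comment abbreviates as file_ext and tess_conf, so check them against your installed version. The class is written standalone so that it runs without pyocr installed; in real use you would inherit from the base builder as suggested.

```python
import io


class NoSystemDawgTextBuilder:
    """Hypothetical builder: plain-text output, system dictionary disabled."""

    # Assumed attribute names (the comment above calls them file_ext and
    # tess_conf); verify against the pyocr version you have installed.
    file_extensions = ["txt"]
    # Values copied from the comment above. Tesseract's CLI also documents
    # `-c VAR=VALUE` for config variables, so ["-c", "load_system_dawg=0"]
    # may be the form your Tesseract version expects.
    tesseract_configs = ["--load_system_dawg", "0"]

    @staticmethod
    def read_file(file_descriptor):
        """Read the OCR engine's output file and return the recognized text."""
        return file_descriptor.read().strip()

    @staticmethod
    def write_file(file_descriptor, text):
        """Write previously recognized text to a file-like object."""
        file_descriptor.write(text)


# Exercise the file handling with in-memory streams; Tesseract is not run.
builder = NoSystemDawgTextBuilder()
recognized = builder.read_file(io.StringIO("  A1B2-C3\n"))
out = io.StringIO()
builder.write_file(out, recognized)
```

An instance of such a class would then be passed as the `builder` argument of the tool's `image_to_string()`.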

outkaj commented 7 years ago

I implemented a similar builder here, if it's helpful. In my case, I needed a modification to WordBoxBuilder with dictionary-related parameters set to false.

This is a work in progress, since I may modify the parameters further. Once it's complete, I'm happy to submit the builder as a pull request.

gamykla commented 7 years ago

It would be great if you could just override tess_conf without having to extend the base builder.
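
One way this might already be possible is to set the configuration on a builder instance rather than on a subclass. The sketch below is only an illustration under assumptions: `TextBuilderStandIn` is a hypothetical stand-in for an existing pyocr builder (so the example runs without pyocr), `with_extra_configs` is a hypothetical helper, and `tesseract_configs` is the assumed real name of the attribute called tess_conf in this thread.

```python
class TextBuilderStandIn:
    """Hypothetical stand-in for an existing pyocr builder."""

    file_extensions = ["txt"]
    tesseract_configs = []  # assumed attribute name for "tess_conf"


def with_extra_configs(builder, extra):
    """Give the builder its own config list with extra arguments appended.

    Assigning a fresh list on the instance avoids mutating a list that may
    be shared by the class, and therefore by every other builder instance.
    """
    builder.tesseract_configs = list(builder.tesseract_configs) + list(extra)
    return builder


# Values taken from the original request; no subclass is defined.
builder = with_extra_configs(TextBuilderStandIn(), ["--load_system_dawg", "0"])
```

Whether pyocr reads the attribute from the instance at call time is something to verify against the installed version before relying on this.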