openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab
https://gitlab.gnome.org/World/OpenPaperwork/pyocr

ABBYY OCR support? #36

Closed The-Compiler closed 8 years ago

The-Compiler commented 8 years ago

So I tried using Paperwork and was not really satisfied with the results. It looks like Tesseract works as badly with my documents as it did some years ago when I last tried...

I found ABBYY OCR for Linux to work much better (at least for my documents), but I found the tooling around it to be lacking, so I haven't bought it so far (though I have played with the trial).

What do you think about integrating it into PyOCR? It seems to have an XML export with character box information, so I think that should work.
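To illustrate the idea, here is a rough sketch of how per-character boxes from an XML export could be merged into word boxes. The XML layout below (the `line` and `char` elements and the `l`/`t`/`r`/`b` attributes) is a made-up stand-in and not ABBYY's actual schema, and `WordBox` and `words_from_char_xml` are hypothetical names rather than anything from PyOCR.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical sample export. Element and attribute names are assumptions
# chosen for illustration, not ABBYY's documented format.
SAMPLE_XML = """
<page>
  <line>
    <char l="10" t="20" r="18" b="40">H</char>
    <char l="19" t="20" r="26" b="40">i</char>
    <char l="27" t="20" r="34" b="40"> </char>
    <char l="35" t="22" r="44" b="41">y</char>
    <char l="45" t="20" r="53" b="40">o</char>
  </line>
</page>
"""


@dataclass
class WordBox:
    """A word and its bounding box as ((left, top), (right, bottom))."""
    text: str
    position: tuple


def _merge(chars):
    """Merge a list of (char, l, t, r, b) tuples into one WordBox."""
    text = "".join(c[0] for c in chars)
    left = min(c[1] for c in chars)
    top = min(c[2] for c in chars)
    right = max(c[3] for c in chars)
    bottom = max(c[4] for c in chars)
    return WordBox(text, ((left, top), (right, bottom)))


def words_from_char_xml(xml_text):
    """Group per-character boxes into word boxes, splitting on whitespace."""
    root = ET.fromstring(xml_text.strip())
    words = []
    for line in root.iter("line"):
        pending = []  # characters of the word currently being built
        for el in line.iter("char"):
            ch = el.text or ""
            if ch.strip() == "":
                # Whitespace ends the current word, if there is one.
                if pending:
                    words.append(_merge(pending))
                    pending = []
                continue
            pending.append(
                (ch, int(el.get("l")), int(el.get("t")),
                 int(el.get("r")), int(el.get("b")))
            )
        # A line break also ends the current word.
        if pending:
            words.append(_merge(pending))
    return words


if __name__ == "__main__":
    for word in words_from_char_xml(SAMPLE_XML):
        print(word.text, word.position)
```

A real integration would have to follow whatever schema the tool actually emits and handle things this sketch ignores, such as confidence values or multi-page output.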

If you agree with the idea, I might contribute one day - but I'm currently very busy with my own projects, so that'll probably take a few months.

jflesch commented 8 years ago

This is proprietary software. I'm not willing to include support for proprietary tools. The main reason is philosophical, but there is also a technical reason: the free trial version has a limit of 100 OCR runs. With such a limit, I cannot write automated tests to prevent regressions in PyOCR. And I'm not going to pay for something I don't need.

Sorry.

The-Compiler commented 8 years ago

I understand - if I were the maintainer, hell, I'd probably say the same in a similar situation.

I'm guessing your answer won't change if I ask ABBYY if they're willing to provide a license (or pay for one for PyOCR myself) and submit a PR? I'm really not happy with running proprietary software when I can avoid it either, but with a scanned A4 page with reasonably clear printed text, Tesseract still doesn't even manage to notice any text on the page except the title, so that's really unusable for me... :worried:

jflesch commented 8 years ago

It wouldn't change the problem much: most of the time, I'm the only one working on PyOCR, but sometimes I get contributions. Those contributors may need to be able to run the tests too (all of them).