Closed by The-Compiler 8 years ago
This is proprietary software. I'm not willing to include support for proprietary tools. The main reason is philosophical, but there is also a technical reason: the free trial version has a limit of 100 OCR runs. With such a limit, I cannot write automated tests to prevent regressions in PyOCR. And I'm not going to pay for something I don't need.
Sorry.
I understand - if I were the maintainer, hell, I'd probably say the same in a similar situation.
I'm guessing your answer won't change if I ask ABBYY if they're willing to provide a license (or pay for one for PyOCR myself) and submit a PR? I'm really not happy with running proprietary software when I can avoid it either, but with a scanned A4 page with reasonably clear printed text, Tesseract still doesn't even manage to notice any text on the page except the title, so that's really unusable for me... :worried:
That wouldn't change the problem much: most of the time, I'm the only one working on PyOCR, but sometimes I get contributions. Those contributors may need to be able to run the tests too (all of them).
So I tried using Paperwork and was not really satisfied with the results. It looks like Tesseract works as badly with my documents as it did some years ago when I last tried...
I found ABBYY OCR for Linux to work much better (at least for my documents), but I found the tooling around it lacking, so I haven't bought it so far (I only played with the trial).
What do you think about integration of that into PyOCR? It seems to have an XML export with character box information, so I think that should work.
If you agree with the idea, I might contribute one day - but I'm currently very busy with my own projects, so that'll probably take a few months.
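To make the XML export idea a bit more concrete, here is a rough sketch of how per-character boxes could be pulled out of such a file. Everything about the format is an assumption on my part: the sample document, the `charParams` element name, and the `l`/`t`/`r`/`b` attributes are placeholders that would need to be checked against real ABBYY output, and `parse_char_boxes` is a hypothetical helper, not part of PyOCR. The result uses plain tuples shaped like ((left, top), (right, bottom)) positions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample export. The element and attribute names below are
# assumptions for illustration, not verified against real ABBYY output.
SAMPLE_XML = """<document>
  <page width="200" height="100">
    <line>
      <charParams l="10" t="20" r="18" b="40">H</charParams>
      <charParams l="20" t="20" r="28" b="40">i</charParams>
    </line>
  </page>
</document>"""


def parse_char_boxes(xml_text):
    """Return a list of (char, ((left, top), (right, bottom))) tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for elem in root.iter():
        # Drop an XML namespace prefix such as "{uri}charParams", if present.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag != "charParams":
            continue
        left, top, right, bottom = (
            int(elem.get(key)) for key in ("l", "t", "r", "b")
        )
        boxes.append((elem.text or "", ((left, top), (right, bottom))))
    return boxes


if __name__ == "__main__":
    for char, position in parse_char_boxes(SAMPLE_XML):
        print(repr(char), position)
```

A real integration would still have to run the ABBYY command-line tool to produce the XML and group characters into words and lines, which this sketch does not attempt.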