Open jlyon opened 9 years ago
Sorry about the delay, this is getting into an area I haven't played with much yet.
Overall though, this seems like a good approach.
I'd be excited to see this happen to get the preprocessor stuff a little more fleshed out and documented.
Great. I'm heading out on vacation for a few weeks, but I'll start taking a look when I return.
+1
+1 as well.. this would be really nice to have
I'd be interested in seeing this too; I wrote something similar in Django a while back (https://github.com/ddohler/webocr). One minor subtlety is that some PDFs contain embedded images and text on the same page. As I recall, my solution was to rasterize the entire page and then OCR the resulting image, but there are definitely other ways to attack the problem. Next time I have a spare moment I'm planning to try replacing my home-grown Tesseract setup with open-ocr, so I'll hopefully be in a position where I can make occasional contributions.
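For reference, the rasterize-then-OCR approach mentioned above could look roughly like the sketch below. It is not the code from webocr or open-ocr. It assumes the `pdftoppm` (Poppler) and `tesseract` command-line tools are installed and on the PATH, and the function names `rasterize_cmd`, `ocr_cmd`, and `ocr_pdf`, along with the 300 DPI default, are made up for illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def rasterize_cmd(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm command that renders every page of the PDF to PNG."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_cmd(image_path, lang="eng"):
    """Build the tesseract command that prints recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def page_number(image_path):
    """Extract the page number from a pdftoppm output name such as page-03.png."""
    return int(Path(image_path).stem.rsplit("-", 1)[1])


def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """Rasterize each page, OCR the images, and return one string per page."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        # Rendering the whole page captures embedded images and text alike.
        subprocess.run(rasterize_cmd(pdf_path, prefix, dpi), check=True)
        pages = []
        for image in sorted(Path(tmp).glob("page-*.png"), key=page_number):
            result = subprocess.run(
                ocr_cmd(image, lang), check=True, capture_output=True, text=True
            )
            pages.append(result.stdout)
        return pages
```

Calling `ocr_pdf("scan.pdf")` would return a list with the recognized text of each page in order. The trade-off is that any text layer already present in the PDF is ignored and re-recognized from pixels, which is simpler but can be slower and less accurate than extracting that text directly.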
@ddohler looking forward to seeing that!
Related to #17.
I wrote a bash script a while ago (https://github.com/jlyon/ocr-anything) that would analyze the mimetype headers on the document and:
I would like to try to integrate the efforts with open-ocr. What do you think is the best way to go about this? This is my thinking:
Does this seem right?