openpaperwork / pyocr

A Python wrapper for Tesseract and Cuneiform -- Moved to Gnome's Gitlab
https://gitlab.gnome.org/World/OpenPaperwork/pyocr
930 stars 152 forks source link

File io optimization #20

Closed ventsyv closed 8 years ago

ventsyv commented 9 years ago

Memory pipes are fifo structures which for the most part behave like normal files but are much faster due to the fact that the data is stored in memory and not on the disk. The two main differences are that pipes do not support seek operations, and that they block the thread. To get around the blocking issue, after the two pipes are created, I start a thread that then starts the OCR engine process. This worker thread will stay blocked waiting for input until the main thread opens the pipes and writes the data. After the main thread writes the input data, it waits for the OCR engine's thread to terminate before reading the pipe. That's done to ensure that all the OCR output has been written to the pipe.

Unfortunately, Tesseract does not support stdin and the PIL.image.save function requires seek operation, thus it can't use the pipe. I tried manually saving the pixels to the pipe, but I'm guessing I'm missing some header because Tesseract thinks the file is invalid. For the time being, I put back the file IO.

jflesch commented 9 years ago

Looks great. However right now I don't have time to review it correctly. I'll review it Sunday (I can't do it before, sorry).

ventsyv commented 9 years ago

No worries, took me a month to write, it can wait a few days :-). Can you please run the unit tests from the master branch and this one so we can compare how long it takes for the tests to finish? I meant to do it, but forgot.

jflesch commented 9 years ago

Unfortunately, it seems the tests on your branch remain stuck. Here is the stacktrace I get when I do Ctrl-C:

(...)
File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
    testMethod()
  File "/home/jflesch/git/pyocr_ventsyv/tests/tests_tesseract.py", line 219, in test_basic
    self.__test_txt('test.png', 'test.words')
  File "/home/jflesch/git/pyocr_ventsyv/tests/tests_tesseract.py", line 201, in __test_txt
    builder=self.builder)
  File "src/pyocr/tesseract.py", line 241, in image_to_string
    with codecs.open(output_file_full_name, 'r', encoding='utf-8', errors='replace') as outFile_desc:
  File "/usr/lib64/python2.7/codecs.py", line 881, in open
    file = __builtin__.open(filename, mode, buffering)
KeyboardInterrupt
jflesch commented 9 years ago

From what I can tell, your changes regarding Cuneiform seem to work fine. However those regarding Tesseract don't work. So here what I suggest you do :

And everything should be good.

Other than that, nice work :)

By the way, sorry if some of my comments (or their number) may seem rude. I'm just trying to keep the code as good as possible.

ventsyv commented 9 years ago

I did not realize you were adhering to a coding standard. I'll run it through the formatting tool. Did you run the unit tests before making any changes? The operations on the pipes need to occur in a particular order, otherwise the thread will block. I'll take a look in a minute, but the code I checked in was passing all tests.

ventsyv commented 9 years ago

Ok, I made the necessary pep8 changes and checked them in. I ran the tests and they are all passing. Could you please update from the repository and run the code as is?

jflesch commented 9 years ago

It's much better regarding PEP8. However, the Tesseract tests still remain stuck when I run them.

ventsyv commented 9 years ago

What version of tessera ct do u have? What OS? I'll set ip a VM and test it. On Oct 2, 2014 5:22 PM, "Jerome Flesch" notifications@github.com wrote:

It's much better regarding PEP8. However, the Tesseract tests still remain stuck when I run them.

— Reply to this email directly or view it on GitHub https://github.com/jflesch/pyocr/pull/20#issuecomment-57710899.

jflesch commented 9 years ago

I'm using Fedora 20 and Tesseract 3.02.02 (the one provided by Fedora)