gitanat / simple-ocr-opencv

A simple python OCR engine using opencv
GNU Affero General Public License v3.0
525 stars 175 forks source link

Fix opening of files for non-test files #25

Closed gitanat closed 7 years ago

gitanat commented 7 years ago

Migrated from PR 24

It seems the current code for files.py doesn't allow loading specific (non-test) files given a relative path (tested on linux).

There may be more issues at well. @RedFantom can you elaborate on the issues you found?

RedFantom commented 7 years ago

This is an example where it goes wrong with the code in distutils currently. This behaviour was first observed on Windows, and then I pushed to Travis to test on Linux, where the same thing apparently goes wrong. The file is copied to the working directory, then the simpleocr folder is removed to change from a local import to a system-wide package import and then the test is run. The file cannot be found, I think this is because of this code:

class ImageFile(object):
    def __init__(self, image_path):
        good_path = try_extensions(image_path, IMAGE_EXTENSIONS)
        good_path = try_extensions(os.path.join(DATA_DIRECTORY, image_path), IMAGE_EXTENSIONS)

good_path is actually immediately reassigned without doing anything with the value. As it turns out, you were right and the problem is actually not in the changes I made. The problem seems to be with the fact that the library is using relative paths and these paths may actually resolve differently than is expected when a different working directory is used. The only reason it worked before for me was because I was using the correct working directory. The original master code (as at the time of writing) can be seen failing the test here.

I think the best solution for this would be a function that would resolve the correct path, preferably with absolute paths, returning the correct file path or otherwise raising a FileNotFoundError, which, I think, would be more suitable for this purpose than a general Exception.

Update: This is the test code to open a custom file:


from simpleocr.files import ImageFile, DATA_DIRECTORY
import os
from shutil import copyfile
import unittest

class TestImageFile(unittest.TestCase):
    # The other tests go here
    def test_open_custom(self):
        # First copy the file to a reachable folder
        digits_path = os.path.join(DATA_DIRECTORY, "digits1.png")
        destination = os.path.join(os.getcwd(), "digits_custom.png")
        copyfile(digits_path, destination)
        file = ImageFile("digits_custom")
        self.assertIsInstance(file, ImageFile)
        os.remove(destination)
gitanat commented 7 years ago

Thanks for clarifying @RedFantom ! I don't think we can use FileNotFoundError, as it's only defined on python3. We can use IOError, though.

I think it might be a good idea to decouple the fuzzy-file-finding mechanism from the ImageFile class, and have the latter only take correct paths on __init__. Let's fix this once we have the packaging PR merged in, to avoid merge conflicts

gitanat commented 7 years ago

I've pushed a temporary branch with 0f00c4723abdbbff6567f4858ebf79346b75483d @RedFantom can you please check if this fixes the issue for you?

RedFantom commented 7 years ago

While these changes do fix the issue for me, I actually think that it would be more appropriate to give the ImageFile class a complete overhaul, to improve compatibility with Pillow (as it's the most used Python library with images), but also just to allow the user to work with images that are actually not stored on disk but only in memory. My interpretation of such a class looks like this.

gitanat commented 7 years ago

I like the idea of being able to work with images in memory without a corresponding file - this is quite useful if you're just doing the classification part in real time. Getting images from pillow buffers is also nice.

I have a few questions about your implementation, but it's probably better to handle them on a PR, so we keep things organized.

I went through your suggestions on the commit, and I'll check this into master for now, as it fixes the issue with minimal changes, but feel free to open a PR to merge yours