rhsimplex / image-match

🎇 Quickly search over billions of images

Bulk insertion #115

Closed by Davidramsey03 5 years ago

Davidramsey03 commented 5 years ago

The readme suggests that image-match is capable of high speed image insertion into the database.

I have everything working properly, but I am only achieving a sustained 3 images per second using the add_image function.

For reference, here's the function I'm currently using to add images (it walks through subdirectories and adds every image it finds):

import os

def addImagesRecursive(topDirectory):
    if os.path.isdir(topDirectory):
        count=0
        for root, dirs, files in os.walk(topDirectory):
            for name in files:
                if name.endswith('.jpg'):
                    ses.add_image(os.path.join(root, name))
        print("Total added: ", count)

Is there something I'm doing wrong that's making this add so slow? (New to Python, so go easy if it's obvious...) Thanks in advance for any input.

drakerc commented 5 years ago

Are you using an Elasticsearch repository? As far as adding an image goes, your function looks fine. For me, it takes about 40 ms per image, and that includes downloading each image from an external CDN plus an additional step that searches my Elasticsearch for duplicates before insertion (I'm only adding unique images), so 3 images per second sounds really slow.

Are you using Windows? As far as I know, there are some performance issues when using os.path.isdir and os.walk. Another thing is that the recommended style in Python is EAFP (easier to ask for forgiveness than permission). So instead of checking whether something is a directory, just wrap the call in try/except (this code has not been tested):

def add_images_recursively(top_directory):
    try:
        count = 0
        for entry in os.scandir(top_directory):
            if entry.is_file() and entry.name.endswith('.jpg'):
                ses.add_image(entry.path)
                count += 1  # your function does not increment the count by the way
        print("Total added: ", count)
    except OSError:
        pass  # not a directory
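
One caveat about the snippet above: os.scandir lists a single directory level only, so unlike os.walk it does not descend into subdirectories. A minimal recursive sketch follows. The `add_image` parameter is a hypothetical stand-in for `ses.add_image`, and the case-insensitive extension check and the returned count are assumptions added for illustration, not part of image-match.

```python
import os


def add_images_recursively(top_directory, add_image):
    """Call add_image(path) for every .jpg under top_directory; return the count.

    add_image is a hypothetical stand-in for ses.add_image.
    """
    try:
        # Only the directory listing is guarded, so errors raised by
        # add_image itself are not silently swallowed.
        with os.scandir(top_directory) as it:
            entries = list(it)
    except NotADirectoryError:
        return 0  # top_directory is a file, not a directory

    count = 0
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            # os.scandir is not recursive, so descend explicitly.
            count += add_images_recursively(entry.path, add_image)
        elif entry.is_file() and entry.name.lower().endswith('.jpg'):
            add_image(entry.path)
            count += 1
    return count
```

Called as `add_images_recursively(top_directory, ses.add_image)`, this should visit the same .jpg files as the os.walk version while keeping the EAFP style.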

If this does not speed things up, then the problem is probably in your MongoDB/Elasticsearch setup, or perhaps in the hard drive or the image sizes.
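
To tell whether the time goes into the insertion call or into the directory traversal, it may help to time the calls on their own. The helper below is only a sketch; `time_calls` is a hypothetical name, and it assumes the image paths have already been collected into a list.

```python
import time


def time_calls(func, items):
    """Call func(item) for each item and return (calls, total_seconds)."""
    calls = 0
    start = time.perf_counter()
    for item in items:
        func(item)
        calls += 1
    return calls, time.perf_counter() - start
```

For example, `calls, seconds = time_calls(ses.add_image, paths)` followed by `1000 * seconds / calls` gives the average milliseconds per image. At 3 images per second that figure would be roughly 333 ms, compared with the 40 ms mentioned above.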

Davidramsey03 commented 5 years ago

Thanks for the advice.

I actually wrote and tested several variants using your advice, and tried them on 2 devices (laptop and desktop) and on 2 OSes (Windows and Linux). It turns out the Windows install on my desktop vastly outperforms the laptop install, and os.scandir and os.walk actually perform similarly to each other.

On the more powerful desktop machine, in Windows, the same code is averaging approximately 21 insertions per second, so it seems to be a system resources issue for me at this point.
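
The similar timings are plausible: since Python 3.5, os.walk is itself implemented on top of os.scandir. A comparison along these lines could be sketched as follows; the function names are hypothetical, and only the file listing is measured, not the insertion.

```python
import os
import time


def list_jpgs_walk(top):
    """Collect .jpg paths under top using os.walk."""
    return [os.path.join(root, name)
            for root, _dirs, files in os.walk(top)
            for name in files if name.endswith('.jpg')]


def list_jpgs_scandir(top):
    """Collect .jpg paths under top using os.scandir and an explicit stack."""
    found = []
    stack = [top]
    while stack:
        with os.scandir(stack.pop()) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.name.endswith('.jpg'):
                    found.append(entry.path)
    return found


def time_listing(lister, top):
    """Return (number_of_paths, seconds) for one run of lister(top)."""
    start = time.perf_counter()
    paths = lister(top)
    return len(paths), time.perf_counter() - start
```

Running `time_listing(list_jpgs_walk, top)` and `time_listing(list_jpgs_scandir, top)` on the same tree gives a rough comparison. If both are small next to the total run time, the traversal is not the bottleneck.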

Thanks again for your help.