rhsimplex / image-match

🎇 Quickly search over billions of images
2.94k stars 405 forks source link

add delete_duplicates method #44

Closed jtara1 closed 8 years ago

jtara1 commented 8 years ago

When testing image_match module out I noticed I'd add duplicate images with the same path parameter.

This method will remove all (except for one) entries in the elasticsearch given a path.

rhsimplex commented 8 years ago

@jtara1 thanks for the PR, this is a handy function.

Anyway, we'll need a test to cover the new functionality. If you'd like to do it, have a look at https://github.com/ascribe/image-match/blob/master/tests/test_elasticsearch_driver.py otherwise I'm happy to do it myself, just know that I could be a few days to a week before I can get to it.

Thanks again for your work!

jtara1 commented 8 years ago

I'll write a test for it within the next two days. Not too much experience writing tests with pytest though.

I have another commit that allows me to refresh elasticsearch after I call add_image method by passing refresh=True. It's useful to me in another project, and I could add that commit here.

rhsimplex commented 8 years ago

Something like this should do it. Instantiation of ses and test1.jpg are handled by the test setup

def test_duplicate_removal(ses):
    for i in range(10):
        ses.add_image('test1.jpg')
    sleep(1)
    r = ses.search_image('test1.jpg')
    assert len(r) == 10
    ses.delete_duplicates('test1.jpg')
    sleep(1)
    r = ses.search_image('test1.jpg')
    assert len(r) == 1

Good call on the refresh parameter. We'd be very happy if you could make another small PR for it. Only change is that the default should be false (to keep inserts fast when thoroughput is a consideration). A good test for that would be:

def test_index_refresh(ses):
    ses.add_image('test1.jpg')
    r = ses.search_image('test1.jpg', refresh_after=True)
    assert len(r) == 1

Since that will test if the thing is available immediately (without the sleep). Once we merge that, then I can remove all the stupid sleep calls from the tests =)