thuml / HashNet

Code release for "HashNet: Deep Learning to Hash by Continuation" (ICCV 2017)
MIT License

False sampling of data #41

Open hbellafkir opened 4 years ago

hbellafkir commented 4 years ago

Hi,

I just found out that all images in the query list are also in the database list, which is not acceptable for a fair evaluation.

Thanks

vinnik-dmitry07 commented 1 year ago

I second this.

```python
prefix = 'D:/Downloads/HashNet-master/HashNet-master/pytorch/data/'
for dataset in ['imagenet', 'coco', 'nuswide_81']:
    with open(prefix + f'{dataset}/train.txt', 'r') as f:
        train = set(f.read().splitlines())
    with open(prefix + f'{dataset}/test.txt', 'r') as f:
        test = set(f.read().splitlines())
    with open(prefix + f'{dataset}/database.txt', 'r') as f:
        database = set(f.read().splitlines())
    print(dataset, len(train.intersection(database)))
    print(dataset, len(test.intersection(database)))
    print(dataset, len(test.intersection(train)))
```

Output:

```
imagenet 13000
imagenet 0
imagenet 0
coco 0
coco 5000
coco 0
nuswide_81 10000
nuswide_81 0
nuswide_81 0
```

At test time, test.txt is used as the query set and database.txt as the retrieval set. The two should not intersect, but for COCO they do: 5000 lines of test.txt also appear in database.txt.
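One possible workaround, sketched here under assumptions rather than taken from the repository, is to build a filtered copy of database.txt that leaves out every line also present in test.txt. It matches whole lines, the same way the check above does. The function names and the output file name `database_disjoint.txt` are hypothetical, and the original list files are not modified.

```python
from pathlib import Path


def drop_query_entries(query_lines, database_lines):
    """Return database_lines without any line that also occurs in query_lines.

    The order of the remaining database entries is preserved.
    """
    query = set(query_lines)
    return [line for line in database_lines if line not in query]


def write_disjoint_database(data_dir, out_name="database_disjoint.txt"):
    """Write a filtered copy of database.txt next to the original.

    `out_name` is a hypothetical file name chosen for this sketch.
    Returns the number of database entries that were removed.
    """
    data_dir = Path(data_dir)
    query = (data_dir / "test.txt").read_text().splitlines()
    database = (data_dir / "database.txt").read_text().splitlines()
    kept = drop_query_entries(query, database)
    (data_dir / out_name).write_text("\n".join(kept) + "\n")
    return len(database) - len(kept)
```

Calling `write_disjoint_database(prefix + 'coco')` would then produce a retrieval list with no query lines in it. Evaluating against a smaller database changes the retrieval task, so numbers obtained this way are not directly comparable to results computed on the original lists.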