oravus / seqNet

SeqNet: Code for the RA-L (ICRA) 2021 paper "SeqNet: Learning Descriptors for Sequence-Based Hierarchical Place Recognition"

dataset split issues #8

Closed XinLan12138 closed 2 years ago

XinLan12138 commented 2 years ago

Hi @oravus, thanks for your help. I have generated the needed db files using the pitts250k dataset. In general, I use the whole dataset to generate descriptors and save both the query and ref .npy files. That is:

```
===> Loading dataset(s)
All Db descs: (254064, 4096)
All Qry descs: (24000, 4096)
```

Next, I used the first 10000 images as the train set, followed by 3000 images as the validation set and 3000 images as the test set. I also generated 3 .db files to serve as train_mat_file, test_mat_file, and val_mat_file.

Thereafter, I wrote the specification in get_datasets.py. The indices are defined in the same format as the Nordland dataset: `trainInds, testInds, valInds = np.arange(10000), np.arange(10000,13000), np.arange(13000,16000)`

I thought I could then test the dataset using your pretrained model, but an index-related error still occurs:

```
Restored flags: ['--optim', 'SGD', '--lr', '0.0001', '--lrStep', '50', '--lrGamma', '0.5', '--weightDecay', '0.001', '--momentum', '0.9', '--seed', '123', '--runsPath', './data/runs', '--savePath', './data/runs/Jun03_15-22-44_l10_l10_w5_seqnetEnv/checkpoints', '--patience', '0', '--pooling', 'seqnet', '--w', '5', '--outDims', '4096', '--margin', '0.1']
Namespace(batchSize=16, cacheBatchSize=24, cachePath='./data/cache', cacheRefreshRate=0, ckpt='latest', dataset='pitts250k', descType='netvlad-pytorch', evalEvery=1, expName='0', extractOnly=False, lr=0.0001, lrGamma=0.5, lrStep=50.0, margin=0.1, mode='test', momentum=0.9, msls_trainCity='melbourne', msls_valCity='austin', nEpochs=200, nGPU=1, nocuda=False, numSamples2Project=-1, optim='SGD', outDims=4096, patience=0, pooling='seqnet', predictionsFile=None, resultsPath=None, resume='./data/runs/Jun03_15-22-44_l10_w5/', runsPath='./data/runs', savePath='./data/runs/Jun03_15-22-44_l10_l10_w5_seqnetEnv/checkpoints', seed=123, seqL=5, seqL_filterData=None, split='test', start_epoch=0, threads=8, w=5, weightDecay=0.001)
===> Loading dataset(s)
All Db descs: (254064, 4096)
All Qry descs: (24000, 4096)
===> Evaluating on test set
====> Query count: 800
===> Building model
=> loading checkpoint './data/runs/Jun03_15-22-44_l10_w5/checkpoints/checkpoint.pth.tar'
=> loaded checkpoint './data/runs/Jun03_15-22-44_l10_w5/checkpoints/checkpoint.pth.tar' (epoch 200)
===> Running evaluation step
====> Extracting Features
==> Batch (50/250)
==> Batch (100/250)
==> Batch (150/250)
==> Batch (200/250)
==> Batch (250/250)
Average batch time: 0.006786982536315918 0.009104941585046229
torch.Size([3000, 4096]) torch.Size([3000, 4096])
====> Building faiss index
====> Calculating recall @ N
Using Localization Radius: 25
Traceback (most recent call last):
  File "main.py", line 133, in <module>
    recallsOrDesc, dbEmb, qEmb, rAtL, preds = test(opt, model, encoder_dim, device, whole_test_set, writer, epoch, extract_noEval=opt.extractOnly)
  File "/home/lx/lx/Seqnet_new/test.py", line 138, in test
    rAtL.append(getRecallAtN(n_values, predictions, gtAtL))
  File "/home/lx/lx/Seqnet_new/test.py", line 37, in getRecallAtN
    if len(gt[qIx]) == 0:
IndexError: list index out of range
```

I am writing to ask:

  1. How should the indices be defined? Could you use the Oxford or Nordland dataset to explain the details for me?
  2. Should the .npy files contain the same number of descriptors, given that I generated the dataset as 3 separate .db files for train, test, and val?

THANKS FOR THE HELP!

XinLan12138 commented 2 years ago

Just to be more specific, my dataset specification is as follows:

elif 'pitts250k' in opt.dataset.lower():
    dataset = Dataset('pitts250k', 'pitts250k_train_new.db', 'pitts250k_test_new.db', 'pitts250k_val_new.db', opt)  # train, test, val structs

    # load the precomputed reference (db) and query descriptors
    ref, qry = 'ref', 'qry'
    ft1 = np.load(join(prefix_data,"descData/{}/pitts250k-{}.npy".format(opt.descType,ref)))
    ft2 = np.load(join(prefix_data,"descData/{}/pitts250k-{}.npy".format(opt.descType,qry)))
    trainInds, testInds, valInds = np.arange(10000), np.arange(10000,13000), np.arange(13000,16000)

    # [reference-side indices, query-side indices] for each split
    dataset.trainInds = [trainInds, trainInds]
    dataset.valInds = [valInds, valInds]
    dataset.testInds = [testInds, testInds]
    encoder_dim = dataset.loadPreComputedDescriptors(ft1,ft2)
oravus commented 2 years ago

Hi @XinLan12138,

  1. It seems that your db files are missing utmDb and utmQ fields.
  2. The .npy files have shape N x 4096, where N is the number of images listed in the corresponding file in imageNamesFiles. The train/val/test splits use indices (in the range [0, N-1]) that index into the descriptor data corresponding to the images referenced by their respective db files (see the sketch below). For Nordland, because of its simplicity, we define the splits directly in code. For Oxford, we load them separately. For MSLS, a split is defined for a whole city, so the indices cover the whole range of data used from that city.
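
For illustration, here is a minimal sketch (not the repository's actual loading code) of the consistency expected between the descriptor arrays, the split indices, and the db structs; the descriptor paths follow the snippet shared above:

```python
import numpy as np

# Descriptor arrays: one row per image listed in the corresponding
# imageNamesFile (paths as in the snippet shared above).
db_descs = np.load("descData/netvlad-pytorch/pitts250k-ref.npy")  # (N_db, 4096)
q_descs  = np.load("descData/netvlad-pytorch/pitts250k-qry.npy")  # (N_q, 4096)

# Split indices are plain ranges into those arrays, so they must stay
# inside [0, N-1] on both the reference and the query side.
trainInds = np.arange(10000)
assert trainInds.max() < db_descs.shape[0]
assert trainInds.max() < q_descs.shape[0]

# Each split's db struct is also expected to carry one UTM coordinate per
# referenced image (utmDb for reference images, utmQ for queries). If those
# fields are missing or shorter than the split, ground-truth lookups such as
# gt[qIx] can run past the end of the list, which matches the IndexError in
# the traceback above.
```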

Since all the datasets are somewhat unique, get_datasets.py does the dataset-specific handling. What is applicable to the Nordland dataset (which has one-to-one frame correspondence across traverses) may not be right for others. So the Pittsburgh settings you shared might run, but could still be incorrect in how they are used. Of all the differences, the main one here is the lack of sequential traverses in Pittsburgh, as briefly discussed earlier. May I ask what you intend to do with the Pittsburgh dataset, so that I can point you in the right direction?

EDIT: I revised the description of the second point; please check the edit history.

XinLan12138 commented 2 years ago

Hi @oravus. From your description, I now understand how the dataset specifications fit together.
Initially, I settled on pitts250k because I also work with the PyTorch version of NetVLAD (https://github.com/Nanne/pytorch-NetVlad) and hoped to compare the two models on pitts250k. I downloaded the pitts250k dataset specifications (their .mat structures) and found the utmDb and utmQ information there. I had not realized what you meant by sequential traverses before :) What I plan to do is compare several models: sequence-based RGB image models like yours, PointNetVLAD, which is based on point cloud information, models that use point clouds projected into 2D images, etc. I am writing a dissertation on this topic. I hope you can give me some suggestions on what I should focus on!

thanks for your patience!

oravus commented 2 years ago

Hi @XinLan12138,

With sequences, I meant that we assume images are collected as a data stream from a forward-facing camera mounted on a vehicle driving down a road. At any unique GPS coordinate, a part of this data stream can then be considered a short sequence of images with overlapping views of the environment at that location. This is not the case with the Pitts250K dataset, so it might not be possible to define image sequences in that dataset the way SeqNet uses them (in line with prior similar work in this field). The code indexes data as a sequence of L frames at every index, which might not be meaningful for Pitts250K.
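
To illustrate what indexing "a sequence of L frames at every index" implies, here is a minimal sketch; the windowing convention shown (a centred window) is an assumption for illustration and may differ from the exact one used in SeqNet's data loader:

```python
import numpy as np

def sequence_at(descs, idx, seqL=5):
    # Take a window of seqL consecutive descriptors around idx.
    # Illustrative only; the exact window placement in SeqNet's loader may differ.
    half = seqL // 2
    lo, hi = idx - half, idx - half + seqL
    if lo < 0 or hi > len(descs):
        raise IndexError("sequence window falls outside the traverse")
    return descs[lo:hi]  # shape: (seqL, descriptor_dim)

# This only makes sense when consecutive rows are consecutive frames of one
# continuous traverse; Pitts250K images are not organized that way.
descs = np.random.rand(100, 4096).astype(np.float32)
print(sequence_at(descs, 50).shape)  # (5, 4096)
```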

If you had something else in mind and I have misunderstood, please let me know.

oravus commented 2 years ago

Closing it now, please feel free to reopen if needed.