About the division of test set and validation set

wpumain commented 1 year ago

In the paper, you describe dividing the dataset like this:

• MSLS Test set: Copenhagen, San Francisco: the former official validation set;
• MSLS Validation set: Amsterdam, Manila: these two cities were released in [4] as part of the training set. We used as criteria to choose them (i) size comparable to the test split (ii) use two cities, like the formerly proposed validation set, from two different continents;
• MSLS Train set: same as the official one from [4], with the exception of the 2 cities chosen for validation.

It would be entirely possible to divide the training set, validation set and test set in one pass in the 1_reformat_mapillary.py file: https://github.com/vandal-vpr/vg-transformers/blob/c57fca9085a8dfcfd74e21ddea8f6722940de5dd/main_scripts/msls/1_reformat_mapillary.py#L70

But why does the script, in practice, only separate the training set from the non-training set? https://github.com/vandal-vpr/vg-transformers/blob/c57fca9085a8dfcfd74e21ddea8f6722940de5dd/main_scripts/msls/1_reformat_mapillary.py#L113

Is the 2_reformat_testset_msls.py file used to divide the training set and the validation set? What does this annotation mean? https://github.com/vandal-vpr/vg-transformers/blob/c57fca9085a8dfcfd74e21ddea8f6722940de5dd/main_scripts/msls/2_reformat_testset_msls.py#L52 The training set, validation set, and test set are clearly defined in the paper. Why do we need to change the division according to different situations?

ga1i13o commented 1 year ago

Hello, yes you are right, all the reformatting could have been done in a single script. I split it in two so that each script has a clearer and simpler job (that was my opinion; perhaps you find it more confusing). Also, the first script may take a while, especially if you create a duplicate copy rather than just moving files, whereas the second one is quite fast.

The first script's only job is to reformat the file system from train_val/city/database/images to train/database/sequence_id/images. The distinction between train and non-train exists simply because the original MSLS release has the folders train_val and test, although the labels for the test set were never released.
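The path change described above can be sketched roughly as follows. This is a minimal illustration, not the code from 1_reformat_mapillary.py: the function name reformatted_path, the seq_of_image lookup (image key to sequence id), and the choice to place image files directly inside the sequence folder are all assumptions made for the example.

```python
from pathlib import PurePosixPath
from typing import Dict


def reformatted_path(src: str, seq_of_image: Dict[str, str], split: str = "train") -> PurePosixPath:
    """Map <root>/<city>/<database|query>/images/<file> to <split>/<role>/<sequence_id>/<file>.

    Hypothetical helper: `seq_of_image` maps an image key (the file stem)
    to its sequence id and must be built by the caller from the metadata.
    """
    parts = PurePosixPath(src).parts
    if len(parts) < 4:
        raise ValueError(f"unexpected path layout: {src}")
    # The last four components are: city, role (database or query), 'images', file name.
    _city, role, _images, filename = parts[-4:]
    image_key = PurePosixPath(filename).stem
    # Assumption: images are stored directly under the sequence folder.
    return PurePosixPath(split) / role / seq_of_image[image_key] / filename


example = reformatted_path(
    "train_val/amsterdam/database/images/abc123.jpg",
    {"abc123": "seq_01"},
)
print(example)
```

Computing the destination as a pure function keeps the slow part (copying or moving files) separate, so the mapping can be checked on a few paths before touching the disk.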

The second script is used to obtain any arbitrary split, by assigning each city to the desired split. You are right that I could have just hardcoded the split we used throughout all the experiments in our paper, but I thought I could give people an easy way to create splits as they please.

The annotation you ask about tries to explain how the dictionary moves works. The keys of the dictionary are tuples (origin split, dest split), and the values are the cities that will go from origin to dest. For example, the line ('train', 'val'): ['amsterdam', 'manila'], means: move amsterdam and manila from train to val.
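To make the semantics of such a dictionary concrete, here is a small sketch that applies it to an in-memory description of the splits instead of to folders on disk. Only the ('train', 'val') entry comes from the explanation above; the apply_moves helper and the example city sets are hypothetical and are not taken from 2_reformat_testset_msls.py.

```python
from typing import Dict, List, Set, Tuple

# Keys are (origin split, dest split); values are the cities to move.
moves: Dict[Tuple[str, str], List[str]] = {
    ("train", "val"): ["amsterdam", "manila"],
}


def apply_moves(
    splits: Dict[str, Set[str]],
    moves: Dict[Tuple[str, str], List[str]],
) -> Dict[str, Set[str]]:
    """Return a new split assignment with every listed city moved from origin to dest."""
    result = {name: set(cities) for name, cities in splits.items()}
    for (origin, dest), cities in moves.items():
        for city in cities:
            if city not in result.get(origin, set()):
                raise ValueError(f"{city} is not in split '{origin}'")
            result[origin].remove(city)
            result.setdefault(dest, set()).add(city)
    return result


# Illustrative starting point (not the full MSLS city list).
splits = {"train": {"amsterdam", "manila", "berlin", "tokyo"}, "val": set()}
new_splits = apply_moves(splits, moves)
print(sorted(new_splits["val"]))
```

Adding another key, for instance a hypothetical ('val', 'train') entry, would move cities back in the opposite direction, which is how the same mechanism can produce any custom split.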

When you ask "Why do we need to change the division according to different situations?", the answer is that if you want to reproduce our results, you can run the script only once and never change the divisions.

You can re-run the script if you want to obtain different splits for your own purposes.

The reason why we created different splits from the original MSLS ones is that, as I said, the test set labels were never released by the authors.

Hope I was clear and that this helps you.

wpumain commented 1 year ago

Thank you for your help! You said that the test set labels were never released by the authors of MSLS. But why is there a test folder in the MSLS dataset I downloaded? I downloaded it from https://www.mapillary.com/dataset/places and the file structure is as follows:

├── msls_checksums.md5
├── msls_images_vol_1
│   ├── test
│   │   ├── athens
│   │   ├── bengaluru
│   │   ├── kampala
│   │   └── stockholm
│   └── train_val
│       ├── amsterdam
│       ├── budapest
│       ├── goa
│       ├── moscow
│       ├── paris
│       └── zurich
├── msls_images_vol_2
│   ├── test
│   │   └── kampala
│   └── train_val
│       ├── amman
│       ├── amsterdam
│       ├── boston
│       ├── goa
│       ├── nairobi
│       ├── ottawa
│       ├── phoenix
│       ├── saopaulo
│       ├── tokyo
│       ├── toronto
│       └── trondheim
├── msls_images_vol_3
│   ├── test
│   │   └── buenosaires
│   └── train_val
│       ├── austin
│       ├── budapest
│       ├── melbourne
│       ├── saopaulo
│       └── sf
├── msls_images_vol_4
│   ├── test
│   │   ├── buenosaires
│   │   └── miami
│   └── train_val
│       ├── austin
│       ├── bangkok
│       ├── berlin
│       ├── boston
│       ├── cph
│       ├── helsinki
│       ├── manila
│       ├── sf
│       ├── tokyo
│       └── toronto
├── msls_images_vol_5
│   ├── test
│   │   └── athens
│   └── train_val
│       ├── bangkok
│       ├── london
│       ├── moscow
│       ├── paris
│       └── phoenix
├── msls_images_vol_6
│   ├── test
│   │   └── miami
│   └── train_val
│       ├── helsinki
│       ├── london
│       ├── manila
│       ├── trondheim
│       └── zurich
├── msls_metadata
│   ├── test
│   │   ├── athens
│   │   ├── bengaluru
│   │   ├── buenosaires
│   │   ├── kampala
│   │   ├── miami
│   │   └── stockholm
│   └── train_val
│       ├── amman
│       ├── amsterdam
│       ├── austin
│       ├── bangkok
│       ├── berlin
│       ├── boston
│       ├── budapest
│       ├── cph
│       ├── goa
│       ├── helsinki
│       ├── london
│       ├── manila
│       ├── melbourne
│       ├── moscow
│       ├── nairobi
│       ├── ottawa
│       ├── paris
│       ├── phoenix
│       ├── saopaulo
│       ├── sf
│       ├── tokyo
│       ├── toronto
│       ├── trondheim
│       └── zurich
└── msls_patch_v1.1
    ├── LICENSE.txt
    └── train_val
        ├── berlin
        └── melbourne

Judging from the downloaded MSLS files, they put the train and val data together in one folder, but they keep the test set separate.

ga1i13o commented 1 year ago

The test folder exists because the images are indeed present. What is missing is the .csv with the ground truths, so you cannot do any evaluation on those images.

wpumain commented 1 year ago

Thank you very much for your help

vandal-vpr / vg-transformers

About the division of test set and validation set #8