gmberton / CosPlace

Official code for CVPR 2022 paper "Rethinking Visual Geo-localization for Large-Scale Applications"
MIT License
288 stars 59 forks

HDD may cause training to be too slow. #14

Closed BinuxLiu closed 1 year ago

BinuxLiu commented 1 year ago

Hello!

  1. I downloaded SF-XL-small and SF-XL-processed. However, I could not download the test set (116 GB) of SF-XL-processed completely: the first few download attempts were interrupted, and although the last one succeeded, a few images were corrupted. (This problem may only affect researchers in China; our network has some limitations.) So I have a suggestion: you could split the test set into multiple smaller archives. That would make it easier for researchers to follow your work and would increase your citations.

    % test\database\@0549678.37@4180874.60@10@S@037.77386@-122.43590@bZ1VXzxFCSKi9wcz8H5pZA0@201311.jpg - CRC check failed.
    % test\database\@0550100.11@4181039.14@10@S@037.77532@-122.43110@mowOE8WmnpkJQXWhP9otKQ50@201311.jpg - CRC check failed.
    % test\database\@0551862.96@4181924.37@10@S@037.78320@-122.41102@o-Og9bu0EkJjMfWGL8IbWg0@201311.jpg - Data error - the file is corrupted.
    % test\database\@0553675.41@4175945.59@10@S@037.72921@-122.39088@3-nlld1RUGlBx_3PBqBL_A40@201311.jpg - CRC check failed.
    % test\database\@0553675.31@4180966.25@10@S@037.77446@-122.39051@7I8h9hh0LRqG1nwkdRDejg0@201305.jpg - The file is corrupted.
  2. I used a 3090 Ti to train on SF-XL-processed, but training is very slow, which is not consistent with what you reported in #4. Can you give me some suggestions? (I keep the data on a mechanical hard drive; could this be the reason?)

    2022-12-06 09:42:25   train.py --dataset_folder=/mnt/sda2/Datasets/vpr_datasets/sf_xl/processed --backbone resnet18 --use_amp16 --resume_train logs/default/res18/last_checkpoint.pth
    2022-12-06 09:42:25   Arguments: Namespace(L=2, M=10, N=5, alpha=30, augmentation_device='cuda', backbone='resnet18', batch_size=32, brightness=0.7, classifiers_lr=0.01, contrast=0.7, dataset_folder='/mnt/sda2/Datasets/vpr_datasets/sf_xl/processed', device='cuda', epochs_num=50, fc_output_dim=512, groups_num=8, hue=0.5, infer_batch_size=16, iterations_per_epoch=10000, lr=1e-05, min_images_per_class=10, num_workers=8, positive_dist_threshold=25, random_resized_crop=0.5, resume_model=None, resume_train='logs/default/res18/last_checkpoint.pth', saturation=0.7, save_dir='default', seed=0, test_set_folder='/mnt/sda2/Datasets/vpr_datasets/sf_xl/processed/test', train_set_folder='/mnt/sda2/Datasets/vpr_datasets/sf_xl/processed/train', use_amp16=True, val_set_folder='/mnt/sda2/Datasets/vpr_datasets/sf_xl/processed/val')
    2022-12-06 09:42:25   The outputs are being saved in logs/default/2022-12-06_09-42-25
    2022-12-06 09:42:25   There are 1 GPUs and 24 CPUs.
    2022-12-06 09:42:39   Using cached dataset cache/processed_M10_N5_mipc10.torch
    2022-12-06 09:42:59   Using 8 groups
    2022-12-06 09:42:59   The 8 groups have respectively the following number of classes [35790, 35922, 35214, 35526, 35958, 35046, 35520, 35610]
    2022-12-06 09:42:59   The 8 groups have respectively the following number of images [706128, 709044, 688956, 702792, 712152, 695616, 689177, 703554]
    2022-12-06 09:47:05   Validation set: < val - #q: 7983; #db: 8015 >
    2022-12-06 09:47:05   Test set: < test - #q: 1000; #db: 2805839 >
    2022-12-06 09:47:05   Loading checkpoint: logs/default/res18/last_checkpoint.pth
    2022-12-06 09:47:07   Resuming from epoch 8 with best R@1 86.5 from checkpoint logs/default/res18/last_checkpoint.pth
    2022-12-06 09:47:07   Start training ...
    2022-12-06 09:47:07   There are 35790 classes for the first group, each epoch has 10000 iterations with batch_size 32, therefore the model sees each class (on average) 8.9 times per epoch
    2022-12-06 12:28:13   Epoch 08 in 2:41:06, < val - #q: 7983; #db: 8015 >: R@1: 86.2, R@5: 92.8
    2022-12-06 15:00:23   Epoch 09 in 2:32:07, < val - #q: 7983; #db: 8015 >: R@1: 86.6, R@5: 93.2
    2022-12-06 17:35:24   Epoch 10 in 2:34:58, < val - #q: 7983; #db: 8015 >: R@1: 87.1, R@5: 93.6
    2022-12-06 20:13:33   Epoch 11 in 2:38:07, < val - #q: 7983; #db: 8015 >: R@1: 87.3, R@5: 93.7
  3. I noticed that you use an early-stopping strategy in the VG Benchmark, but fix the training to 50 epochs in CosPlace. These settings confuse me.
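As a side note for point 1, an extracted dataset can be scanned for corrupted images with a short script. This is only a sketch: `find_corrupted_images` and the dataset path are made-up names, and it assumes Pillow is installed.

```python
# Sketch: scan a dataset folder for images that fail to decode,
# e.g. after an interrupted download. Requires Pillow.
from pathlib import Path

from PIL import Image


def find_corrupted_images(root):
    """Return the list of .jpg files under `root` that fail to decode."""
    corrupted = []
    for path in Path(root).rglob("*.jpg"):
        try:
            with Image.open(path) as img:
                img.load()  # force a full decode, not just a header read
        except OSError:
            corrupted.append(path)
    return corrupted


if __name__ == "__main__":
    # Hypothetical dataset path; adjust to your own layout.
    for path in find_corrupted_images("datasets/sf_xl/processed/test"):
        print(path)
```

Running this right after extraction gives the full list of files to re-download, rather than discovering them one at a time during training.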

Looking forward to your reply! Thank you!

gmberton commented 1 year ago

Hello, thank you for this detailed issue!

  1. In the coming days I will split the test set into multiple files to make it easier to download. Regarding the lost images, are those 5 the only ones? If only a few images are missing, I can send them to you privately.

  2. The training time you are getting is far slower than it should be. To check whether data loading is the bottleneck, comment out every step in the training loop except the data loading. In this code, comment everything after this line, then launch the script again. If it is much faster, then data loading was not the issue. If it is still slow (i.e. ~2 hours per epoch), then data loading is the issue, and it might be due to a non-performant HDD.

  3. The benchmark is a different paper from the one in this repo. In the benchmark we mostly focus on previous literature, which builds on the seminal NetVLAD paper. Those methods use early stopping, so we use early stopping in the benchmark as well. CosPlace, on the other hand, is a newer paper that presents a new method. We like to keep things simple, so we don't use early stopping, although using it might improve the results slightly.
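The check in point 2 can be sketched roughly like this: time a loop that only pulls batches from the DataLoader, with no forward or backward pass. The dataset object, batch size, and worker count below are placeholders, not CosPlace's exact training code.

```python
# Sketch: time the DataLoader alone to see whether disk I/O is the
# bottleneck. Pass in any torch Dataset (e.g. the training dataset).
import time

from torch.utils.data import DataLoader


def time_data_loading(dataset, batch_size=32, num_workers=8, num_batches=200):
    """Iterate `num_batches` batches without any model work and report speed."""
    loader = DataLoader(dataset, batch_size=batch_size,
                        num_workers=num_workers, shuffle=True)
    start = time.perf_counter()
    for i, _batch in enumerate(loader):
        if i + 1 >= num_batches:
            break
    elapsed = time.perf_counter() - start
    print(f"{num_batches} batches in {elapsed:.1f}s "
          f"({num_batches * batch_size / elapsed:.0f} images/s)")
    return elapsed
```

If this loop alone already takes on the order of the full epoch time, the GPU is starved by the disk and moving the dataset to an SSD (or raising `num_workers`) is the fix; if it is much faster, the bottleneck is elsewhere.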

I hope this helps! Let me know if everything is clear.

BinuxLiu commented 1 year ago

Thank you! Your answers and suggestions are very helpful.

  1. Actually, I was missing 8 images. It would be great if you could send them to me. Thank you very much! My email: binuxliu@gmail.com
    test\database\@0549678.37@4180874.60@10@S@037.77386@-122.43590@bZ1VXzxFCSKi9wcz8H5pZA0@201311.jpg - CRC check failed.
    test\database\@0550100.11@4181039.14@10@S@037.77532@-122.43110@mowOE8WmnpkJQXWhP9otKQ50@201311.jpg - CRC check failed.
    test\database\@0551862.96@4181924.37@10@S@037.78320@-122.41102@o-Og9bu0EkJjMfWGL8IbWg0@201311.jpg - Data error - the file is corrupted.
    test\database\@0553675.41@4175945.59@10@S@037.72921@-122.39088@3-nlld1RUGlBx_3PBqBL_A40@201311.jpg - CRC check failed.
    test\database\@0553675.31@4180966.25@10@S@037.77446@-122.39051@7I8h9hh0LRqG1nwkdRDejg0@201305.jpg - The file is corrupted.
    37.76\@0547118.43@4180366.93@10@S@037.76942@-122.46500@_7Q7YI50kU61sIvvbI7B8A20@201711.jpg - Data error - the file is corrupted.
    37.76\@0547118.27@4179778.88@10@S@037.76412@-122.46504@waWUF2LTOvu9UP-bz6nwHQ20@201508.jpg - The file is corrupted.
    37.75\@0549255.34@4178565.35@10@S@037.75307@-122.44086@rIuCm3LeWaat0tSmjip-fA0@201904.jpg - CRC check failed.
  2. I tried commenting out the training steps as you suggested, and it seems my HDD is indeed causing the issue.
  3. I see.