lucas-ventura / CoVR

Official PyTorch implementation of the paper "CoVR: Learning Composed Video Retrieval from Web Video Captions".
https://imagine.enpc.fr/~ventural/covr/
MIT License

Does batch size have a significant impact on performance? #12

Closed chan-kor closed 4 months ago

chan-kor commented 5 months ago

hello.

I was very impressed with the CoVR task and results proposed by the authors, so I tried to reproduce the code.

The code ran without problems, but my results fall quite short of the highlighted part of Table 2 below.

My results on the WebVid dataset:

{'R1': 39.7868, 'R5': 64.722, 'R10': 74.2869, 'R50': 91.4578, 'R_mean': 59.5986} after 5 epochs.
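As a minimal sketch (not the repo's actual evaluation code), the recall numbers above can be computed from a query-by-gallery similarity matrix in which `sims[i, i]` is each query's correct target. Note that `R_mean` in this thread averages R@1, R@5, and R@10 only ((39.79 + 64.72 + 74.29) / 3 ≈ 59.60):

```python
import numpy as np

def recall_at_k(sims: np.ndarray, ks=(1, 5, 10, 50)) -> dict:
    """Recall@K, assuming the ground-truth target of query i is column i."""
    # Rank of the ground-truth target for each query (0 = retrieved first).
    ranks = (sims > np.diag(sims)[:, None]).sum(axis=1)
    scores = {f"R{k}": 100.0 * float((ranks < k).mean()) for k in ks}
    # R_mean as used in this thread: mean of R@1, R@5, R@10.
    scores["R_mean"] = float(np.mean([scores["R1"], scores["R5"], scores["R10"]]))
    return scores
```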

Authors' results: [screenshot of Table 2]

I know that batch size affects contrastive learning losses (NCE, HN-NCE, etc.), but I didn't expect this much of a difference. Have you ever checked how the recall scores vary with batch size? You mentioned a batch size of 2048 in the paper.
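The batch-size dependence comes from the fact that in-batch contrastive losses treat every other target in the batch as a negative, so a batch of B gives only B − 1 negatives per query. A plain InfoNCE sketch (not the repo's `HardNegativeNCE`, which additionally reweights hard negatives) illustrates this:

```python
import torch
import torch.nn.functional as F

def info_nce(query_embs: torch.Tensor, target_embs: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Plain InfoNCE: positives on the diagonal, B - 1 in-batch negatives."""
    q = F.normalize(query_embs, dim=-1)
    t = F.normalize(target_embs, dim=-1)
    logits = q @ t.T / temperature      # (B, B) similarity matrix
    labels = torch.arange(q.size(0))    # positive pairs sit on the diagonal
    return F.cross_entropy(logits, labels)
```

With batch size 48 instead of 2048, each query is contrasted against 47 negatives instead of 2047, which weakens the training signal of the loss.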

I ask because, due to the limited GPU memory available to me (at most 48 GB of VRAM on a single A6000), I set the training batch size to 48, which seems to have caused a big performance drop.

Again, thanks for the great research and I look forward to your response.

chan-kor commented 5 months ago

I used the default settings from GitHub and only changed the batch size.

```json
{
  "experiment": "tv-False_loss-hnnce_lr-1e-05",
  "run_name": "base",
  "seed": 1234,
  "logger_level": "INFO",
  "paths": {
    "root_dir": ".",
    "work_dir": "/mnt/c/Users/bclab/Desktop/CoVR-master",
    "output_dir": "/mnt/c/Users/bclab/Desktop/CoVR-master/outputs/webvid-covr/blip-large/blip-l-coco/tv-False_loss-hnnce_lr-1e-05/base",
    "datasets_dir": "/mnt/c/Users/bclab/Desktop/CoVR-master/datasets/",
    "log_dir": "./logs/"
  },
  "val": true,
  "data": {
    "dataname": "webvid-covr",
    "target": "src.data.webvid_covr.WebVidCoVRDataModule",
    "image_size": 384,
    "iterate": "pth2",
    "vid_query_method": "middle",
    "vid_frames": 1,
    "emb_pool": "query",
    "dataset_dir": "/mnt/d/dataset/covr/datasets/WebVid",
    "batch_size": 64,
    "num_workers": 4,
    "annotation": {
      "train": "/mnt/d/dataset/covr/annotation/webvid-covr/webvid2m-covr_train.csv",
      "val": "/mnt/d/dataset/covr/annotation/webvid-covr/webvid8m-covr_val.csv"
    },
    "vid_dirs": {
      "train": "/mnt/d/dataset/covr/datasets/WebVid/2M/train",
      "val": "/mnt/d/dataset/covr/datasets/WebVid/8M/train"
    },
    "emb_dirs": {
      "train": "/mnt/d/dataset/covr/datasets/WebVid/2M/blip-vid-embs-large-all",
      "val": "/mnt/d/dataset/covr/datasets/WebVid/8M/blip-vid-embs-large-all"
    }
  },
  "machine": {
    "paths": {
      "root_dir": ".",
      "work_dir": "/mnt/c/Users/bclab/Desktop/CoVR-master",
      "output_dir": "/mnt/c/Users/bclab/Desktop/CoVR-master/outputs/webvid-covr/blip-large/blip-l-coco/tv-False_loss-hnnce_lr-1e-05/base",
      "datasets_dir": "/mnt/c/Users/bclab/Desktop/CoVR-master/datasets/",
      "log_dir": "./logs/"
    },
    "name": "server",
    "batch_size": 64,
    "num_workers": 4
  },
  "trainer": {
    "default_root_dir": "/mnt/c/Users/bclab/Desktop/CoVR-master/outputs/webvid-covr/blip-large/blip-l-coco/tv-False_loss-hnnce_lr-1e-05/base",
    "max_epochs": 5,
    "accelerator": "gpu",
    "devices": 1,
    "precision": "32-true",
    "log_interval": 1,
    "print_interval": 10,
    "save_ckpt": "all",
    "fabric": {
      "target": "lightning.Fabric",
      "accelerator": "gpu",
      "devices": 1,
      "precision": "32-true",
      "loggers": {
        "target": "lightning.pytorch.loggers.csv_logs.CSVLogger",
        "save_dir": "/mnt/c/Users/bclab/Desktop/CoVR-master/outputs/webvid-covr/blip-large/blip-l-coco/tv-False_loss-hnnce_lr-1e-05/base",
        "name": "csv/",
        "prefix": ""
      }
    },
    "logger": {
      "target": "lightning.pytorch.loggers.csv_logs.CSVLogger",
      "save_dir": "/mnt/c/Users/bclab/Desktop/CoVR-master/outputs/webvid-covr/blip-large/blip-l-coco/tv-False_loss-hnnce_lr-1e-05/base",
      "name": "csv/",
      "prefix": ""
    }
  },
  "test": {
    "cirr": {
      "dataname": "cirr",
      "target": "src.data.cirr.CIRRTestDataModule",
      "test": { "target": "src.test.cirr.TestCirr" },
      "batch_size": 64,
      "num_workers": 4,
      "annotation": "/mnt/c/Users/bclab/Desktop/CoVR-master/annotation/cirr/cap.rc2.test1.json",
      "img_dirs": "/mnt/c/Users/bclab/Desktop/CoVR-master/datasets//CIRR/images/test1",
      "emb_dirs": "/mnt/c/Users/bclab/Desktop/CoVR-master/datasets//CIRR/blip-embs-large/test1",
      "image_size": 384
    },
    "webvid_covr": {
      "dataname": "webvid-covr",
      "target": "src.data.webvid_covr.WebVidCoVRTestDataModule",
      "image_size": 384,
      "vid_query_method": "middle",
      "vid_frames": 1,
      "emb_pool": "query",
      "batch_size": 64,
      "num_workers": 4,
      "annotation": "/mnt/d/dataset/covr/annotation/webvid-covr/webvid8m-covr_test.csv",
      "vid_dirs": "/mnt/d/dataset/covr/datasets/WebVid/8M/train",
      "emb_dirs": "/mnt/d/dataset/covr/datasets/WebVid/8M/blip-vid-embs-large-all",
      "test": { "target": "src.test.webvid_covr.TestWebVidCoVR" }
    }
  },
  "model": {
    "modelname": "blip-large",
    "target": "src.model.blip_cir.blip_cir",
    "ckpt_path": "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_retrieval_coco.pth",
    "model": {
      "target": "src.model.blip_cir.BLIPCir",
      "med_config": "/mnt/c/Users/bclab/Desktop/CoVR-master/configs/med_config.json",
      "image_size": 384,
      "vit": "large",
      "vit_grad_ckpt": true,
      "vit_ckpt_layer": 12,
      "embed_dim": 256,
      "train_vit": false,
      "loss": {
        "target": "src.model.loss.HardNegativeNCE",
        "name": "hnnce",
        "alpha": 1,
        "beta": 0.5
      }
    },
    "optimizer": {
      "target": "torch.optim.AdamW",
      "partial": true,
      "lr": 1e-05,
      "weight_decay": 0.05
    },
    "scheduler": {
      "target": "src.tools.scheduler.CosineSchedule",
      "init_lr": 1e-05,
      "min_lr": 0,
      "decay_rate": 0.05,
      "max_epochs": 5
    },
    "loss": {
      "target": "src.model.loss.HardNegativeNCE",
      "name": "hnnce",
      "alpha": 1,
      "beta": 0.5
    },
    "ckpt": {
      "name": "blip-l-coco",
      "path": "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_retrieval_coco.pth"
    }
  }
}
```

These are my experiment settings.

lucas-ventura commented 5 months ago

Thank you for your interest in our work! Indeed, in the early stages of our project, experimenting with various batch sizes revealed some differences in performance, though the variation wasn't as pronounced as what you're observing.

I haven't tested with a batch size as small as 48, but you could try adjusting the learning rate to accommodate the smaller batch size. A common approach is to scale the learning rate linearly or with the square root of the batch-size reduction (in your case, about lr/43 or lr/6.5), and you can also experiment with a few values in between. Since contrastive learning relies on the dynamics of positive and negative samples within each batch, the impact of batch size might be more significant than in other methods. I'd be curious to see how these adjustments work for you, so please keep me posted on your results!
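A quick sketch of the two scaling rules mentioned above, assuming the reference setting from this thread (lr = 1e-5 at batch size 2048; the function name is just for illustration):

```python
def scaled_lr(base_lr: float, base_bs: int, new_bs: int,
              rule: str = "linear") -> float:
    """Scale a reference learning rate to a new batch size."""
    ratio = new_bs / base_bs
    if rule == "linear":
        return base_lr * ratio        # lr scales with batch size
    if rule == "sqrt":
        return base_lr * ratio ** 0.5  # lr scales with sqrt(batch size)
    raise ValueError(f"unknown rule: {rule!r}")

# linear: 1e-5 * 48/2048 ≈ 2.3e-7 (lr/43)
# sqrt:   1e-5 * sqrt(48/2048) ≈ 1.5e-6 (lr/6.5)
```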

Let me know if you have any other questions!