salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence
BSD 3-Clause "New" or "Revised" License

CUDA out of memory issue during validation #84

Open BhargavDodla opened 1 year ago

BhargavDodla commented 1 year ago

Dear LAVIS team,

As part of a project, we are trying to fine-tune BLIP Retrieval on a custom dataset using two RTX 3090 GPUs (24 GB each).

1) During the evaluation step in runner_base.py we get the error shown below, even with validation batch sizes as small as 2.

2) When we evaluate on a very small subset of the validation set to avoid the CUDA error, the run completes, but the number of batches stays the same regardless of the validation batch size.
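For reference on point 2, here is a minimal sketch of how the batch count would be expected to change if the evaluation loader honored the configured batch size. The dataset size of 5000 and the function name `expected_batches` are illustrative assumptions, not values or code from LAVIS. If the observed count does not follow this pattern, one possibility (not confirmed here) is that the setting being changed is not the one the evaluation loader reads.

```python
import math

def expected_batches(num_samples: int, batch_size: int) -> int:
    """Number of batches a loader yields when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)

# Hypothetical validation set of 5000 samples (assumed, for illustration only).
NUM_SAMPLES = 5000
counts = {bs: expected_batches(NUM_SAMPLES, bs) for bs in (1, 2, 8, 32)}

for bs, n in counts.items():
    print(f"batch_size={bs:>2} -> {n} batches")
```

Comparing the total shown in the evaluation progress output against these values for each batch size tried is a quick way to tell whether the batch size actually took effect.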

Thank you

Traceback (most recent call last):
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/LAVIS/train.py", line 103, in <module>
    main()
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/LAVIS/train.py", line 99, in main
    runner.train()
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/LAVIS/lavis/runners/runner_base.py", line 359, in train
    val_log = self.eval_epoch(
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/BLIP/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/LAVIS/lavis/runners/runner_base.py", line 457, in eval_epoch
    results = self.task.evaluation(model, data_loader)
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/LAVIS/lavis/tasks/retrieval.py", line 34, in evaluation
    score_i2t, score_t2i = model.compute_sim_matrix(data_loader, task_cfg=self.cfg)
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/LAVIS/lavis/models/blip_models/blip_retrieval.py", line 396, in compute_sim_matrix
    return compute_sim_matrix(model=self, data_loader=data_loader, k_test=k_test)
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/LAVIS/lavis/models/albef_models/__init__.py", line 116, in compute_sim_matrix
    for samples in data_loader:
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/LAVIS/lavis/datasets/datasets/dataloader_utils.py", line 71, in __iter__
    batch = self.next(loader_it)
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/LAVIS/lavis/datasets/datasets/dataloader_utils.py", line 106, in next
    self.preload(it)
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/LAVIS/lavis/datasets/datasets/dataloader_utils.py", line 92, in preload
    self.batch = move_to_cuda(self.batch)
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/LAVIS/lavis/datasets/data_utils.py", line 73, in move_to_cuda
    return apply_to_sample(_move_to_cuda, sample)
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/LAVIS/lavis/datasets/data_utils.py", line 66, in apply_to_sample
    return _apply(sample)
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/LAVIS/lavis/datasets/data_utils.py", line 60, in _apply
    return {key: _apply(value) for key, value in x.items()}
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/LAVIS/lavis/datasets/data_utils.py", line 60, in <dictcomp>
    return {key: _apply(value) for key, value in x.items()}
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/LAVIS/lavis/datasets/data_utils.py", line 58, in _apply
    return f(x)
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/LAVIS/lavis/datasets/data_utils.py", line 71, in _move_to_cuda
    return tensor.cuda()
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 23.70 GiB total capacity; 20.99 GiB already allocated; 10.81 MiB free; 21.98 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
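The error message suggests max_split_size_mb only when reserved memory is much larger than allocated memory. The sketch below parses the figures from the message above and computes that gap; the helper name `parse_oom` is made up for illustration. With these numbers the gap is roughly 1 GiB out of 23.70 GiB, which suggests (though it does not prove) that the GPU is genuinely full rather than fragmented.

```python
import re

# Figures copied from the OOM message in the traceback above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 23.70 GiB total capacity; "
    "20.99 GiB already allocated; 10.81 MiB free; 21.98 GiB reserved in total by PyTorch)"
)

def parse_oom(message: str) -> dict:
    """Extract total, allocated and reserved memory (in GiB) from an OOM message."""
    patterns = {
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "reserved": r"([\d.]+) GiB reserved",
    }
    return {key: float(re.search(p, message).group(1)) for key, p in patterns.items()}

stats = parse_oom(OOM_MESSAGE)
gap_gib = stats["reserved"] - stats["allocated"]  # memory cached but not in use
used_fraction = stats["allocated"] / stats["total"]

print(f"reserved - allocated = {gap_gib:.2f} GiB")
print(f"allocated / total    = {used_fraction:.1%}")
```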
dxli94 commented 1 year ago

Hi @BhargavDodla, thanks for raising the question.

Thanks.

verigle commented 1 year ago

I have a similar problem: GPU memory keeps increasing during evaluation until it runs out of memory.

Evaluation  [   0/5000]  eta: 1:57:37    time: 1.4116  data: 0.3283  max mem: 9518
Evaluation  [  10/5000]  eta: 0:14:17    time: 0.1719  data: 0.0302  max mem: 9555
Evaluation  [  20/5000]  eta: 0:09:24    time: 0.0484  data: 0.0003  max mem: 9594
Evaluation  [  30/5000]  eta: 0:07:41    time: 0.0492  data: 0.0004  max mem: 9633
Evaluation  [  40/5000]  eta: 0:06:50    time: 0.0507  data: 0.0004  max mem: 9671
Evaluation  [  50/5000]  eta: 0:06:15    time: 0.0499  data: 0.0004  max mem: 9709
Evaluation  [  60/5000]  eta: 0:05:53    time: 0.0488  data: 0.0004  max mem: 9747
Evaluation  [  70/5000]  eta: 0:05:38    time: 0.0498  data: 0.0004  max mem: 9785
Evaluation  [  80/5000]  eta: 0:05:25    time: 0.0492  data: 0.0004  max mem: 9824
Evaluation  [  90/5000]  eta: 0:05:14    time: 0.0482  data: 0.0003  max mem: 9862
Evaluation  [ 100/5000]  eta: 0:05:06    time: 0.0482  data: 0.0003  max mem: 9900
Evaluation  [ 110/5000]  eta: 0:04:59    time: 0.0483  data: 0.0003  max mem: 9938
Evaluation  [ 120/5000]  eta: 0:04:53    time: 0.0482  data: 0.0003  max mem: 9976
Evaluation  [ 130/5000]  eta: 0:04:48    time: 0.0484  data: 0.0003  max mem: 10015
Evaluation  [ 140/5000]  eta: 0:04:44    time: 0.0484  data: 0.0003  max mem: 10053
Evaluation  [ 150/5000]  eta: 0:04:40    time: 0.0484  data: 0.0003  max mem: 10091
Evaluation  [ 160/5000]  eta: 0:04:37    time: 0.0489  data: 0.0003  max mem: 10129
Evaluation  [ 170/5000]  eta: 0:04:34    time: 0.0492  data: 0.0004  max mem: 10166
Evaluation  [ 180/5000]  eta: 0:04:32    time: 0.0494  data: 0.0004  max mem: 10204
Evaluation  [ 190/5000]  eta: 0:04:29    time: 0.0493  data: 0.0004  max mem: 10242
Evaluation  [ 200/5000]  eta: 0:04:27    time: 0.0485  data: 0.0003  max mem: 10280
Evaluation  [ 210/5000]  eta: 0:04:24    time: 0.0484  data: 0.0003  max mem: 10315
Evaluation  [ 220/5000]  eta: 0:04:22    time: 0.0484  data: 0.0003  max mem: 10352
Evaluation  [ 230/5000]  eta: 0:04:21    time: 0.0497  data: 0.0004  max mem: 10388
Evaluation  [ 240/5000]  eta: 0:04:21    time: 0.0545  data: 0.0005  max mem: 10425
Evaluation  [ 250/5000]  eta: 0:04:20    time: 0.0553  data: 0.0006  max mem: 10463
Evaluation  [ 260/5000]  eta: 0:04:19    time: 0.0513  data: 0.0004  max mem: 10501
Evaluation  [ 270/5000]  eta: 0:04:17    time: 0.0494  data: 0.0004  max mem: 10537
Evaluation  [ 280/5000]  eta: 0:04:16    time: 0.0486  data: 0.0003  max mem: 10576
Evaluation  [ 290/5000]  eta: 0:04:14    time: 0.0486  data: 0.0004  max mem: 10613
Evaluation  [ 300/5000]  eta: 0:04:13    time: 0.0493  data: 0.0004  max mem: 10650
Evaluation  [ 310/5000]  eta: 0:04:12    time: 0.0499  data: 0.0004  max mem: 10687
Evaluation  [ 320/5000]  eta: 0:04:12    time: 0.0539  data: 0.0004  max mem: 10724
Evaluation  [ 330/5000]  eta: 0:04:10    time: 0.0517  data: 0.0005  max mem: 10761
Evaluation  [ 340/5000]  eta: 0:04:09    time: 0.0475  data: 0.0004  max mem: 10798
Evaluation  [ 350/5000]  eta: 0:04:08    time: 0.0515  data: 0.0004  max mem: 10835
Evaluation  [ 360/5000]  eta: 0:04:07    time: 0.0510  data: 0.0004  max mem: 10873
Evaluation  [ 370/5000]  eta: 0:04:06    time: 0.0487  data: 0.0003  max mem: 10910
Evaluation  [ 380/5000]  eta: 0:04:05    time: 0.0487  data: 0.0003  max mem: 10947
Evaluation  [ 390/5000]  eta: 0:04:05    time: 0.0518  data: 0.0004  max mem: 10985
Evaluation  [ 400/5000]  eta: 0:04:04    time: 0.0518  data: 0.0004  max mem: 11022
Evaluation  [ 410/5000]  eta: 0:04:03    time: 0.0487  data: 0.0003  max mem: 11060
Evaluation  [ 420/5000]  eta: 0:04:02    time: 0.0488  data: 0.0003  max mem: 11097
Evaluation  [ 430/5000]  eta: 0:04:01    time: 0.0488  data: 0.0003  max mem: 11134
Evaluation  [ 440/5000]  eta: 0:04:00    time: 0.0487  data: 0.0003  max mem: 11172
Evaluation  [ 450/5000]  eta: 0:03:59    time: 0.0507  data: 0.0004  max mem: 11209
Evaluation  [ 460/5000]  eta: 0:03:58    time: 0.0506  data: 0.0004  max mem: 11247
Evaluation  [ 470/5000]  eta: 0:03:58    time: 0.0495  data: 0.0004  max mem: 11284
Evaluation  [ 480/5000]  eta: 0:03:57    time: 0.0495  data: 0.0004  max mem: 11322
Evaluation  [ 490/5000]  eta: 0:03:56    time: 0.0486  data: 0.0003  max mem: 11359
Evaluation  [ 500/5000]  eta: 0:03:55    time: 0.0485  data: 0.0003  max mem: 11397
Evaluation  [ 510/5000]  eta: 0:03:54    time: 0.0486  data: 0.0003  max mem: 11434
Evaluation  [ 520/5000]  eta: 0:03:53    time: 0.0486  data: 0.0003  max mem: 11471
Evaluation  [ 530/5000]  eta: 0:03:52    time: 0.0486  data: 0.0003  max mem: 11509
Evaluation  [ 540/5000]  eta: 0:03:52    time: 0.0487  data: 0.0003  max mem: 11546
Evaluation  [ 550/5000]  eta: 0:03:51    time: 0.0485  data: 0.0003  max mem: 11584
Evaluation  [ 560/5000]  eta: 0:03:50    time: 0.0486  data: 0.0003  max mem: 11621
Evaluation  [ 570/5000]  eta: 0:03:50    time: 0.0526  data: 0.0004  max mem: 11659
Evaluation  [ 580/5000]  eta: 0:03:49    time: 0.0525  data: 0.0004  max mem: 11696
Evaluation  [ 590/5000]  eta: 0:03:48    time: 0.0488  data: 0.0003  max mem: 11734
Evaluation  [ 600/5000]  eta: 0:03:48    time: 0.0489  data: 0.0003  max mem: 11771
Evaluation  [ 610/5000]  eta: 0:03:47    time: 0.0487  data: 0.0003  max mem: 11809
Evaluation  [ 620/5000]  eta: 0:03:46    time: 0.0488  data: 0.0003  max mem: 11846
Evaluation  [ 630/5000]  eta: 0:03:45    time: 0.0487  data: 0.0003  max mem: 11883
Evaluation  [ 640/5000]  eta: 0:03:45    time: 0.0487  data: 0.0003  max mem: 11921
Evaluation  [ 650/5000]  eta: 0:03:44    time: 0.0489  data: 0.0004  max mem: 11958
Evaluation  [ 660/5000]  eta: 0:03:43    time: 0.0488  data: 0.0003  max mem: 11996
Evaluation  [ 670/5000]  eta: 0:03:43    time: 0.0487  data: 0.0003  max mem: 12033
Evaluation  [ 680/5000]  eta: 0:03:42    time: 0.0488  data: 0.0003  max mem: 12071
Evaluation  [ 690/5000]  eta: 0:03:41    time: 0.0488  data: 0.0003  max mem: 12108
Evaluation  [ 700/5000]  eta: 0:03:41    time: 0.0488  data: 0.0003  max mem: 12146
Evaluation  [ 710/5000]  eta: 0:03:40    time: 0.0490  data: 0.0003  max mem: 12183
Evaluation  [ 720/5000]  eta: 0:03:39    time: 0.0490  data: 0.0003  max mem: 12220
Evaluation  [ 730/5000]  eta: 0:03:39    time: 0.0488  data: 0.0003  max mem: 12258
Evaluation  [ 740/5000]  eta: 0:03:38    time: 0.0489  data: 0.0003  max mem: 12295
Evaluation  [ 750/5000]  eta: 0:03:37    time: 0.0491  data: 0.0003  max mem: 12333
Evaluation  [ 760/5000]  eta: 0:03:37    time: 0.0488  data: 0.0003  max mem: 12370
Evaluation  [ 770/5000]  eta: 0:03:36    time: 0.0488  data: 0.0003  max mem: 12408
Evaluation  [ 780/5000]  eta: 0:03:35    time: 0.0490  data: 0.0004  max mem: 12445
Evaluation  [ 790/5000]  eta: 0:03:35    time: 0.0489  data: 0.0003  max mem: 12483
Evaluation  [ 800/5000]  eta: 0:03:34    time: 0.0487  data: 0.0003  max mem: 12520
Evaluation  [ 810/5000]  eta: 0:03:33    time: 0.0488  data: 0.0003  max mem: 12558
Evaluation  [ 820/5000]  eta: 0:03:33    time: 0.0488  data: 0.0003  max mem: 12595
Evaluation  [ 830/5000]  eta: 0:03:32    time: 0.0488  data: 0.0003  max mem: 12632
Evaluation  [ 840/5000]  eta: 0:03:32    time: 0.0489  data: 0.0003  max mem: 12670
Evaluation  [ 850/5000]  eta: 0:03:31    time: 0.0490  data: 0.0004  max mem: 12707
Evaluation  [ 860/5000]  eta: 0:03:30    time: 0.0490  data: 0.0004  max mem: 12745
Evaluation  [ 870/5000]  eta: 0:03:30    time: 0.0489  data: 0.0004  max mem: 12782
Evaluation  [ 880/5000]  eta: 0:03:29    time: 0.0489  data: 0.0004  max mem: 12820
Evaluation  [ 890/5000]  eta: 0:03:29    time: 0.0488  data: 0.0003  max mem: 12857
Evaluation  [ 900/5000]  eta: 0:03:28    time: 0.0508  data: 0.0004  max mem: 12895
Evaluation  [ 910/5000]  eta: 0:03:28    time: 0.0508  data: 0.0004  max mem: 12932
Evaluation  [ 920/5000]  eta: 0:03:27    time: 0.0488  data: 0.0003  max mem: 12969
Evaluation  [ 930/5000]  eta: 0:03:26    time: 0.0489  data: 0.0003  max mem: 13007
Evaluation  [ 940/5000]  eta: 0:03:26    time: 0.0527  data: 0.0004  max mem: 13044
Evaluation  [ 950/5000]  eta: 0:03:26    time: 0.0527  data: 0.0004  max mem: 13082
Evaluation  [ 960/5000]  eta: 0:03:25    time: 0.0490  data: 0.0003  max mem: 13119
Evaluation  [ 970/5000]  eta: 0:03:24    time: 0.0489  data: 0.0003  max mem: 13157
Evaluation  [ 980/5000]  eta: 0:03:24    time: 0.0488  data: 0.0003  max mem: 13194
Evaluation  [ 990/5000]  eta: 0:03:23    time: 0.0489  data: 0.0003  max mem: 13232
Evaluation  [1000/5000]  eta: 0:03:23    time: 0.0489  data: 0.0003  max mem: 13269
Evaluation  [1010/5000]  eta: 0:03:22    time: 0.0490  data: 0.0003  max mem: 13307
Evaluation  [1020/5000]  eta: 0:03:21    time: 0.0490  data: 0.0003  max mem: 13344
Evaluation  [1030/5000]  eta: 0:03:21    time: 0.0492  data: 0.0003  max mem: 13381
Evaluation  [1040/5000]  eta: 0:03:20    time: 0.0492  data: 0.0003  max mem: 13419
Evaluation  [1050/5000]  eta: 0:03:20    time: 0.0489  data: 0.0003  max mem: 13456
Evaluation  [1060/5000]  eta: 0:03:19    time: 0.0488  data: 0.0003  max mem: 13494
Evaluation  [1070/5000]  eta: 0:03:19    time: 0.0489  data: 0.0003  max mem: 13531
Evaluation  [1080/5000]  eta: 0:03:18    time: 0.0509  data: 0.0004  max mem: 13569
Evaluation  [1090/5000]  eta: 0:03:18    time: 0.0508  data: 0.0004  max mem: 13606
Evaluation  [1100/5000]  eta: 0:03:17    time: 0.0489  data: 0.0003  max mem: 13644
Evaluation  [1110/5000]  eta: 0:03:16    time: 0.0489  data: 0.0003  max mem: 13681
Evaluation  [1120/5000]  eta: 0:03:16    time: 0.0489  data: 0.0003  max mem: 13718
Evaluation  [1130/5000]  eta: 0:03:15    time: 0.0489  data: 0.0003  max mem: 13756
Evaluation  [1140/5000]  eta: 0:03:15    time: 0.0489  data: 0.0003  max mem: 13793
Evaluation  [1150/5000]  eta: 0:03:14    time: 0.0493  data: 0.0003  max mem: 13831
Evaluation  [1160/5000]  eta: 0:03:14    time: 0.0488  data: 0.0004  max mem: 13868
Evaluation  [1170/5000]  eta: 0:03:13    time: 0.0515  data: 0.0005  max mem: 13906
Evaluation  [1180/5000]  eta: 0:03:13    time: 0.0511  data: 0.0005  max mem: 13943
Evaluation  [1190/5000]  eta: 0:03:12    time: 0.0483  data: 0.0003  max mem: 13981
Evaluation  [1200/5000]  eta: 0:03:12    time: 0.0494  data: 0.0003  max mem: 14018
Evaluation  [1210/5000]  eta: 0:03:11    time: 0.0538  data: 0.0004  max mem: 14056
Evaluation  [1220/5000]  eta: 0:03:11    time: 0.0538  data: 0.0004  max mem: 14093
Evaluation  [1230/5000]  eta: 0:03:10    time: 0.0495  data: 0.0004  max mem: 14130
Evaluation  [1240/5000]  eta: 0:03:10    time: 0.0497  data: 0.0004  max mem: 14168
Evaluation  [1250/5000]  eta: 0:03:09    time: 0.0497  data: 0.0004  max mem: 14205
Evaluation  [1260/5000]  eta: 0:03:09    time: 0.0496  data: 0.0004  max mem: 14243
Evaluation  [1270/5000]  eta: 0:03:08    time: 0.0499  data: 0.0004  max mem: 14280
Evaluation  [1280/5000]  eta: 0:03:08    time: 0.0505  data: 0.0004  max mem: 14318
Evaluation  [1290/5000]  eta: 0:03:07    time: 0.0506  data: 0.0004  max mem: 14355
Evaluation  [1300/5000]  eta: 0:03:07    time: 0.0538  data: 0.0004  max mem: 14393
Evaluation  [1310/5000]  eta: 0:03:06    time: 0.0531  data: 0.0004  max mem: 14430
Evaluation  [1320/5000]  eta: 0:03:06    time: 0.0493  data: 0.0004  max mem: 14467
Evaluation  [1330/5000]  eta: 0:03:05    time: 0.0495  data: 0.0004  max mem: 14505
Evaluation  [1340/5000]  eta: 0:03:05    time: 0.0495  data: 0.0003  max mem: 14542
Evaluation  [1350/5000]  eta: 0:03:04    time: 0.0495  data: 0.0003  max mem: 14580
Evaluation  [1360/5000]  eta: 0:03:04    time: 0.0495  data: 0.0004  max mem: 14617
Evaluation  [1370/5000]  eta: 0:03:03    time: 0.0495  data: 0.0004  max mem: 14655
Evaluation  [1380/5000]  eta: 0:03:02    time: 0.0495  data: 0.0003  max mem: 14692
Evaluation  [1390/5000]  eta: 0:03:02    time: 0.0495  data: 0.0003  max mem: 14730
Evaluation  [1400/5000]  eta: 0:03:01    time: 0.0494  data: 0.0003  max mem: 14767
Evaluation  [1410/5000]  eta: 0:03:01    time: 0.0494  data: 0.0003  max mem: 14805
Evaluation  [1420/5000]  eta: 0:03:00    time: 0.0494  data: 0.0003  max mem: 14842
Evaluation  [1430/5000]  eta: 0:03:00    time: 0.0494  data: 0.0003  max mem: 14879
Evaluation  [1440/5000]  eta: 0:02:59    time: 0.0494  data: 0.0003  max mem: 14917
Evaluation  [1450/5000]  eta: 0:02:59    time: 0.0506  data: 0.0004  max mem: 14954
Evaluation  [1460/5000]  eta: 0:02:58    time: 0.0506  data: 0.0004  max mem: 14992
Evaluation  [1470/5000]  eta: 0:02:58    time: 0.0496  data: 0.0004  max mem: 15029
Evaluation  [1480/5000]  eta: 0:02:57    time: 0.0495  data: 0.0004  max mem: 15067
Evaluation  [1490/5000]  eta: 0:02:57    time: 0.0521  data: 0.0004  max mem: 15104
Evaluation  [1500/5000]  eta: 0:02:56    time: 0.0521  data: 0.0004  max mem: 15142
Evaluation  [1510/5000]  eta: 0:02:56    time: 0.0496  data: 0.0004  max mem: 15179
Evaluation  [1520/5000]  eta: 0:02:55    time: 0.0495  data: 0.0004  max mem: 15216
Evaluation  [1530/5000]  eta: 0:02:55    time: 0.0495  data: 0.0004  max mem: 15254
Evaluation  [1540/5000]  eta: 0:02:54    time: 0.0495  data: 0.0004  max mem: 15291
Evaluation  [1550/5000]  eta: 0:02:54    time: 0.0495  data: 0.0004  max mem: 15329
Evaluation  [1560/5000]  eta: 0:02:53    time: 0.0495  data: 0.0003  max mem: 15366
Evaluation  [1570/5000]  eta: 0:02:53    time: 0.0495  data: 0.0003  max mem: 15404
Evaluation  [1580/5000]  eta: 0:02:52    time: 0.0495  data: 0.0004  max mem: 15441
Evaluation  [1590/5000]  eta: 0:02:52    time: 0.0495  data: 0.0004  max mem: 15479
Evaluation  [1600/5000]  eta: 0:02:51    time: 0.0495  data: 0.0003  max mem: 15516
Evaluation  [1610/5000]  eta: 0:02:50    time: 0.0495  data: 0.0003  max mem: 15554
Evaluation  [1620/5000]  eta: 0:02:50    time: 0.0494  data: 0.0003  max mem: 15591
Evaluation  [1630/5000]  eta: 0:02:49    time: 0.0494  data: 0.0003  max mem: 15628
Evaluation  [1640/5000]  eta: 0:02:49    time: 0.0494  data: 0.0003  max mem: 15666
Evaluation  [1650/5000]  eta: 0:02:48    time: 0.0494  data: 0.0003  max mem: 15703
Evaluation  [1660/5000]  eta: 0:02:48    time: 0.0497  data: 0.0004  max mem: 15741
Evaluation  [1670/5000]  eta: 0:02:47    time: 0.0526  data: 0.0004  max mem: 15778
Evaluation  [1680/5000]  eta: 0:02:47    time: 0.0546  data: 0.0005  max mem: 15816
Evaluation  [1690/5000]  eta: 0:02:47    time: 0.0517  data: 0.0004  max mem: 15853
Evaluation  [1700/5000]  eta: 0:02:46    time: 0.0496  data: 0.0003  max mem: 15891
Evaluation  [1710/5000]  eta: 0:02:45    time: 0.0495  data: 0.0003  max mem: 15928
Evaluation  [1720/5000]  eta: 0:02:45    time: 0.0513  data: 0.0004  max mem: 15965
Evaluation  [1730/5000]  eta: 0:02:44    time: 0.0511  data: 0.0004  max mem: 16003
Evaluation  [1740/5000]  eta: 0:02:44    time: 0.0496  data: 0.0004  max mem: 16040
Evaluation  [1750/5000]  eta: 0:02:43    time: 0.0498  data: 0.0004  max mem: 16078
Evaluation  [1760/5000]  eta: 0:02:43    time: 0.0499  data: 0.0004  max mem: 16115
Evaluation  [1770/5000]  eta: 0:02:42    time: 0.0498  data: 0.0004  max mem: 16153
Evaluation  [1780/5000]  eta: 0:02:42    time: 0.0498  data: 0.0003  max mem: 16190
Evaluation  [1790/5000]  eta: 0:02:41    time: 0.0497  data: 0.0003  max mem: 16228
Evaluation  [1800/5000]  eta: 0:02:41    time: 0.0497  data: 0.0003  max mem: 16265
Evaluation  [1810/5000]  eta: 0:02:40    time: 0.0497  data: 0.0004  max mem: 16303
Evaluation  [1820/5000]  eta: 0:02:40    time: 0.0496  data: 0.0004  max mem: 16340
Evaluation  [1830/5000]  eta: 0:02:39    time: 0.0497  data: 0.0004  max mem: 16377
Evaluation  [1840/5000]  eta: 0:02:39    time: 0.0496  data: 0.0003  max mem: 16415
Evaluation  [1850/5000]  eta: 0:02:38    time: 0.0497  data: 0.0004  max mem: 16452
Evaluation  [1860/5000]  eta: 0:02:38    time: 0.0497  data: 0.0004  max mem: 16490
Evaluation  [1870/5000]  eta: 0:02:37    time: 0.0497  data: 0.0003  max mem: 16527
Evaluation  [1880/5000]  eta: 0:02:37    time: 0.0497  data: 0.0003  max mem: 16565
Evaluation  [1890/5000]  eta: 0:02:36    time: 0.0497  data: 0.0004  max mem: 16602
Evaluation  [1900/5000]  eta: 0:02:36    time: 0.0497  data: 0.0004  max mem: 16640
Evaluation  [1910/5000]  eta: 0:02:35    time: 0.0497  data: 0.0003  max mem: 16677
Evaluation  [1920/5000]  eta: 0:02:35    time: 0.0496  data: 0.0003  max mem: 16714
Evaluation  [1930/5000]  eta: 0:02:34    time: 0.0497  data: 0.0003  max mem: 16752
Evaluation  [1940/5000]  eta: 0:02:34    time: 0.0497  data: 0.0003  max mem: 16789
Evaluation  [1950/5000]  eta: 0:02:33    time: 0.0498  data: 0.0004  max mem: 16827
Evaluation  [1960/5000]  eta: 0:02:33    time: 0.0497  data: 0.0004  max mem: 16864
Evaluation  [1970/5000]  eta: 0:02:32    time: 0.0497  data: 0.0003  max mem: 16902
Evaluation  [1980/5000]  eta: 0:02:32    time: 0.0497  data: 0.0003  max mem: 16939
Evaluation  [1990/5000]  eta: 0:02:31    time: 0.0497  data: 0.0004  max mem: 16977
Evaluation  [2000/5000]  eta: 0:02:31    time: 0.0497  data: 0.0003  max mem: 17014
Evaluation  [2010/5000]  eta: 0:02:30    time: 0.0518  data: 0.0003  max mem: 17052
Evaluation  [2020/5000]  eta: 0:02:30    time: 0.0517  data: 0.0003  max mem: 17089
Evaluation  [2030/5000]  eta: 0:02:29    time: 0.0497  data: 0.0003  max mem: 17126
Evaluation  [2040/5000]  eta: 0:02:29    time: 0.0498  data: 0.0003  max mem: 17164
Evaluation  [2050/5000]  eta: 0:02:28    time: 0.0498  data: 0.0003  max mem: 17201
Evaluation  [2060/5000]  eta: 0:02:28    time: 0.0498  data: 0.0003  max mem: 17239
Evaluation  [2070/5000]  eta: 0:02:27    time: 0.0497  data: 0.0003  max mem: 17276
Evaluation  [2080/5000]  eta: 0:02:27    time: 0.0497  data: 0.0003  max mem: 17314
Evaluation  [2090/5000]  eta: 0:02:26    time: 0.0497  data: 0.0003  max mem: 17351
Evaluation  [2100/5000]  eta: 0:02:25    time: 0.0497  data: 0.0004  max mem: 17389
Evaluation  [2110/5000]  eta: 0:02:25    time: 0.0497  data: 0.0004  max mem: 17426
Evaluation  [2120/5000]  eta: 0:02:25    time: 0.0540  data: 0.0004  max mem: 17463
Evaluation  [2130/5000]  eta: 0:02:24    time: 0.0539  data: 0.0004  max mem: 17501
Evaluation  [2140/5000]  eta: 0:02:24    time: 0.0497  data: 0.0004  max mem: 17538
Evaluation  [2150/5000]  eta: 0:02:23    time: 0.0496  data: 0.0004  max mem: 17576
Evaluation  [2160/5000]  eta: 0:02:23    time: 0.0509  data: 0.0004  max mem: 17613
Evaluation  [2170/5000]  eta: 0:02:22    time: 0.0511  data: 0.0004  max mem: 17651
Evaluation  [2180/5000]  eta: 0:02:22    time: 0.0498  data: 0.0004  max mem: 17688
Evaluation  [2190/5000]  eta: 0:02:21    time: 0.0519  data: 0.0004  max mem: 17726
Evaluation  [2200/5000]  eta: 0:02:21    time: 0.0526  data: 0.0004  max mem: 17763
Evaluation  [2210/5000]  eta: 0:02:20    time: 0.0526  data: 0.0004  max mem: 17801
Evaluation  [2220/5000]  eta: 0:02:20    time: 0.0517  data: 0.0004  max mem: 17838
Evaluation  [2230/5000]  eta: 0:02:19    time: 0.0497  data: 0.0003  max mem: 17875
Evaluation  [2240/5000]  eta: 0:02:19    time: 0.0514  data: 0.0004  max mem: 17913
Evaluation  [2250/5000]  eta: 0:02:18    time: 0.0511  data: 0.0004  max mem: 17950
Evaluation  [2260/5000]  eta: 0:02:18    time: 0.0498  data: 0.0003  max mem: 17988
Evaluation  [2270/5000]  eta: 0:02:17    time: 0.0554  data: 0.0005  max mem: 18025
Evaluation  [2280/5000]  eta: 0:02:17    time: 0.0551  data: 0.0005  max mem: 18063
Evaluation  [2290/5000]  eta: 0:02:16    time: 0.0526  data: 0.0003  max mem: 18100
Evaluation  [2300/5000]  eta: 0:02:16    time: 0.0512  data: 0.0003  max mem: 18138
Evaluation  [2310/5000]  eta: 0:02:15    time: 0.0481  data: 0.0004  max mem: 18175
Evaluation  [2320/5000]  eta: 0:02:15    time: 0.0498  data: 0.0004  max mem: 18212
Evaluation  [2330/5000]  eta: 0:02:14    time: 0.0499  data: 0.0003  max mem: 18250
Evaluation  [2340/5000]  eta: 0:02:14    time: 0.0498  data: 0.0004  max mem: 18287
Evaluation  [2350/5000]  eta: 0:02:13    time: 0.0512  data: 0.0004  max mem: 18325
Evaluation  [2360/5000]  eta: 0:02:13    time: 0.0511  data: 0.0004  max mem: 18362
Evaluation  [2370/5000]  eta: 0:02:12    time: 0.0499  data: 0.0004  max mem: 18400
Evaluation  [2380/5000]  eta: 0:02:12    time: 0.0499  data: 0.0004  max mem: 18437
Evaluation  [2390/5000]  eta: 0:02:11    time: 0.0533  data: 0.0004  max mem: 18475
Evaluation  [2400/5000]  eta: 0:02:11    time: 0.0517  data: 0.0004  max mem: 18512
Evaluation  [2410/5000]  eta: 0:02:10    time: 0.0489  data: 0.0003  max mem: 18550
Evaluation  [2420/5000]  eta: 0:02:10    time: 0.0505  data: 0.0003  max mem: 18587
Evaluation  [2430/5000]  eta: 0:02:09    time: 0.0499  data: 0.0003  max mem: 18624
Evaluation  [2440/5000]  eta: 0:02:09    time: 0.0499  data: 0.0004  max mem: 18662
Evaluation  [2450/5000]  eta: 0:02:08    time: 0.0498  data: 0.0004  max mem: 18699
Evaluation  [2460/5000]  eta: 0:02:08    time: 0.0499  data: 0.0004  max mem: 18737
Evaluation  [2470/5000]  eta: 0:02:07    time: 0.0534  data: 0.0004  max mem: 18774
Evaluation  [2480/5000]  eta: 0:02:07    time: 0.0517  data: 0.0004  max mem: 18812
Evaluation  [2490/5000]  eta: 0:02:06    time: 0.0482  data: 0.0004  max mem: 18849
Evaluation  [2500/5000]  eta: 0:02:06    time: 0.0499  data: 0.0004  max mem: 18887
Evaluation  [2510/5000]  eta: 0:02:05    time: 0.0498  data: 0.0004  max mem: 18924
Evaluation  [2520/5000]  eta: 0:02:05    time: 0.0510  data: 0.0004  max mem: 18961
Evaluation  [2530/5000]  eta: 0:02:04    time: 0.0509  data: 0.0004  max mem: 18999
Evaluation  [2540/5000]  eta: 0:02:04    time: 0.0499  data: 0.0004  max mem: 19036
Evaluation  [2550/5000]  eta: 0:02:03    time: 0.0499  data: 0.0004  max mem: 19074
Evaluation  [2560/5000]  eta: 0:02:03    time: 0.0498  data: 0.0004  max mem: 19111
Evaluation  [2570/5000]  eta: 0:02:02    time: 0.0499  data: 0.0004  max mem: 19149
Evaluation  [2580/5000]  eta: 0:02:02    time: 0.0512  data: 0.0004  max mem: 19186
Evaluation  [2590/5000]  eta: 0:02:01    time: 0.0512  data: 0.0004  max mem: 19224
Evaluation  [2600/5000]  eta: 0:02:01    time: 0.0498  data: 0.0004  max mem: 19261
Evaluation  [2610/5000]  eta: 0:02:00    time: 0.0498  data: 0.0004  max mem: 19299
Evaluation  [2620/5000]  eta: 0:02:00    time: 0.0499  data: 0.0004  max mem: 19336
Evaluation  [2630/5000]  eta: 0:01:59    time: 0.0527  data: 0.0004  max mem: 19373
Evaluation  [2640/5000]  eta: 0:01:59    time: 0.0504  data: 0.0004  max mem: 19411
Evaluation  [2650/5000]  eta: 0:01:58    time: 0.0480  data: 0.0004  max mem: 19448
Evaluation  [2660/5000]  eta: 0:01:58    time: 0.0501  data: 0.0003  max mem: 19486
Evaluation  [2670/5000]  eta: 0:01:57    time: 0.0512  data: 0.0004  max mem: 19523
Evaluation  [2680/5000]  eta: 0:01:57    time: 0.0511  data: 0.0004  max mem: 19561
Evaluation  [2690/5000]  eta: 0:01:56    time: 0.0539  data: 0.0004  max mem: 19598
Evaluation  [2700/5000]  eta: 0:01:56    time: 0.0539  data: 0.0005  max mem: 19636
Evaluation  [2710/5000]  eta: 0:01:55    time: 0.0497  data: 0.0004  max mem: 19673
Evaluation  [2720/5000]  eta: 0:01:55    time: 0.0498  data: 0.0003  max mem: 19710
Evaluation  [2730/5000]  eta: 0:01:54    time: 0.0500  data: 0.0004  max mem: 19748
Evaluation  [2740/5000]  eta: 0:01:54    time: 0.0499  data: 0.0004  max mem: 19785
Evaluation  [2750/5000]  eta: 0:01:53    time: 0.0506  data: 0.0004  max mem: 19823
Evaluation  [2760/5000]  eta: 0:01:53    time: 0.0505  data: 0.0003  max mem: 19860
Evaluation  [2770/5000]  eta: 0:01:52    time: 0.0498  data: 0.0004  max mem: 19898
Evaluation  [2780/5000]  eta: 0:01:52    time: 0.0499  data: 0.0004  max mem: 19935
Evaluation  [2790/5000]  eta: 0:01:51    time: 0.0499  data: 0.0004  max mem: 19973
Evaluation  [2800/5000]  eta: 0:01:51    time: 0.0500  data: 0.0004  max mem: 20010
Evaluation  [2810/5000]  eta: 0:01:50    time: 0.0560  data: 0.0004  max mem: 20048
Evaluation  [2820/5000]  eta: 0:01:50    time: 0.0577  data: 0.0005  max mem: 20085
Evaluation  [2830/5000]  eta: 0:01:49    time: 0.0517  data: 0.0004  max mem: 20122
Evaluation  [2840/5000]  eta: 0:01:49    time: 0.0499  data: 0.0004  max mem: 20160
Evaluation  [2850/5000]  eta: 0:01:48    time: 0.0498  data: 0.0004  max mem: 20197
Evaluation  [2860/5000]  eta: 0:01:48    time: 0.0498  data: 0.0004  max mem: 20235
Evaluation  [2870/5000]  eta: 0:01:47    time: 0.0510  data: 0.0003  max mem: 20272
Evaluation  [2880/5000]  eta: 0:01:47    time: 0.0523  data: 0.0004  max mem: 20310
Evaluation  [2890/5000]  eta: 0:01:46    time: 0.0505  data: 0.0004  max mem: 20347
Evaluation  [2900/5000]  eta: 0:01:46    time: 0.0492  data: 0.0004  max mem: 20384
Evaluation  [2910/5000]  eta: 0:01:45    time: 0.0499  data: 0.0004  max mem: 20422
Evaluation  [2920/5000]  eta: 0:01:45    time: 0.0499  data: 0.0004  max mem: 20459
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 0; 23.67 GiB total capacity; 19.97 GiB already allocated; 2.75 MiB free; 22.22 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 465518) of binary: /home/verigle/miniconda3/envs/lavis/bin/python
Traceback (most recent call last):
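The growth in the log above can be quantified with a short sketch. It parses three lines copied from that log and fits a straight line between the first and last; the "max mem" unit is presumably MB, which is an assumption. The names `parse`, `growth_per_step` and `projected_peak` are illustrative only.

```python
import re

# Three lines copied verbatim from the evaluation log above.
LOG_LINES = [
    "Evaluation  [   0/5000]  eta: 1:57:37    time: 1.4116  data: 0.3283  max mem: 9518",
    "Evaluation  [1000/5000]  eta: 0:03:23    time: 0.0489  data: 0.0003  max mem: 13269",
    "Evaluation  [2920/5000]  eta: 0:01:45    time: 0.0499  data: 0.0004  max mem: 20459",
]

PATTERN = re.compile(r"\[\s*(\d+)/(\d+)\].*max mem: (\d+)")

def parse(line: str) -> tuple:
    """Return (step, total_steps, max_mem) from one progress line."""
    step, total, mem = PATTERN.search(line).groups()
    return int(step), int(total), int(mem)

points = [parse(line) for line in LOG_LINES]
first_step, total_steps, first_mem = points[0]
last_step, _, last_mem = points[-1]

# Average growth per evaluation step, and a linear projection to the final step.
growth_per_step = (last_mem - first_mem) / (last_step - first_step)
projected_peak = first_mem + growth_per_step * total_steps

print(f"growth per step: {growth_per_step:.2f}")
print(f"projected peak at step {total_steps}: {projected_peak:.0f}")
```

Under these assumptions memory grows by about 3.75 per step and would need roughly 28,000 by the last step, more than a 24 GB card offers. Steady linear growth like this is consistent with something being retained on the GPU for every batch, although the log alone does not show what.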
MeinhardMark commented 10 months ago

Same situation here. Not only during evaluation, but also during the training process. Did you solve your problem?