tjddus9597 / Proxy-Anchor-CVPR2020

Official PyTorch Implementation of Proxy Anchor Loss for Deep Metric Learning, CVPR 2020
MIT License

CUDA out of memory #14

Open ShuteLee opened 3 years ago

ShuteLee commented 3 years ago

python train.py --gpu-id 0 --loss Proxy_Anchor --model resnet50 --embedding-size 512 --batch-size 180 --lr 6e-4 --dataset SOP --warm 1 --bn-freeze 0 --lr-decay-step 20 --lr-decay-gamma 0.25

wandb: Currently logged in as: shute (use wandb login --relogin to force relogin)
wandb: wandb version 0.10.10 is available! To upgrade, please run:
wandb: $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.10.8
wandb: Syncing run eager-dream-16
wandb: ⭐️ View project at https://wandb.ai/shute/SOP_ProxyAnchor
wandb: 🚀 View run at https://wandb.ai/shute/SOP_ProxyAnchor/runs/2ca7m47a
wandb: Run data is saved locally in wandb/run-20201113_090754-2ca7m47a
wandb: Run wandb off to turn off syncing.

Random Sampling
Training parameters: {'LOG_DIR': '../logs', 'dataset': 'SOP', 'sz_embedding': 512, 'sz_batch': 180, 'nb_epochs': 60, 'gpu_id': 0, 'nb_workers': 4, 'model': 'resnet50', 'loss': 'Proxy_Anchor', 'optimizer': 'adamw', 'lr': 0.0006, 'weight_decay': 0.0001, 'lr_decay_step': 20, 'lr_decay_gamma': 0.25, 'alpha': 32, 'mrg': 0.1, 'IPC': None, 'warm': 1, 'bn_freeze': 0, 'l2_norm': 1, 'remark': ''}
Training for 60 epochs.
0it [00:00, ?it/s]/home/server8/lst/Proxy-Anchor-CVPR2020-master/code/losses.py:48: UserWarning: This overload of nonzero is deprecated:
    nonzero(Tensor input, *, Tensor out)
Consider using one of the following signatures instead:
    nonzero(Tensor input, *, bool as_tuple) (Triggered internally at /opt/conda/conda-bld/pytorch_1595629411241/work/torch/csrc/utils/python_arg_parser.cpp:766.)
  with_pos_proxies = torch.nonzero(P_one_hot.sum(dim = 0) != 0).squeeze(dim = 1)  # The set of positive proxies of data in the batch
Train Epoch: 0 [330/330 (100%)] Loss: 10.849229: : 330it [01:34, 3.49it/s]
Evaluating...
100%|██████████| 337/337 [01:25<00:00, 3.95it/s]
R@1 : 51.770
R@10 : 67.938
R@100 : 81.594
R@1000 : 92.909
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "train.py", line 290, in <module>
    m = model(x.squeeze().cuda())
  File "/home/server8/anaconda3/envs/proj/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/server8/lst/Proxy-Anchor-CVPR2020-master/code/net/resnet.py", line 175, in forward
    x = self.model.layer1(x)
  File "/home/server8/anaconda3/envs/proj/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/server8/anaconda3/envs/proj/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
    input = module(input)
  File "/home/server8/anaconda3/envs/proj/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/server8/anaconda3/envs/proj/lib/python3.8/site-packages/torchvision/models/resnet.py", line 112, in forward
    out = self.conv3(out)
  File "/home/server8/anaconda3/envs/proj/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/server8/anaconda3/envs/proj/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 419, in forward
    return self._conv_forward(input, self.weight)
  File "/home/server8/anaconda3/envs/proj/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 415, in _conv_forward
    return F.conv2d(input, weight, self.bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 552.00 MiB (GPU 0; 7.80 GiB total capacity; 6.09 GiB already allocated; 392.69 MiB free; 6.44 GiB reserved in total by PyTorch)

wandb: Waiting for W&B process to finish, PID 4172153
wandb: Program failed with code 1. Press ctrl-c to abort syncing.
wandb:
wandb: Find user logs for this run at: wandb/run-20201113_090754-2ca7m47a/logs/debug.log
wandb: Find internal logs for this run at: wandb/run-20201113_090754-2ca7m47a/logs/debug-internal.log
wandb: Run summary:
wandb:   loss 12.3987
wandb:   R@1 0.5177
wandb:   R@10 0.67938
wandb:   R@100 0.81594
wandb:   R@1000 0.92909
wandb:   _step 0
wandb:   _runtime 193
wandb:   _timestamp 1605276667
wandb: Run history:
wandb:   loss ▁
wandb:   R@1 ▁
wandb:   R@10 ▁
wandb:   R@100 ▁
wandb:   R@1000 ▁
wandb:   _step ▁
wandb:   _runtime ▁
wandb:   _timestamp ▁
wandb:
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb:
wandb: Synced eager-dream-16: https://wandb.ai/shute/SOP_ProxyAnchor/runs/2ca7m47a

CUDA out of memory occurs in the second epoch. I have tried batch sizes of 30, 100, 150, and 180; nothing helps.
Environment: PyTorch 1.6, CUDA 10.1, GPU: RTX 2080 Super (8 GB).

I have spent many hours on this but still cannot solve it.

Many thanks for your help.
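The traceback above shows the allocation failing right after the evaluation pass, at the start of the next training epoch. A general PyTorch mitigation (separate from the repository-specific fix discussed below) is to make sure the evaluation pass builds no autograd graph and to release cached GPU memory before training resumes. A minimal sketch, where extract_embeddings and eval_loader are placeholders, not names from this repository:

import torch

@torch.no_grad()  # no autograd graph is kept, so activations are freed after each batch
def extract_embeddings(model, eval_loader, device="cuda"):
    model.eval()
    feats = []
    for x, _ in eval_loader:
        feats.append(model(x.to(device)).cpu())  # move embeddings off the GPU immediately
    model.train()
    return torch.cat(feats, dim=0)

# after evaluation, before the next training epoch:
torch.cuda.empty_cache()  # release unused cached GPU memory; can help when allocations fail due to fragmentation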

YingjieYin commented 1 year ago

I got the same problem. How did you solve it?

tjddus9597 commented 1 year ago

If memory problems occur during training, try reducing the batch size. If they occur during evaluation (this usually happens only with the SOP dataset), reduce the constant in line 144 of the "evaluate_cos_SOP" function in utils.py, "if len(xs)<10000:". This value determines the chunk size used to compute the cosine similarity. However, a smaller value means more loop iterations, which may increase the evaluation time.

e.g., utils.py

def evaluate_cos_SOP(X, T, normalize=False):
    if normalize:
        X = l2_norm(X)

    # get predictions by assigning the nearest K = 1000 neighbors with cosine similarity,
    # processing the queries in chunks of about 1000 to limit GPU memory usage
    K = 1000
    Y = []
    xs = []
    for x in X:
        if len(xs) < 1000:
            xs.append(x)
        else:
            xs.append(x)
            xs = torch.stack(xs, dim=0)
            cos_sim = F.linear(xs, X)
            y = T[cos_sim.topk(1 + K)[1][:, 1:]]
            Y.append(y.float().cpu())
            xs = []

    # last (possibly smaller) chunk
    xs = torch.stack(xs, dim=0)
    cos_sim = F.linear(xs, X)
    y = T[cos_sim.topk(1 + K)[1][:, 1:]]
    Y.append(y.float().cpu())
    Y = torch.cat(Y, dim=0)

    # calculate recall @ 1, 10, 100, 1000
    recall = []
    for k in [1, 10, 100, 1000]:
        r_at_k = calc_recall_at_k(T, Y, k)
        recall.append(r_at_k)
        print("R@{} : {:.3f}".format(k, 100 * r_at_k))
    return recall
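For reference, the same chunking idea can be written with the chunk size as an explicit parameter, so the memory/speed trade-off is easy to tune. This is only an illustrative sketch, not code from this repository; X is assumed to be the (N, D) matrix of L2-normalized embeddings and T the corresponding label vector:

import torch
import torch.nn.functional as F

def topk_labels_chunked(X, T, K=1000, chunk_size=1000):
    # Each chunk produces a (chunk_size, N) cosine-similarity matrix,
    # so the peak memory of this step scales linearly with chunk_size.
    Y = []
    for start in range(0, len(X), chunk_size):
        xs = X[start:start + chunk_size]
        cos_sim = F.linear(xs, X)            # cosine similarity, since X is L2-normalized
        idx = cos_sim.topk(1 + K)[1][:, 1:]  # drop column 0 (each query's match with itself)
        Y.append(T[idx].float().cpu())
    return torch.cat(Y, dim=0)               # (N, K) labels of the K nearest neighbors

Halving chunk_size roughly halves the peak memory of the cos_sim matrix, at the cost of twice as many loop iterations.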
YingjieYin commented 1 year ago

Thank you for your reply; the problem has been solved.