mancinimassimiliano / CuMix

Official code for "Towards Recognizing Unseen Categories in Unseen Domains"
MIT License

Compute resources needed to train on ZSL+DG #1

Open WilliamYi96 opened 4 years ago

WilliamYi96 commented 4 years ago

Hi @mancinimassimiliano,

I'm trying to reproduce your results on ZSL+DG. When I trained the model on a single RTX 2080 Ti, and even on a Titan V100, it ran out of memory. Could you please say more about the compute resources used for your ZSL+DG experiments? It would also be great if you could give an estimated training time.

Thanks so much.

Herewith I've attached the error info:

configs/zsl+dg/painting.json

Target: painting    run 1/10
Traceback (most recent call last):
  File "main.py", line 143, in <module>
    method = CuMix(seen_classes=seen,unseen_classes=unseen,attributes=attributes,configs=configs,zsl_only = not args.dg,
  File "/ibex/scratch/yik/CIZSL-JournalVersion/CuMix/methods.py", line 83, in __init__
    self.backbone.to(device)
  File "/home/yik/anaconda2/envs/cumix/lib/python3.8/site-packages/torch/nn/modules/module.py", line 443, in to
    return self._apply(convert)
  File "/home/yik/anaconda2/envs/cumix/lib/python3.8/site-packages/torch/nn/modules/module.py", line 203, in _apply
    module._apply(fn)
  File "/home/yik/anaconda2/envs/cumix/lib/python3.8/site-packages/torch/nn/modules/module.py", line 225, in _apply
    param_applied = fn(param)
  File "/home/yik/anaconda2/envs/cumix/lib/python3.8/site-packages/torch/nn/modules/module.py", line 441, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: out of memory
Traceback (most recent call last):
  File "/home/yik/anaconda2/envs/cumix/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/yik/anaconda2/envs/cumix/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/yik/anaconda2/envs/cumix/lib/python3.8/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/yik/anaconda2/envs/cumix/lib/python3.8/site-packages/torch/distributed/launch.py", line 258, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/home/yik/anaconda2/envs/cumix/bin/python', '-u', 'main.py', '--local_rank=0', '--zsl', '--dg', '--target', 'painting', '--config_file', 'configs/zsl+dg/painting.json', '--data_root', '/xxx/CuMix/data/DomainNet', '--name', 'painting_exps_zsldg']' returned non-zero exit status 1.

mancinimassimiliano commented 4 years ago

Hi @WilliamYi96,

Thanks for your interest in our work! That is weird: I ran most of my experiments on a Titan X (12 GB) and I checked that everything also worked on a V100. Time-wise, a single experiment for one target domain should take between 2 and 6 hours (the number of iterations depends on the smallest source domain), but that depends on your machine and on whether you pre-process the DomainNet images.

Regarding the memory issue, could you please share the command you are using to launch the experiment?
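
(By pre-processing I just mean resizing the DomainNet images offline so that data loading is not the bottleneck. A minimal sketch of such a script is below; paths and the target size are assumptions, this is not part of the repo.)

# Offline resize sketch (not part of the repo; SRC/DST/SIZE are placeholders).
import os
from PIL import Image

SRC = '/path/to/DomainNet'          # original images
DST = '/path/to/DomainNet_resized'  # pre-processed copy used for training
SIZE = 256                          # short-side target, assuming 224x224 crops at train time

for root, _, files in os.walk(SRC):
    for name in files:
        if not name.lower().endswith(('.jpg', '.jpeg', '.png')):
            continue
        src_path = os.path.join(root, name)
        dst_path = os.path.join(DST, os.path.relpath(src_path, SRC))
        os.makedirs(os.path.dirname(dst_path), exist_ok=True)
        img = Image.open(src_path).convert('RGB')
        w, h = img.size
        scale = SIZE / min(w, h)    # keep the aspect ratio, shrink the short side to SIZE
        img.resize((round(w * scale), round(h * scale)), Image.BILINEAR).save(dst_path)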

WilliamYi96 commented 4 years ago

Hi @mancinimassimiliano,

Thanks for your reply. I'm using the following command:

python -m torch.distributed.launch --nproc_per_node=1 main.py --zsl --dg --target painting --config_file configs/zsl+dg/painting.json --data_root ...xxx...CuMix/data/DomainNet --name painting_exps_zsldg

WilliamYi96 commented 4 years ago

Now I can work with 4 V100 GPUs, and the estimated total training time will be less than 20 hours. As you mentioned, it probably shouldn't be necessary to occupy that much compute, but currently I don't know what I'm missing. I'm new to your code, so I just ran it without any modifications (e.g. no extra pre-processing). Could DRAM be an issue? I currently allocate 10 GB of memory per GPU, which should be enough for most experiments.

mancinimassimiliano commented 4 years ago

Hi @WilliamYi96,

Unfortunately, I was not able to reproduce the issue. Is it still present on your machine? From the log you shared, it seems that the problem arises not during training but as soon as the feature extractor is moved to GPU memory. Could it be that the GPU memory is not being properly freed?

p.s. note also that to use N GPUs, you should set --nproc_per_node=N before main.py.
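
p.p.s. to verify that the GPUs are actually free before launching, a quick check like the one below can help (just a diagnostic sketch, not part of the repo; mem_get_info needs a reasonably recent PyTorch, otherwise plain nvidia-smi gives the same information):

# Diagnostic only: report free/total memory for each visible GPU before training starts.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # returns (free_bytes, total_bytes)
    print(f'GPU {i}: {free / 1e9:.2f} GB free / {total / 1e9:.2f} GB total')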

WilliamYi96 commented 4 years ago

Hi @mancinimassimiliano, thanks for your reply and kind reminder. I'm currently training with multiple GPUs and I'll train with a single GPU again after that. Will update and let you know soon.

shivam-chandhok commented 3 years ago

Hi, I am facing the same issue. Did you find a workaround for the problem?

mancinimassimiliano commented 3 years ago

Sorry for the super-late reply. I was not experiencing any problem because I tested the code on a single-GPU machine with one V100. But, my bad: while refactoring the code I accidentally set it up not to use DistributedDataParallel (I deleted the flag activating it, so the code was running in non-parallel mode). In addition, the backward pass for the image-level mixup can be carried out separately (saving almost half of the memory). I pushed my fixes to the memory-fix branch. I am double-checking that the modifications work as expected and, if so, I will merge them into master.
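
For those curious, the idea behind the second fix is roughly the following. This is a simplified, self-contained illustration with a placeholder model and losses, not the actual methods.py code:

# Illustration of splitting the backward passes (placeholder model/losses, not CuMix code):
# calling backward() after each forward accumulates gradients in .grad, so the graph of
# the first loss can be freed before the second (image-level mixup) forward is computed,
# instead of keeping both graphs alive until a single combined backward.
import torch
import torch.nn as nn

model = nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

clean = torch.randn(8, 16)
mixed = torch.randn(8, 16)            # stands in for the image-level mixed batch
labels = torch.randint(0, 4, (8,))

optimizer.zero_grad()
criterion(model(clean), labels).backward()   # first backward: semantic / feature-mixup terms
criterion(model(mixed), labels).backward()   # second backward: image-level mixup term
optimizer.step()                             # one step with the accumulated gradients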

@WilliamYi96, @shivam-chandhok: in case you have time, it would be super-nice of you to check whether those fixes solve the problem in your cases.

WilliamYi96 commented 3 years ago

@mancinimassimiliano Thanks for your reply. I found the new branch works for me.

But it seems there is some problem with multi-GPU training. It shows:

raise AttributeError('SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel')

The full error file is as follows:

python -m torch.distributed.launch --nproc_per_node=2 --master_port=2235 main.py --zsl --dg --target quickdraw --config_file configs/zsl+dg/quickdraw.json --data_root /ibex/scratch/yik/dataset/DomainNet --name quickdraw_exps_zsldg
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
configs/zsl+dg/quickdraw.json

Target: quickdraw    run 1/10
configs/zsl+dg/quickdraw.json

Target: quickdraw    run 1/10
  0%|          | 0/8 [00:07<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 152, in <module>
    semantic_loss, mimg_loss, mfeat_loss = method.fit(train_dataset)
  File "/ibex/scratch/yik/CZSL/DACZSL/CuMix-MemoryFix/methods.py", line 316, in fit
    preds, features = self.forward(inputs,return_features=True)
  File "/ibex/scratch/yik/CZSL/DACZSL/CuMix-MemoryFix/methods.py", line 227, in forward
    features = self.backbone(input)
  File "/home/yik/anaconda2/envs/cumix/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yik/anaconda2/envs/cumix/lib/python3.8/site-packages/torchvision/models/resnet.py", line 220, in forward
    return self._forward_impl(x)
  File "/home/yik/anaconda2/envs/cumix/lib/python3.8/site-packages/torchvision/models/resnet.py", line 204, in _forward_impl
    x = self.bn1(x)
  File "/home/yik/anaconda2/envs/cumix/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yik/anaconda2/envs/cumix/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 468, in forward
    raise AttributeError('SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel')
AttributeError: SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel
  0%|          | 0/8 [00:07<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 152, in <module>
    semantic_loss, mimg_loss, mfeat_loss = method.fit(train_dataset)
  File "/ibex/scratch/yik/CZSL/DACZSL/CuMix-MemoryFix/methods.py", line 316, in fit
    preds, features = self.forward(inputs,return_features=True)
  File "/ibex/scratch/yik/CZSL/DACZSL/CuMix-MemoryFix/methods.py", line 227, in forward
    features = self.backbone(input)
  File "/home/yik/anaconda2/envs/cumix/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yik/anaconda2/envs/cumix/lib/python3.8/site-packages/torchvision/models/resnet.py", line 220, in forward
    return self._forward_impl(x)
  File "/home/yik/anaconda2/envs/cumix/lib/python3.8/site-packages/torchvision/models/resnet.py", line 204, in _forward_impl
    x = self.bn1(x)
  File "/home/yik/anaconda2/envs/cumix/lib/python3.8/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yik/anaconda2/envs/cumix/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 468, in forward
    raise AttributeError('SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel')
AttributeError: SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel
Traceback (most recent call last):
  File "/home/yik/anaconda2/envs/cumix/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/yik/anaconda2/envs/cumix/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/yik/anaconda2/envs/cumix/lib/python3.8/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/yik/anaconda2/envs/cumix/lib/python3.8/site-packages/torch/distributed/launch.py", line 258, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/home/yik/anaconda2/envs/cumix/bin/python', '-u', 'main.py', '--local_rank=1', '--zsl', '--dg', '--target', 'quickdraw', '--config_file', 'configs/zsl+dg/quickdraw.json', '--data_root', '/ibex/scratch/yik/dataset/DomainNet', '--name', 'quickdraw_exps_zsldg']' returned non-zero exit status 1.

If you have time, could you please double-check it? Looking forward to your reply!
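
(For reference, the error itself says that a SyncBatchNorm layer can only run its forward pass when the model is wrapped in DistributedDataParallel. A minimal, self-contained example of the expected pattern, unrelated to the CuMix code, is below; the single-process setup is only for illustration.)

# Minimal SyncBatchNorm pattern (not CuMix code): the converted model must be wrapped
# in DistributedDataParallel, otherwise its forward pass raises this AttributeError.
import os
import torch
import torch.distributed as dist
import torchvision

os.environ.setdefault('MASTER_ADDR', 'localhost')
os.environ.setdefault('MASTER_PORT', '29500')
dist.init_process_group('nccl', rank=0, world_size=1)

model = torchvision.models.resnet50().cuda()
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)              # BatchNorm -> SyncBatchNorm
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[0])  # required wrapper

out = model(torch.randn(2, 3, 224, 224).cuda())  # forward works only inside DDP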

WilliamYi96 commented 3 years ago

Besides, it would be great if a finer-grained progress bar could be added. I mean, when I train CuMix, it shows:

python -m torch.distributed.launch --nproc_per_node=1 --master_port=2235 main.py --zsl --dg --target quickdraw --config_file configs/zsl+dg/quickdraw.json --data_root /ibex/scratch/yik/dataset/DomainNet --name quickdraw_exps_zsldg
configs/zsl+dg/quickdraw.json

Target: quickdraw    run 1/10
  0%|          | 0/8 [00:00<?, ?it/s]

I wait a very long time at the same progress bar, with no way to tell whether the program is stuck or still running.
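
Something like the sketch below would already help to see that the process is alive (a per-iteration tqdm bar with the running loss in the postfix; the loop and loss are placeholders, not the actual main.py code):

# Sketch of a finer-grained progress bar (placeholder training loop, not main.py):
# updating a per-iteration tqdm bar with the running loss makes hangs easy to spot.
from tqdm import tqdm

num_epochs, iters_per_epoch = 8, 100
for epoch in range(num_epochs):
    bar = tqdm(range(iters_per_epoch), desc=f'epoch {epoch + 1}/{num_epochs}')
    for it in bar:
        loss = 1.0 / (it + 1)                # stand-in for the real training step
        bar.set_postfix(loss=f'{loss:.3f}')  # refreshed every iteration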