vturrisi / solo-learn

solo-learn: a library of self-supervised methods for visual representation learning powered by Pytorch Lightning
MIT License

nvjpeg memory allocation failure #104

Closed · tomsal closed this issue 3 years ago

tomsal commented 3 years ago

Hi,

I ran into an issue where the pretraining script crashes after about 8.5 epochs due to an allocation failure. I am guessing there might be a memory leak somewhere.

Details:

The error I get is the following:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/configuration_validator.py:101: UserWarning: you defined a validation_step but have no val_dataloader. Skipping val loop
  rank_zero_warn(f'you defined a {step_name} but have no {loader_name}. Skipping {stage} loop')
Global seed set to 5
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All DDP processes registered. Starting ddp with 1 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name       | Type       | Params
------------------------------------------
0 | encoder    | ResNet     | 23.5 M 
1 | classifier | Linear     | 2.0 M
2 | projector  | Sequential | 12.6 M
3 | predictor  | Sequential | 2.1 M 
------------------------------------------
40.2 M    Trainable params
2.0 K     Non-trainable params
40.3 M    Total params
161.002   Total estimated model params size (MB)
Global seed set to 5
read 1281167 files from 1000 directories
Epoch 8:  50%|████████████████                | 13369/26690 [1:55:18<1:54:53,  1.93it/s, loss=3.67, v_num=ok1z]
Traceback (most recent call last):
  File "main_pretrain.py", line 136, in <module>
    main()
  File "main_pretrain.py", line 130, in main
    trainer.fit(model, val_dataloaders=val_loader)
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
    self._run(model)
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 758, in _run
    self.dispatch()
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 799, in dispatch
    self.accelerator.start_training(self)
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in run_stage
    return self.run_train()
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 871, in run_train
    self.train_loop.run_training_epoch()
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 491, in run_training_epoch
    for batch_idx, (batch, is_last_batch) in train_dataloader:
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/profiler/profilers.py", line 112, in profile_iterable
    value = next(iterator)
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/supporters.py", line 534, in prefetch_iterator
    for val in it:
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/supporters.py", line 464, in __next__
    return self.request_next_batch(self.loader_iters)
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/trainer/supporters.py", line 478, in request_next_batch
    return apply_to_collection(loader_iters, Iterator, next)
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/pytorch_lightning/utilities/apply_func.py", line 85, in apply_to_collection
    return function(data, *args, **kwargs)
  File "~/Code/solo-learn/solo/methods/dali.py", line 59, in __next__
    batch = super().__next__()
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/plugin/pytorch.py", line 194, in __next__
    outputs = self._get_outputs()
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/plugin/base_iterator.py", line 255, in _get_outputs
    outputs.append(p.share_outputs())
  File "~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/pipeline.py", line 863, in share_outputs
    return self._pipe.ShareOutputs()
RuntimeError: Critical error in pipeline:
Error when executing Mixed operator decoders__Image encountered:
Error in thread 2: [/opt/dali/dali/operators/decoder/nvjpeg/nvjpeg_decoder_decoupled_api.h:917] NVJPEG error "5" : NVJPEG_STATUS_ALLOCATOR_FAILURE n02447366/n02447366_33293.jpg
Stacktrace (7 entries):
[frame 0]: ~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/libdali_operators.so(+0x4cbbee) [0x7efc6c55dbee]
[frame 1]: ~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/libdali_operators.so(+0x87a63b) [0x7efc6c90c63b]
[frame 2]: ~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/libdali_operators.so(+0x87aa2e) [0x7efc6c90ca2e]
[frame 3]: ~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/libdali.so(dali::ThreadPool::ThreadMain(int, int, bool)+0x1f0) [0x7efc6b5ed330]
[frame 4]: ~/miniconda3/envs/solo-learn/lib/python3.6/site-packages/nvidia/dali/libdali.so(+0x70718f) [0x7efc6bb9f18f]
[frame 5]: /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7efd013b96db]
[frame 6]: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7efd010e2a3f]

Current pipeline object is no longer valid.

After I ran into this the first time, I reran it with GPU memory logging. This is the plot I get: [attached plot of GPU memory usage over training steps]

I am a bit confused that there is an increase after 3.5k steps (from 11979 GB to 1201GB). Let me know if I should provide more logs or anything else.
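
For reference, a minimal sketch of how such per-step GPU memory logging can be done with a Lightning callback (illustrative code, not part of solo-learn; note that the torch.cuda counters only see PyTorch's own allocations, while DALI's nvJPEG buffers only show up in nvidia-smi):

```python
import torch
from pytorch_lightning.callbacks import Callback


class GPUMemoryLogger(Callback):
    """Illustrative callback: record CUDA allocator stats after every training batch.

    Note: torch.cuda counters only cover PyTorch's own allocations; DALI's nvJPEG
    buffers show up in nvidia-smi but not here.
    """

    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        if torch.cuda.is_available() and trainer.logger is not None:
            trainer.logger.log_metrics(
                {
                    "gpu_mem_allocated_mb": torch.cuda.memory_allocated() / 2**20,
                    "gpu_mem_reserved_mb": torch.cuda.memory_reserved() / 2**20,
                },
                step=trainer.global_step,
            )
```

Passing this via `Trainer(callbacks=[GPUMemoryLogger()])` gives one curve per metric; Lightning versions of this era also have a `log_gpu_memory` Trainer flag that polls nvidia-smi, which should also capture DALI's allocations.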

P.S.: Great work! It is a pleasure to work with! :)

DonkeyShot21 commented 3 years ago

Are you running on a single GPU with 12 GB of memory? I doubt you can fit an ImageNet run on that setup.

In our experiments the memory usage stabilizes after some time, so it's unlikely that this is due to a memory leak. More probably it is caused by automatic mixed precision (some param groups might jump from 16-bit to 32-bit precision) or by some internal behavior of DALI that I am not sure about.

EDIT: I see you are using batch size 48; maybe this is not the best choice. Instead, you can try decreasing the number of workers from 12 to maybe 4. This really reduces the amount of memory needed, with a negligible slowdown.

tomsal commented 3 years ago

Yes, I am running it on a single GPU with 12 GB of memory, but, as you correctly noted, with batch size 48. I am aware that in terms of training results this is not an ideal setup, but it is still good enough for debugging the code before moving to a multi-GPU cluster, I'd say. :)

I will try reducing the workers. I am aware that, in general, this is not a major issue; still, I thought it would be good to let you know about it.

vturrisi commented 3 years ago

Just to add to what @DonkeyShot21 said, DALI's memory usage scales with the number of workers: every 4 workers per GPU adds an overhead of ~3 GB after it stabilizes. I'm not really sure why memory increases after some epochs, because it should stay pretty much the same since we pre-allocate a buffer here: https://github.com/vturrisi/solo-learn/blob/532e9a516b1253c86149a01812e81dfe2bd729df/solo/utils/dali_dataloader.py#L187

According to the DALI docs (https://docs.nvidia.com/deeplearning/dali/user-guide/docs/supported_ops.html?highlight=host_memory_padding), this should be enough, but we have always observed a small increase in memory usage until around epoch 60.
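
For context, here is a minimal standalone sketch of the kind of decoder configuration involved, written with DALI's functional API (this is illustrative, not solo-learn's actual pipeline; the padding values are the ones used in NVIDIA's ImageNet examples and the data path is a placeholder):

```python
import nvidia.dali.fn as fn
from nvidia.dali import pipeline_def


@pipeline_def
def decode_pipeline(data_dir):
    # File reader + GPU-accelerated (nvJPEG) decoding, i.e. the "Mixed" decoder
    # that appears in the traceback above.
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True, name="Reader")
    images = fn.decoders.image(
        jpegs,
        device="mixed",
        # Pre-allocate nvJPEG scratch buffers up front so large images don't force
        # re-allocations mid-training (values taken from NVIDIA's ImageNet examples).
        device_memory_padding=211025920,
        host_memory_padding=140544512,
    )
    images = fn.resize(images, resize_shorter=256)
    return images, labels


# Fewer threads mean fewer per-thread decode buffers and hence less memory,
# which is what the "decrease the workers" suggestion above boils down to.
pipe = decode_pipeline(
    data_dir="/path/to/imagenet/train",  # placeholder path
    batch_size=48,
    num_threads=4,
    device_id=0,
)
pipe.build()
```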

tomsal commented 3 years ago

OK, thanks, that's great info! So deactivating DALI should also work, I guess? I have to admit I didn't really take DALI into account when scaling up the workers.

vturrisi commented 3 years ago

Yes, if you turn DALI off you will save ~3 GB of memory (when using 4 workers), but you will run around 50% slower. If you scale up the workers a lot, I think you can get good performance, but you will use a lot of RAM.
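
For completeness, turning DALI off means falling back to a regular torchvision-style loader along these lines (an illustrative sketch with a placeholder path and simplified transforms, not solo-learn's actual SSL augmentation stack); JPEG decoding then happens in the CPU worker processes, which is where the ~50% slowdown comes from:

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Illustrative augmentations only; a real SSL pipeline applies a much richer,
# multi-view augmentation stack.
transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

train_dataset = datasets.ImageFolder("/path/to/imagenet/train", transform=transform)  # placeholder path

# Decoding now runs in CPU worker processes instead of nvJPEG on the GPU,
# so the ~3 GB of DALI buffers disappear at the cost of throughput.
train_loader = DataLoader(
    train_dataset,
    batch_size=48,
    num_workers=4,
    pin_memory=True,
    shuffle=True,
    drop_last=True,
)
```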