nvjpeg memory allocation failure #104

tomsal closed this issue 3 years ago

tomsal commented 3 years ago


I ran into an issue where the pretraining script crashes after 8.5 epochs due to an allocation failure. I suspect there might be a memory leak somewhere.


The error I get is the following:

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
Using native 16bit precision.
dator.py:101: UserWarning: you defined a validation_step but have no val_dataloader. Skipping val loop
  rank_zero_warn(f'you defined a {step_name} but have no {loader_name}. Skipping {stage} loop')
Global seed set to 5
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
All DDP processes registered. Starting ddp with 1 processes


  | Name       | Type       | Params
0 | encoder    | ResNet     | 23.5 M 
1 | classifier | Linear     | 2.0 M
2 | projector  | Sequential | 12.6 M
3 | predictor  | Sequential | 2.1 M 
40.2 M    Trainable params
2.0 K     Non-trainable params
40.3 M    Total params
161.002   Total estimated model params size (MB)
Global seed set to 5
read 1281167 files from 1000 directories
Epoch 8:  50%|████████████████                | 13369/26690 [1:55:18<1:54:53,  1.93it/s, loss=3.67, v_num=ok1z]

After I ran into this the first time, I reran it with GPU memory logging. This is the plot I get: (image: GPU memory usage plot)

I am a bit confused by the increase after 3.5k steps. Let me know if I should provide more logs.
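
For anyone who wants to reproduce this kind of memory logging, here is a minimal sketch. It is not necessarily how the plot above was produced. It assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` and `--format` options; the helper names (`parse_memory_csv`, `query_gpu_memory`, `first_growth_step`) and the 256 MiB tolerance are hypothetical choices for illustration.

```python
import subprocess

# nvidia-smi query that prints one "used, total" line (in MiB) per GPU.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(output):
    """Parse 'used, total' lines into a list of (used_mib, total_mib) tuples."""
    readings = []
    for line in output.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def query_gpu_memory():
    """Run nvidia-smi once and return per-GPU (used_mib, total_mib)."""
    result = subprocess.run(
        NVIDIA_SMI_CMD, capture_output=True, text=True, check=True
    )
    return parse_memory_csv(result.stdout)


def first_growth_step(samples, tolerance_mib=256):
    """Return the first step whose usage exceeds the first reading by more
    than tolerance_mib, or None if usage never grows that much.

    samples is a list of (step, used_mib) pairs in step order.
    """
    if not samples:
        return None
    baseline = samples[0][1]
    for step, used in samples:
        if used - baseline > tolerance_mib:
            return step
    return None
```

Calling `query_gpu_memory()` every few hundred training steps and passing the collected `(step, used_mib)` pairs to `first_growth_step` would show roughly where usage starts to climb. Note that `nvidia-smi` reports usage for the whole device, not only the training process.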

P.S.: Great work! It is a pleasure to work with! :)

DonkeyShot21 commented 3 years ago

Are you running on a single GPU with 12 GB of memory? I doubt you can fit an ImageNet run on that setup.

In our experiments the memory usage stabilizes after some time, so it's unlikely that this is due to a memory leak. More probably it is due to automatic mixed precision (some param groups might jump from precision 16 to precision 32) or to some internal behavior of DALI that I am not sure about.

EDIT: I see you are using batch size 48; maybe this is not the best choice. Instead, you can try decreasing the number of workers from 12 to maybe 4. This really decreases the amount of memory needed, with a negligible slowdown.

tomsal commented 3 years ago

Yes, I am running it on a single GPU with 12 GB memory, but, as you correctly noted, with batch size 48. I am aware that in terms of training results this is not an ideal setup. It is still good enough for debugging code before going on a multi-GPU cluster, I'd say. :)

I will try reducing the workers, and I am aware that, in general, this is not a major issue. Still, I thought it would be good to let you know about this.

vturrisi commented 3 years ago

Just to add to what @DonkeyShot21 said, DALI's memory usage scales with the number of workers. Every 4 workers per GPU add an overhead of ~3 GB once it stabilizes. I'm not really sure why memory increases after some epochs, because it should stay pretty much the same since we pre-allocate a buffer here: https://github.com/vturrisi/solo-learn/blob/532e9a516b1253c86149a01812e81dfe2bd729df/solo/utils/dali_dataloader.py#L187

According to DALI docs (https://docs.nvidia.com/deeplearning/dali/user-guide/docs/supported_ops.html?highlight=host_memory_padding) this should be enough, but we always experienced a small increase in memory usage until ~epoch 60.
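
As a rough back-of-the-envelope check of the numbers above, the sketch below turns "~3 GB per 4 workers per GPU" into an estimate. It assumes the overhead grows linearly with the worker count, which is a simplification and not something the DALI docs guarantee; the function names are hypothetical.

```python
def estimate_dali_overhead_gb(num_workers, gb_per_four_workers=3.0):
    """Estimate DALI's GPU memory overhead in GB for one GPU.

    Assumes linear scaling: roughly 3 GB for every 4 workers,
    as reported in this thread once usage has stabilized.
    """
    return num_workers / 4 * gb_per_four_workers


def remaining_gpu_memory_gb(total_gb, num_workers, gb_per_four_workers=3.0):
    """Memory left for the model, activations, and optimizer state."""
    return total_gb - estimate_dali_overhead_gb(num_workers, gb_per_four_workers)


# The setup from this thread: a 12 GB GPU with 12 workers vs. 4 workers.
for workers in (12, 4):
    overhead = estimate_dali_overhead_gb(workers)
    remaining = remaining_gpu_memory_gb(12, workers)
    print(f"{workers} workers: ~{overhead:.1f} GB for DALI, ~{remaining:.1f} GB left")
```

Under that assumption, 12 workers would take about 9 GB of a 12 GB card, leaving about 3 GB for training itself, while 4 workers would leave about 9 GB.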

tomsal commented 3 years ago

Ok, thanks, that's great info! So deactivating DALI should also work, I guess? I have to admit that I didn't really take DALI into account when scaling up the workers.

vturrisi commented 3 years ago

Yes, if you turn DALI off you will save ~3 GB of memory (when using 4 workers), but you will run around 50% slower. If you scale the workers a lot, I think you can get good performance, but you will use a lot of RAM.