Hi, I’m encountering a CUDA out-of-memory (OOM) issue similar to the one described in #10, while running the Pascal VOC experiment with 92 labels on TITAN XP GPUs. It occurred during the ASPP stage. To troubleshoot, I reduced the batch size to 1 and scaled down the decoder channels. Specifically, I made the following adjustments to the decoder head (a config sketch follows the list):
channels: 128 -> 16
text_channels: 128 -> 16
up_channels: (64, 32) -> (8, 4)
skip_channels: (32, 16) -> (4, 2)
Hardcoded the number of groups in the decoder’s GroupNorm layers to 1
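For clarity, roughly what my edits look like in config form (the key names are paraphrased from the list above; the actual config file may name them differently):

```python
# Sketch of my decoder-head changes (key names are my paraphrase):
decoder_cfg = dict(
    channels=16,           # was 128
    text_channels=16,      # was 128
    up_channels=(8, 4),    # was (64, 32)
    skip_channels=(4, 2),  # was (32, 16)
)
# I also hardcoded num_groups=1 in the decoder's GroupNorm layers, since
# GroupNorm requires num_channels % num_groups == 0 and the reduced channel
# counts (e.g. 4 and 2) are no longer divisible by the original group size.
```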
These changes allowed me to pass the ASPP stage, but it OOMed again during the upscaling step.
Given my limited VRAM, I would like to try pre-calculating the pseudo-labels for the CLIP guidance loss, as suggested in the other thread. My question is how best to do this. I’ve identified the forward_maskclip function in model/vlm.py as a likely candidate, but it appears to process weakly augmented images, which vary per iteration, and I’m unsure how to handle that variability when pre-calculating the labels. A sketch of what I have in mind follows below.
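To make the question concrete, here is a rough sketch of the pre-calculation I’m imagining (the dataset API, the exact forward_maskclip signature, and the output format are all my guesses; only the function name comes from model/vlm.py):

```python
import os
import torch

@torch.no_grad()
def precompute_pseudo_labels(vlm, dataset, out_dir):
    """Run the frozen VLM once per image and cache its pseudo-labels to disk.

    Assumes `dataset` yields (image_tensor, image_id) pairs for the
    *unaugmented* images -- this is the part I'm unsure about.
    """
    vlm.eval()
    os.makedirs(out_dir, exist_ok=True)
    for idx in range(len(dataset)):
        image, image_id = dataset[idx]
        # Signature assumed: a batched image in, per-pixel class logits out.
        logits = vlm.forward_maskclip(image.unsqueeze(0).cuda())
        # uint8 is enough for 92 classes and keeps the cache small.
        pseudo = logits.argmax(dim=1).squeeze(0).to(torch.uint8).cpu()
        torch.save(pseudo, os.path.join(out_dir, f"{image_id}.pt"))
```

The part I can’t resolve is the weak augmentation: since the crops and flips vary per iteration, would the intended approach be to cache predictions on the full unaugmented images (as sketched above) and then replay the same geometric transforms on the cached labels at load time?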
Thanks for your help!