I find that when you load pretrained CLIP parameters, you truncate the context_length of the positional_embedding, as in: https://github.com/raoyongming/DenseCLIP/blob/3b72447dee3f622f3716738140161ef9f763c72f/detection/denseclip/models.py#L652-L655

Does this affect the pretrained model's performance? In other words, does it change the original output of the pretrained text encoder?

We truncate the text input and the corresponding positional_embedding to reduce memory and computation, since the texts used in our case are usually short (a prompt plus a class name). The modification does not affect performance on downstream tasks like segmentation and detection, but the output may differ slightly from that of the original CLIP implementation.
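A minimal sketch of this kind of truncation, assuming a plain PyTorch state dict (the checkpoint path and `context_length = 13` are illustrative, not DenseCLIP's actual values):

```python
import torch

# Hypothetical shortened context: enough for "prompt + class name" plus the
# start/end tokens; CLIP's default context length is 77.
context_length = 13

# Illustrative checkpoint path; CLIP stores the text positional embedding
# under the key "positional_embedding" with shape [77, transformer_width].
state_dict = torch.load("clip_text.pth", map_location="cpu")

pe = state_dict["positional_embedding"]
if pe.shape[0] > context_length:
    # Keep only the first `context_length` positions; the embeddings of the
    # surviving positions are unchanged, only the tail rows are dropped.
    state_dict["positional_embedding"] = pe[:context_length].clone()

# model.load_state_dict(state_dict, strict=False)  # then load as usual
```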
OK, thanks.
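On the input side, the tokenizer then has to pad or truncate to the same shortened length so token positions line up with the truncated embedding; a sketch using OpenAI's `clip` package for illustration:

```python
import clip  # OpenAI's CLIP package, used here only for illustration

# context_length=13 matches the truncated positional embedding above;
# truncate=True guards against prompts longer than the shortened context.
tokens = clip.tokenize(["a photo of a cat"], context_length=13, truncate=True)
print(tokens.shape)  # torch.Size([1, 13])
```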