Open KyunHwan opened 3 months ago
I also tested another resolution (1536 x 1152), multiplying t by (512/1536), and the results were reasonable. So far this works well for objects that have a "good number" of features.
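For other target sizes, the factor applied to t is the ratio of the training size to the new size. The sketch below only illustrates that arithmetic. The helper names are hypothetical and not part of the repository, and the patch size of 16 is an assumption. It also computes the largest scaled position, to show that it stays inside the range [0, 32) covered by a 512-pixel side.

```python
def rope_position_scale(train_long_side: int, target_long_side: int) -> float:
    """Factor to multiply t by, e.g. 512/1536 (hypothetical helper)."""
    if train_long_side <= 0 or target_long_side <= 0:
        raise ValueError("image sizes must be positive")
    return train_long_side / target_long_side


def max_scaled_position(target_long_side: int, scale: float, patch_size: int = 16) -> float:
    """Largest position index after scaling (patch_size=16 is an assumption)."""
    num_patches = target_long_side // patch_size
    return (num_patches - 1) * scale


scale = rope_position_scale(512, 1536)
print(scale)                             # factor applied to t
print(max_scaled_position(1536, scale))  # stays below 512 // 16
```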
How to modify the parameter t?

    def get_cos_sin(self, D, seq_len, device, dtype):
        if (D, seq_len, device, dtype) not in self.cache:
            inv_freq = 1.0 / (self.base ** (torch.arange(0, D, 2).float().to(device) / D))
            t = torch.arange(seq_len, device=device, dtype=inv_freq.dtype)
            freqs = torch.einsum("i,j->ij", t, inv_freq).to(dtype)
            freqs = torch.cat((freqs, freqs), dim=-1)
            cos = freqs.cos()  # (Seq, Dim)
            sin = freqs.sin()
            self.cache[D, seq_len, device, dtype] = (cos, sin)
        return self.cache[D, seq_len, device, dtype]

If you're going from 512 to 1024, multiply t by the ratio of the two sizes: t = torch.arange(seq_len, device=device, dtype=inv_freq.dtype) * (512/1024)
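For reference, here is a minimal standalone sketch of the same change, written as a free function rather than the RoPE2D method so it can be run on its own. The function name, the train_size and target_size parameters, and the default base of 100.0 are assumptions for illustration and are not taken from the repository.

```python
import torch


def get_cos_sin_interpolated(D, seq_len, device, dtype,
                             base=100.0, train_size=512, target_size=1024):
    """Cos/sin tables with positions scaled by train_size / target_size.

    Hypothetical standalone version of get_cos_sin; base, train_size and
    target_size are illustrative parameters, not the repository's API.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, D, 2).float().to(device) / D))
    # Position interpolation: squeeze the longer index range back into
    # the range of positions seen at the training resolution.
    scale = train_size / target_size
    t = torch.arange(seq_len, device=device, dtype=inv_freq.dtype) * scale
    freqs = torch.einsum("i,j->ij", t, inv_freq).to(dtype)
    freqs = torch.cat((freqs, freqs), dim=-1)
    return freqs.cos(), freqs.sin()  # each of shape (seq_len, D)


cos, sin = get_cos_sin_interpolated(32, 64, "cpu", torch.float32)
print(cos.shape, sin.shape)
```

With a factor of 512/1024, position 2 in the interpolated table gets the same angles as position 1 in the unscaled table. If the factor is made configurable inside the original method, it would probably also need to be part of the cache key, since the cache is currently keyed only on (D, seq_len, device, dtype).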
@KyunHwan thanks
Using the default setup, large input images were being resized to 512 x 384 (using DUSt3R_ViTLarge_BaseDecoder_512_dpt.pth), but I wanted results at a higher resolution (1024 x 768). Following "Extending Context Window of Large Language Models via Position Interpolation" by Meta, I made only two changes: I changed the default image_size value inside demo.py from 512 to 1024, and I multiplied the variable t inside the get_cos_sin method of RoPE2D in croco/models/pos_embed.py by (512/1024). This gave pretty good results, though finetuning would most likely be needed to improve them further.