The paper (and focalTv1) uses a window size of 7 with an expand size of 3 to cover a 13x13 zone (7 + 2*3 = 13). In your v2 implementation, expand_size is declared but never used. I believe that is because you replaced it with the topK closest positions (which represent a zone of sqrt(128)xsqrt(128) in your config). Am I right? And you are then projecting the topK coordinates using a Linear layer, right? Have you tested the model with this configuration?
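To make the comparison concrete, here is a minimal sketch of the arithmetic behind my question. The function names are hypothetical (they are not from the repo), and I am assuming that the expand size is added on both sides of the window and that topK is 128 in the config.

```python
import math


def focal_region_side(window_size: int, expand_size: int) -> int:
    """Side length of the v1 zone, assuming the window is expanded on both sides."""
    return window_size + 2 * expand_size


def topk_equivalent_side(topk: int) -> float:
    """Side length of a square zone holding `topk` positions (assumed square)."""
    return math.sqrt(topk)


v1_side = focal_region_side(window_size=7, expand_size=3)  # 13
v2_side = topk_equivalent_side(topk=128)                   # about 11.31

print(f"v1 zone: {v1_side}x{v1_side} = {v1_side ** 2} positions")
print(f"v2 topK zone: about {v2_side:.2f}x{v2_side:.2f} = 128 positions")
```

If these assumptions hold, the v2 topK zone (128 positions) is somewhat smaller than the 13x13 = 169 positions covered in v1.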