Open atlas-sky opened 1 week ago
Looking at the code, it seems that there are no weights for key, query, and value in the self-attention implementation. Is this the correct implementation?

We actually implemented both a direct and a transformed version of self-attention. The version you are looking at was intended as a quick initial validation of the core attention mechanism, where the key, query, and value are used directly without any transformation. However, as described in our paper, the main implementation uses learnable parameters to transform K, Q, and V. You can check the latest code update to select the version that best fits your needs.
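
For anyone else reading this thread, here is a minimal sketch of the difference between the two variants, assuming a PyTorch implementation; the class and parameter names below are illustrative and not taken from the repository's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectSelfAttention(nn.Module):
    """Quick-validation variant: K, Q, and V are the input itself (no learned weights)."""
    def forward(self, x):
        # x: (batch, seq_len, dim); Q = K = V = x
        d = x.size(-1)
        scores = x @ x.transpose(-2, -1) / d ** 0.5
        return F.softmax(scores, dim=-1) @ x

class ProjectedSelfAttention(nn.Module):
    """Paper variant: learnable linear projections produce K, Q, and V."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
        return F.softmax(scores, dim=-1) @ v

# Example usage (hypothetical shapes):
x = torch.randn(2, 16, 64)          # (batch, seq_len, dim)
out_direct = DirectSelfAttention()(x)
out_proj = ProjectedSelfAttention(64)(x)
```

The first class has no trainable parameters, which matches what the issue describes; the second adds the learnable K/Q/V transforms referred to in the paper.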