Closed by lucidrains 3 years ago
either use the relative positional encoding (RPE) scheme from the paper, or see whether rotary embeddings can be used where possible, with a T5-style RPE bias everywhere else
the RPE used in the paper is good, which explains why they use it to bias all the other attention in the network
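For reference, here is a minimal sketch of the two alternatives mentioned above, in plain Python with no framework. It is not the paper's RPE scheme and not this repository's code. The names `rotate`, `dot`, and `relative_bias` are hypothetical, the bias table is a plain list for a single head, and relative distances are simply clipped rather than mapped to the log-spaced buckets that T5 uses.

```python
import math

def rotate(vec, pos, base=10000.0):
    """Apply a rotary embedding to an even-length vector at position `pos`.

    Each consecutive pair (vec[i], vec[i + 1]) is rotated by an angle
    proportional to `pos`, with a different frequency per pair.
    """
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    """Plain dot product, standing in for one attention logit."""
    return sum(x * y for x, y in zip(a, b))

def relative_bias(seq_len, table, max_distance):
    """Build an additive attention bias from relative offsets.

    bias[i][j] = table[clip(j - i, -max_distance, max_distance) + max_distance]
    Simplification (assumption): clipping instead of T5's log-spaced buckets.
    `table` must have 2 * max_distance + 1 entries.
    """
    bias = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            rel = max(-max_distance, min(max_distance, j - i))
            row.append(table[rel + max_distance])
        bias.append(row)
    return bias

# Rotary: the logit depends only on the offset between positions (4 here).
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
logit_a = dot(rotate(q, 3), rotate(k, 7))
logit_b = dot(rotate(q, 10), rotate(k, 14))

# Bias: one value per clipped offset, added to the logits before softmax.
table = [0.1 * i for i in range(5)]  # stand-in for learned parameters
bias = relative_bias(4, table, max_distance=2)
```

Rotary embeddings act on the queries and keys themselves, so they only fit where attention is computed from a query/key dot product over positions. The additive bias only needs a matrix of offsets, which is why it is the fallback for the other attention layers.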