I noticed the masks were pretty narrow when visualizing the inputs during training. The paper mentions that they mask 70-100% of each sequence in the batch (as in Voicebox), so this updates mask generation to match.
(I'm not super solid on the tensor typing so feel free to fix/edit if you want to take this!)
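For reference, a minimal sketch of per-sequence masking in that 70-100% range (numpy for illustration; `sample_span_mask`, the uniform fraction sampling, and the contiguous-span placement are my assumptions, not necessarily what this PR implements):

```python
import numpy as np

def sample_span_mask(seq_len: int, rng: np.random.Generator,
                     min_frac: float = 0.7, max_frac: float = 1.0) -> np.ndarray:
    """Sample a contiguous boolean mask covering min_frac..max_frac of the sequence.

    Hypothetical helper illustrating Voicebox-style wide masking: pick a
    coverage fraction uniformly in [0.7, 1.0], then place a contiguous
    span of that length at a random offset.
    """
    frac = rng.uniform(min_frac, max_frac)
    span = max(1, int(round(frac * seq_len)))
    start = int(rng.integers(0, seq_len - span + 1))
    mask = np.zeros(seq_len, dtype=bool)
    mask[start:start + span] = True
    return mask
```

Sampling one mask per batch element (rather than one per batch) keeps the masked fraction varied across sequences, which is what made the narrow masks visible in the first place.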