Hello, I read the GPT-2 paper: it says that they applied BPE to sequences of bytes, which only requires a base vocabulary of 256. I searched the internet but couldn't find any explanation of how BPE on a sequence of bytes works or why the base vocabulary size is 256. I'm confused because I don't understand how this differs from applying BPE to regular characters, or what the clear motivation is, since they also say that character/byte-level LMs don't work well. How is this different?
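For reference, here is a tiny Python sketch of what I think the "sequence of bytes" part means (my own example, not taken from the GPT-2 code); I assume the 256 comes from the fact that every UTF-8 byte is an integer in 0..255:

```python
text = "héllo 👋"

# Character-level view: the base alphabet is every possible Unicode code point
# (well over 100k symbols), since any character can appear.
chars = list(text)
print(chars)        # ['h', 'é', 'l', 'l', 'o', ' ', '👋']

# Byte-level view: encode to UTF-8 first; every symbol is now an integer in 0..255,
# so the base vocabulary before any BPE merges is only 256 entries.
byte_seq = list(text.encode("utf-8"))
print(byte_seq)     # [104, 195, 169, 108, 108, 111, 32, 240, 159, 145, 139]
print(max(byte_seq) < 256)  # True: nothing falls outside the 256-byte alphabet
```

Is my understanding correct that BPE merges are then learned on top of these byte IDs instead of on characters?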
THANKS.