Closed nowittynamesleft closed 2 years ago
Need better tokenization; byte-pair encoding seems to be the way to get subword tokens from the data itself using frequencies of strings
seems to be working now, need to test different limits of frequencies of byte pairs to encode, as it is a hyperparameter to control
Need better tokenization; byte-pair encoding seems to be the way to get subword tokens from the data itself using frequencies of strings