nowittynamesleft / protein_function_description

Describing protein sets' functions with natural language.

byte-pair encoding #15

Closed nowittynamesleft closed 2 years ago

nowittynamesleft commented 2 years ago

Need better tokenization. Byte-pair encoding seems to be the way to go: it learns subword tokens from the data itself by repeatedly merging the most frequent adjacent symbol pairs.
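For reference, a minimal sketch of the byte-pair encoding idea, not the code used in this repo. The names `learn_bpe`, `merge_pair`, and `encode` are hypothetical, the toy word list is made up, and the sketch assumes whitespace-split words with characters as the starting symbols.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Each distinct word starts as a tuple of single characters, weighted by its count.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


toy_words = ["low", "low", "lower", "lowest"]  # made-up example data
toy_merges = learn_bpe(toy_words, num_merges=3)
print(toy_merges)
print(encode("lowest", toy_merges))
```

Replaying the merges in the order they were learned is what lets unseen words be split into known subword pieces.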

nowittynamesleft commented 2 years ago

Seems to be working now. Still need to test different frequency limits for the byte pairs to encode, since that limit is a hyperparameter to tune.
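A rough sketch of what such a sweep could look like, assuming the limit is a minimum pair frequency below which merging stops (the comment above does not say whether it is that or a cap on the number of merges). `train_bpe`, `sweep`, and the toy corpus are hypothetical and are not taken from the repo.

```python
from collections import Counter


def train_bpe(corpus, min_pair_freq):
    """Merge the most frequent adjacent pair until none occurs at least `min_pair_freq` times.

    Returns the learned merges and the total token count of the corpus afterwards.
    """
    seqs = Counter(tuple(word) for word in corpus)
    merges = []
    while True:
        pairs = Counter()
        for seq, n in seqs.items():
            for pair in zip(seq, seq[1:]):
                pairs[pair] += n
        if not pairs:
            break
        # Ties are broken by the pair itself so that runs are reproducible.
        best, freq = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_pair_freq:
            break
        merges.append(best)
        merged = Counter()
        for seq, n in seqs.items():
            out, i = [], 0
            while i < len(seq):
                if seq[i:i + 2] == best:
                    out.append(seq[i] + seq[i + 1])
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            merged[tuple(out)] += n
        seqs = merged
    n_tokens = sum(len(seq) * n for seq, n in seqs.items())
    return merges, n_tokens


def sweep(corpus, thresholds):
    """Map each threshold to (number of merges learned, total tokens in the corpus)."""
    results = {}
    for threshold in thresholds:
        merges, n_tokens = train_bpe(corpus, threshold)
        results[threshold] = (len(merges), n_tokens)
    return results


toy_corpus = ["low", "low", "lower", "lowest"]  # made-up example data
for threshold, (n_merges, n_tokens) in sweep(toy_corpus, [5, 4, 2, 1]).items():
    print(f"min_pair_freq={threshold}: {n_merges} merges, {n_tokens} tokens")
```

A lower threshold gives more merges and shorter token sequences at the cost of a larger vocabulary, which is the trade-off to evaluate on the real data.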