I train the UNIGRAM model on a large corpus in the tsv "sentence tab frequency" format. The input is structured, where the alphabet consists of 8K characters, and all the words are length 4. The resulting number even of possible trigrams on the first three symbols is in the billions, but somehow the number of the resulting seed sentencepieces is ~30M. Because of this, I achieve a very low compression rate on the corpus compared to BPE with the same vocabulary size.
I train the UNIGRAM model on a large corpus in the tsv "sentence tab frequency" format. The input is structured, where the alphabet consists of 8K characters, and all the words are length 4. The resulting number even of possible trigrams on the first three symbols is in the billions, but somehow the number of the resulting seed sentencepieces is ~30M. Because of this, I achieve a very low compression rate on the corpus compared to BPE with the same vocabulary size.
Here is the train config
Relevant log piece:
Any ideas for why the resulting number of seed sentencepieces is so low?