google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Same oov count while using different vocab size #960

Closed shreyasinghal-17 closed 8 months ago

shreyasinghal-17 commented 8 months ago

To analyze how vocabulary size affects the OOV count, I took a training corpus of 99 lakh (9.9 million) Malayalam sentences containing 378,649 unique words, and a test set of 1,000 sentences with 5,322 unique words. 3,864 words are common to both the train and test sets. On encoding the test set, I get an unk count of 1,885, which is greater than the number of unseen words (5322 - 3864 = 1458).

To better analyze the result, I tried the same exercise in Hindi:

- train corpus size: 43 lakh (4.3 million)
- unique words in the train corpus: 93k
- test corpus size: 1,000 sentences
- unique words in the test corpus: 5,517
- words common to both the train and test sets: 4,173
- unseen words: 1,344

In this case, the unk count after encoding is 8. I also observed that whatever vocabulary size I used (e.g., 250, 500, 4k, 8k, 16k, 32k, 50k, 1 lakh), the number of unks stayed the same.
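For context, here is a rough sketch of how such word-level counts can be computed (whitespace tokenization is an assumption, and the file names are the ones used in the commands below):

    # Count unique words per corpus and their overlap; whitespace tokenization
    # is an assumption, the actual preprocessing may differ.
    def unique_words(path):
        words = set()
        with open(path, encoding="utf-8") as f:
            for line in f:
                words.update(line.split())
        return words

    train_words = unique_words("train_corpus.csv")
    test_words = unique_words("test_corpus_2.csv")
    common = train_words & test_words    # words seen in both sets
    unseen = test_words - train_words    # test words never seen in training
    print(len(train_words), len(test_words), len(common), len(unseen))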

Could you help me figure out where I might be making a mistake, or help me better understand these results?

For training I used:

    spm_train --input=train_corpus.csv --model_prefix=trained_models_unigram/trained_model_vs_32k --vocab_size=32000 --character_coverage=1.0

For encoding:

    spm_encode --model=trained_models_unigram/trained_model_vs_32k.model --output_format=piece --extra_options=unk < test_corpus_2.csv > trained_models_unigram/op_vs_32k
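The same experiment can also be sketched with the sentencepiece Python API (the file paths mirror the commands above; the list of vocab sizes is just illustrative):

    import sentencepiece as spm

    # For each vocab size: train a unigram model, encode the test set, and
    # count how many pieces map to the <unk> id.
    for vocab_size in [4000, 8000, 16000, 32000]:
        prefix = f"trained_models_unigram/trained_model_vs_{vocab_size}"
        spm.SentencePieceTrainer.train(
            input="train_corpus.csv",
            model_prefix=prefix,
            vocab_size=vocab_size,
            character_coverage=1.0,
        )
        sp = spm.SentencePieceProcessor(model_file=prefix + ".model")
        unk_id = sp.unk_id()
        with open("test_corpus_2.csv", encoding="utf-8") as f:
            unk_count = sum(
                1
                for line in f
                for piece_id in sp.encode(line.rstrip("\n"), out_type=int)
                if piece_id == unk_id
            )
        print(vocab_size, unk_count)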

taku910 commented 8 months ago

Since subword tokenization is a technique introduced to eliminate OOVs, it is normal behavior for the OOV count not to vary with the vocabulary size. When --character_coverage=1.0 is specified, all characters seen during training are kept in the vocabulary, so the only OOVs are the characters in the test corpus that never appear in the training corpus, and these are not affected by the vocab_size setting.
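For example, the characters that can still map to <unk> under --character_coverage=1.0 are those appearing in the test data but never in the training data; this can be checked roughly as follows (file names taken from the commands in the issue):

    # Characters present in the test corpus but absent from the training corpus;
    # with character_coverage=1.0 these are effectively the only source of <unk>.
    def charset(path):
        chars = set()
        with open(path, encoding="utf-8") as f:
            for line in f:
                chars.update(line.rstrip("\n"))
        return chars

    unseen_chars = charset("test_corpus_2.csv") - charset("train_corpus.csv")
    print(len(unseen_chars), sorted(unseen_chars))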

shreyasinghal-17 commented 8 months ago

If I trained the model with --character_coverage=1.0, which essentially means every character is part of its vocab, shouldn't the unk count be 0?