google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

High frequency token segmented into letter sequence when input is a tsv file #967

Open TingxunShi opened 7 months ago

TingxunShi commented 7 months ago

Hi,

I currently need to train a SentencePiece model. Since the input files are very large (~80 GB, 500M lines), I set up training to use a TSV file instead of raw text files as input. The format of the input file is simply "word\tfrequency". I trained the model with the unigram LM. After training, I found that some short but high-frequency words are tokenized into character sequences. Below is an example:

echo "i am going to the park on monday" | spm_encode --model=sp-lm.model
▁ i ▁ a m ▁going ▁to ▁the ▁park ▁ o n ▁monday

However, the frequencies of these words are:

i       144250667
am      5376197
on      79402723
park    1233890

The result is somewhat counterintuitive. Did I make a mistake somewhere, is this simply how the algorithm behaves, or is something else going on? The word-frequency file was generated from an untokenized dataset and has ~60M lines. My training command is

spm_train \
    --input=counts \
    --input_format=tsv \
    --model_prefix=prefix \
    --vocab_size=48000 \
    --character_coverage=0.9999 \
    --num_threads=50 \
    --max_sentence_length=2048 \
    --normalization_rule_name=identity \
    --unk_surface="<unk>" \
    --train_extremely_large_corpus \
    --user_defined_symbols="<foo>,<bar>" \
    --byte_fallback \
    --split_digits
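
For reference, a counts file in the "word\tfrequency" format described above can be produced from a raw corpus with a pipeline along these lines (a rough sketch; corpus.txt is a placeholder name, not necessarily how the original file was built):

# Split the corpus into one word per line, count occurrences, and emit
# "word<TAB>count" rows suitable for spm_train --input_format=tsv.
tr -s '[:space:]' '\n' < corpus.txt | sort | uniq -c | \
    awk '{print $2 "\t" $1}' > counts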

taku910 commented 7 months ago

This is a very difficult bug to fix. The frequency information in the TSV is not used when extracting the initial seed tokens, because frequencies are not available when high-frequency substrings are extracted with the suffix array.

The only workaround for now is to write high-frequency tokens multiple times in the TSV file.
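
For example, one rough way to apply this workaround with awk (the scale factor and file names are illustrative placeholders, not an official recipe):

# Emit each "word<TAB>freq" row ceil(freq / scale) times, so that very
# frequent words appear several times in the TSV passed to spm_train.
awk -F'\t' -v scale=1000000 \
    '{ n = int(($2 + scale - 1) / scale); for (i = 0; i < n; i++) print }' \
    counts > counts_repeated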

TingxunShi commented 7 months ago

Thank you for the clarification. I also trained a model and set the model_type to BPE. Despite it being slightly slower, the results now appear much more in line with what is typically expected.
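
For reference, that presumably amounts to the same training command with the model type switched, along the lines of (prefix-bpe is a placeholder; the other flags are kept as in the unigram run above):

spm_train \
    --input=counts \
    --input_format=tsv \
    --model_type=bpe \
    --model_prefix=prefix-bpe \
    --vocab_size=48000 \
    --character_coverage=0.9999 \
    --num_threads=50 \
    --max_sentence_length=2048 \
    --normalization_rule_name=identity \
    --unk_surface="<unk>" \
    --train_extremely_large_corpus \
    --user_defined_symbols="<foo>,<bar>" \
    --byte_fallback \
    --split_digits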

bauwenst commented 6 months ago

The only workaround for now is to write high-frequency tokens multiple times in the TSV file.

@taku910 Why is this the case? You would expect every word in the TSV file to be unique, so it's quite surprising that something special happens when a word is repeated.