Open TingxunShi opened 7 months ago
This is a very difficult bug to fix. The frequency information is not used when extracting the initial seed tokens, because frequencies are not available when the Suffix Array extracts high-frequency substrings.
The only workaround for now is to write high-frequency tokens multiple times in the TSV file.
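A minimal sketch of that workaround: expand the TSV by repeating each row in proportion to its frequency (capped, so the file stays bounded). The function name `expand_tsv` and the `cap` parameter are illustrative, not part of SentencePiece.

```python
import csv
import io

def expand_tsv(tsv_text: str, cap: int = 100) -> str:
    # Repeat each "word\tfreq" row min(freq, cap) times so that the
    # suffix-array seed extraction encounters frequent words more often.
    # This is an illustrative helper, not a SentencePiece API.
    rows = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = []
    for word, freq in rows:
        out.extend([f"{word}\t{freq}"] * min(int(freq), cap))
    return "\n".join(out) + "\n"

print(expand_tsv("hello\t3\nrare\t1\n"))
```

Whether each repeated row should keep its original frequency or carry a frequency of 1 is unclear from the thread; the sketch keeps the original value.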
Thank you for the clarification. I also trained a model with model_type set to BPE. Although it is slightly slower, the results now appear much more in line with what is typically expected.
> The only workaround for now is to write high-frequency tokens multiple times in the TSV file.
@taku910 Why is this the case? You would expect every word in the TSV file to be unique, so it's quite surprising that repeating a word changes anything.
Hi,
Currently I need to train a SentencePiece model. Since the input files are too large (~80 GB, 500M lines), I set up the program to use a TSV file instead of raw text files as input. Each line of the input file is simply "word\tfrequency". I trained the model with the unigram LM. After training, I found that some short but high-frequency words are tokenized into character sequences. Below is an example:
However, the frequency is:
The result is somewhat counterintuitive. Did I make a mistake, is this what the algorithm actually does, or is something else going on? The word-frequency file was generated from an untokenized dataset and has ~60M lines. My training command is
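A word-frequency file in this "word\tfrequency" shape can be produced from a raw corpus with a single counting pass. The helper below (`corpus_to_tsv` is an illustrative name, not a SentencePiece tool) sketches the idea:

```python
from collections import Counter

def corpus_to_tsv(lines):
    # Count whitespace-separated tokens across all lines and emit
    # one "word\tfrequency" row per distinct word, most frequent first.
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return "\n".join(f"{w}\t{c}" for w, c in counts.most_common())

print(corpus_to_tsv(["the cat sat", "the cat"]))
# "the" and "cat" each appear twice, "sat" once
```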