google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

spm.set_random_generator_seed didn't have the expected output #609

Closed. YuHengKit closed this issue 3 years ago.

YuHengKit commented 3 years ago

I tried to run the code below:

    import sentencepiece as spm

    spm.set_random_generator_seed(1)
    spm.SentencePieceTrainer.train(
        '--input=botchan.txt --model_type=bpe --vocab_size=10000 '
        '--model_prefix=bpe --pad_id=0 --unk_id=1 --eos_id=-1 --bos_id=-1 '
        '--control_symbols=[CLS],[SEP],[MASK] '
        '--user_defined_symbols="(,),\",-,.,–,£,€" '
        '--shuffle_input_sentence=true --input_sentence_size=10000000 '
        '--character_coverage=0.99995')
    sp = spm.SentencePieceProcessor()
    sp.load('bpe.model')
    text = 'This is a test'
    for _ in range(10):
        x = sp.encode(text, out_type=int, enable_sampling=True, alpha=0.1, nbest_size=-1)
        print(x)

The results are inconsistent, as below:

    [473, 9931, 23, 4, 2, 262]
    [473, 96, 4, 2, 262]
    [473, 96, 4, 2, 262]
    [61, 9938, 23, 9931, 23, 4, 2, 43, 9933]
    [386, 9937, 9939, 96, 9931, 9935, 2, 262]
    [473, 96, 4, 2, 262]
    [473, 96, 4, 2, 43, 9933]
    [473, 96, 4, 9931, 9933, 262]
    [473, 96, 4, 2, 262]
    [61, 9938, 23, 96, 4, 2, 262]

I also tried many different random seed values, but each still gives inconsistent results. Is there a recommended range of seed values, given that alpha=0.1? I expected identical output once the random seed is set; otherwise we cannot produce reproducible results. Thank you.

taku910 commented 3 years ago

This is expected: set_random_generator_seed just sets the global seed of the pseudo-random generator. Fixing the seed makes the whole sequence of sampled segmentations reproducible across program runs, but consecutive encode calls within a single run still draw different samples from that stream.
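For illustration, a minimal sketch of the reproducibility the seed does give, using only the calls from the snippet above (bpe.model is the model trained there):

    import sentencepiece as spm

    # Seed the global pseudo-random generator once, before any sampling.
    spm.set_random_generator_seed(1)

    sp = spm.SentencePieceProcessor()
    sp.load('bpe.model')

    # The ten samples below differ from each other (that is the point of
    # sampling), but re-running this whole script reproduces the same ten lists.
    for _ in range(10):
        print(sp.encode('This is a test', out_type=int,
                        enable_sampling=True, alpha=0.1, nbest_size=-1))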

By the way, the motivation of BPE-dropout and subword regularization is to produce multiple segmentations per epoch, to virtually augment the training data. Your proposal does not fit this purpose, as it would leave us with only a single result per input.
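To make the intended usage concrete, here is a hedged sketch of how sampled segmentation is typically wired into a training loop (the corpus and the train_step comment are placeholders, not part of sentencepiece). If you instead need one reproducible segmentation per input, call encode without enable_sampling:

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor()
    sp.load('bpe.model')

    corpus = ['This is a test', 'Another sentence']  # placeholder data

    for epoch in range(3):
        for sentence in corpus:
            # A fresh segmentation is sampled every epoch, so the model sees
            # several tokenizations of the same text: virtual data augmentation.
            ids = sp.encode(sentence, out_type=int,
                            enable_sampling=True, alpha=0.1, nbest_size=-1)
            # train_step(ids)  # placeholder for the model update

    # Deterministic alternative: without enable_sampling, encode always
    # returns the same single segmentation for a given input.
    print(sp.encode('This is a test', out_type=int))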