This is expected, because set_random_generator_seed just sets the global seed of the pseudo-random generator.
By the way, the motivation of BPE-dropout and subword regularization is to produce multiple segmentations per epoch to virtually augment the training data. Your proposal does not fit this purpose, as we would only have a single result per input.
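To make the distinction concrete, here is a minimal sketch (assuming the standard sentencepiece Python bindings and an already-trained model saved as bpe.model; both names are only illustrative). The seed fixes the generator's starting state, so the full sequence of samplings should be reproducible from one run of the script to the next, while successive calls within a single run still differ from each other.

import sentencepiece as spm

# Fix the global PRNG state once, at startup (single-threaded case).
spm.set_random_generator_seed(1)

sp = spm.SentencePieceProcessor()
sp.load('bpe.model')

text = 'This is a test'

# Each call draws fresh numbers from the seeded generator, so the ten
# samplings below differ from each other within this run ...
samples = [sp.encode(text, out_type=int, enable_sampling=True,
                     alpha=0.1, nbest_size=-1) for _ in range(10)]
print(samples)

# ... but re-running the whole script should print the same list,
# because the generator starts again from the same seeded state.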
I tried to run the code below:
import sentencepiece as spm
spm.set_random_generator_seed(1)
spm.SentencePieceTrainer.train('--input=botchan.txt --model_type=bpe --vocab_size=10000 --model_prefix=bpe --pad_id=0 --unk_id=1 --eos_id=-1 --bos_id=-1 --control_symbols=[CLS],[SEP],[MASK] --user_defined_symbols="(,),\",-,.,–,£,€" --shuffle_input_sentence=true --input_sentence_size=10000000 --character_coverage=0.99995')
sp = spm.SentencePieceProcessor()
sp.load('bpe.model')
text='This is a test'
for _ in range(10):
    x = sp.encode(text, out_type=int, enable_sampling=True, alpha=0.1, nbest_size=-1)
    print(x)
The results are inconsistent, as shown below:
[473, 9931, 23, 4, 2, 262]
[473, 96, 4, 2, 262]
[473, 96, 4, 2, 262]
[61, 9938, 23, 9931, 23, 4, 2, 43, 9933]
[386, 9937, 9939, 96, 9931, 9935, 2, 262]
[473, 96, 4, 2, 262]
[473, 96, 4, 2, 43, 9933]
[473, 96, 4, 9931, 9933, 262]
[473, 96, 4, 2, 262]
[61, 9938, 23, 96, 4, 2, 262]
I tried many different random seed values, but they also give varying results. Is there a recommended range of seed values, given that alpha=0.1? I expected to get the same output every time once the random seed is set; otherwise we are not able to produce reproducible results. Thank you.
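If the goal is literally one fixed segmentation per input, the deterministic path is simply to turn sampling off, which also gives up the regularization effect described above; a rough sketch, reusing the model from the example:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('bpe.model')

text = 'This is a test'

# With sampling disabled, encode returns the single best segmentation,
# so every call yields the same ids regardless of any random seed.
for _ in range(10):
    print(sp.encode(text, out_type=int, enable_sampling=False))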