google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

multi-thread batch encode seems slower than list comprehension #1039

Closed Mr-Grin closed 1 month ago

Mr-Grin commented 1 month ago

I tested encode speed in Python with:

import timeit

# tokenizer_object wraps a SentencePiece model loaded from my project's config
tokenizer_object = tokenizer.Tokenizer(model_config.tokenizer)

a = ['tctactctattatcatc',
'acgttttttttcgtactatcgatcgatcgatcgatc',
'gcatgcta'
]

def batch_encode_tokenizer():
    # single call with a list: the batched (multi-threaded) encode path
    tokenizer_object.encode(a)

def encode_tokenizer():
    # one encode call per string via a list comprehension
    [tokenizer_object.encode(i) for i in a]

execution_time = timeit.timeit(batch_encode_tokenizer, number=100_000) / 100_000
print(tokenizer_object.encode(a))
print(f"Average execution time over 100_000 runs: {execution_time:.6f} seconds")

execution_time = timeit.timeit(encode_tokenizer, number=100_000) / 100_000
print([tokenizer_object.encode(i) for i in a])
print(f"Average execution time over 100_000 runs: {execution_time:.6f} seconds")

The list comprehension appears to be roughly 10x faster than the multi-threaded batch encode:

[[13, 256, 37, 448], [114, 31, 2139, 35, 1532, 335, 335, 335, 25], [19, 67, 8189, 8188]]
Average execution time over 100_000 runs: 0.000152 seconds
[[13, 256, 37, 448], [114, 31, 2139, 35, 1532, 335, 335, 335, 25], [19, 67, 8189, 8188]]
Average execution time over 100_000 runs: 0.000015 seconds

Why is that? Batched processing is usually expected to be faster.

taku910 commented 1 month ago

With such short inputs and a small batch, the cost of spawning threads probably outweighs the encoding work itself. Try with a large number of long sentences.
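
For reference, a minimal sketch of that larger benchmark, assuming a trained model file (m.model is a placeholder), a synthetic corpus, and a recent sentencepiece release where SentencePieceProcessor.encode accepts a list and a num_threads keyword:

import timeit
import sentencepiece as spm

# Placeholder model file; substitute your own trained SentencePiece model.
sp = spm.SentencePieceProcessor(model_file="m.model")

# A large batch of long sentences (synthetic placeholder data).
corpus = ["acgt" * 500] * 10_000

def batch_encode():
    # One call with the whole list: the multi-threaded batch path.
    sp.encode(corpus, out_type=int, num_threads=8)

def loop_encode():
    # One encode call per sentence.
    [sp.encode(s, out_type=int) for s in corpus]

for fn in (batch_encode, loop_encode):
    t = timeit.timeit(fn, number=10) / 10
    print(f"{fn.__name__}: {t:.4f} s per run")

With inputs of this size, the per-call thread startup cost is amortized over much more encoding work, which is the regime where the batched path is expected to win.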