I tested encode speed in Python with:

import timeit

tokenizer_object = tokenizer.Tokenizer(model_config.tokenizer)
a = [
    'tctactctattatcatc',
    'acgttttttttcgtactatcgatcgatcgatcgatc',
    'gcatgcta',
]
def batch_encode_tokenizer():
    tokenizer_object.encode(a)

def encode_tokenizer():
    [tokenizer_object.encode(i) for i in a]
execution_time = timeit.timeit(batch_encode_tokenizer, number=100_000) / 100_000
print(tokenizer_object.encode(a))
print(f"Average execution time over 100_000 runs: {execution_time:.6f} seconds")
execution_time = timeit.timeit(encode_tokenizer, number=100_000) / 100_000
print([tokenizer_object.encode(i) for i in a])
print(f"Average execution time over 100_000 runs: {execution_time:.6f} seconds")
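A side note on the measurement itself: per-call averages in the microsecond range are sensitive to scheduler noise, and the `timeit` docs suggest using `repeat()` and taking the minimum of several runs. A minimal sketch of that pattern (with a placeholder workload, since the tokenizer setup above is specific to my environment):

```python
import timeit

def work():
    # placeholder workload standing in for a single encode call
    sum(range(100))

# Repeat the 1_000-call loop five times and keep the fastest run;
# the minimum is the run least contaminated by background noise.
best = min(timeit.repeat(work, number=1_000, repeat=5)) / 1_000
print(f"best per-call time: {best:.9f} seconds")
```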
It seems that list comprehension is 10x faster than multi-thread batch encoding:
[[13, 256, 37, 448], [114, 31, 2139, 35, 1532, 335, 335, 335, 25], [19, 67, 8189, 8188]]
Average execution time over 100_000 runs: 0.000152 seconds
[[13, 256, 37, 448], [114, 31, 2139, 35, 1532, 335, 335, 335, 25], [19, 67, 8189, 8188]]
Average execution time over 100_000 runs: 0.000015 seconds
Why is that? Usually, batched processing should be faster.
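My current guess is that the fixed cost of dispatching work to threads dominates for inputs this small. A toy comparison illustrating that effect (a hypothetical `fake_encode` stands in for the real tokenizer, since I don't know how it parallelizes internally):

```python
import timeit
from concurrent.futures import ThreadPoolExecutor

seqs = [
    'tctactctattatcatc',
    'acgttttttttcgtactatcgatcgatcgatcgatc',
    'gcatgcta',
]

def fake_encode(s):
    # hypothetical stand-in for a single cheap encode call
    return [ord(c) for c in s]

pool = ThreadPoolExecutor(max_workers=4)

def threaded():
    # each tiny task pays queueing + thread wake-up cost in the pool
    return list(pool.map(fake_encode, seqs))

def direct():
    # plain list comprehension: no dispatch overhead at all
    return [fake_encode(s) for s in seqs]

t_threaded = timeit.timeit(threaded, number=10_000) / 10_000
t_direct = timeit.timeit(direct, number=10_000) / 10_000
print(f"threaded: {t_threaded:.6f} s, direct: {t_direct:.6f} s")
```

On my understanding, the thread-pool handoff costs tens of microseconds per batch, which swamps three short encode calls; with long sequences or large batches the overhead amortizes and batching should win.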