google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

How to deal with id #1023

Open 980202006 opened 3 months ago

980202006 commented 3 months ago

I have some sequences of ID values and I want to train BPE on them. The following is an example of the ID values.

26865, 5412, 26865, 26865, 26865, 26865, 5412, 5412, 25283, 26865, 3395, 26865, 3395, 19440, 25283, 3395, 24032, 1175, 3395, 3395, 3395, 26865, 1175, 26865, 15807, 15807, 27062, 27062, 26865, 4759, 26865, 26865, 27062, 1175, 1175, 1175, 382, 382, 382, 382, 27474, 23834, 29768, 11946, 11946, 27474, 17279

I want to extract a repeated group such as [26865, 26865] as a vocabulary entry.
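One possible way to treat each ID as an indivisible symbol, sketched here under assumptions that the thread does not confirm, is to map every ID to a single Unicode private-use character before training, so that a character-level BPE trainer can only merge whole IDs and never digits inside an ID. The helper names (`ids_to_text`, `text_to_ids`, `count_adjacent_pairs`) and the choice of the Supplementary Private Use Area-A as the target range are hypothetical, not part of sentencepiece. The trainer's normalization and character-coverage settings would presumably also need to leave these characters intact. The pair count at the end only illustrates which adjacent pair a BPE-style merge would pick first on the example above.

```python
from collections import Counter

# Assumption: IDs are non-negative and small enough to fit in the
# Supplementary Private Use Area-A (U+F0000..U+FFFFD).
PUA_START = 0xF0000
PUA_END = 0xFFFFD


def ids_to_text(ids):
    """Map each integer ID to exactly one private-use character."""
    for i in ids:
        if not 0 <= i <= PUA_END - PUA_START:
            raise ValueError(f"ID out of range: {i}")
    return "".join(chr(PUA_START + i) for i in ids)


def text_to_ids(text):
    """Invert ids_to_text: one character back to one integer ID."""
    return [ord(ch) - PUA_START for ch in text]


def count_adjacent_pairs(ids):
    """Count adjacent ID pairs (overlapping occurrences included)."""
    return Counter(zip(ids, ids[1:]))


# The example sequence from the post above.
ids = [
    26865, 5412, 26865, 26865, 26865, 26865, 5412, 5412, 25283, 26865,
    3395, 26865, 3395, 19440, 25283, 3395, 24032, 1175, 3395, 3395,
    3395, 26865, 1175, 26865, 15807, 15807, 27062, 27062, 26865, 4759,
    26865, 26865, 27062, 1175, 1175, 1175, 382, 382, 382, 382,
    27474, 23834, 29768, 11946, 11946, 27474, 17279,
]

text = ids_to_text(ids)          # one character per ID, no digits or spaces
assert text_to_ids(text) == ids  # the mapping round-trips

# The most frequent adjacent pair is the first BPE merge candidate.
print(count_adjacent_pairs(ids).most_common(1))  # [((26865, 26865), 4)]
```

After training on the encoded text, each learned piece could be decoded back to a list of IDs with `text_to_ids`, so a piece made of two characters would correspond to a group such as [26865, 26865].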

980202006 commented 3 months ago

If I use BPE, split_by_number breaks up the ID values regardless of whether split_by_whitespace is enabled. For example: print(sp.id_to_piece(111))  # 65, 26

azimjonn commented 3 months ago

https://github.com/google/sentencepiece/blob/master/doc/options.md#:~:text=%2D%2Dsplit_by_number%20(split%20tokens%20by%20numbers%20(0%2D9))%20%20type%3A%20bool%20default%3A%20true%0A%20%20%20%2D%2Dsplit_by_whitespace%20(use%20a%20white%20space%20to%20split%20sentence%20pieces)%20%20type%3A%20bool%20default%3A%20true%0A%20%20%20%2D%2Dsplit_digits%20(split%20all%20digits%20(0%2D9)%20into%20separate%20pieces)%20%20type%3A%20bool%20default%3A%20false

980202006 commented 3 months ago

@azimjonn Could you give a detailed configuration? The URL you gave only shows the default settings.