google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Whitespace escaping #423

Closed carlosep93 closed 4 years ago

carlosep93 commented 4 years ago

Hi,

I'm trying to use SentencePiece in a project, but I need every subword to belong to exactly one original token, as BPE does. Is this supported by the library?

Thanks in advance,

Carlos

taku910 commented 4 years ago

Could you elaborate on the request? What does "original bpe" mean in this context? It would be great if you could show an example segmentation.

carlosep93 commented 4 years ago

Thanks for answering!

I meant that every subword comes from a single original word, for example in BPE: an apple -> an app@@ le

But in sentencepiece I sometimes find that a final token comes from two different words in the original text. For example: an apple -> an_apple -> [an_app, le]
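Such cross-word merges can be detected from sentencepiece's output convention, where the meta symbol "▁" (U+2581) marks an original whitespace boundary: a piece that contains "▁" anywhere past its first character must span two source words. A minimal sketch (the helper name is hypothetical):

```python
# "▁" (U+2581) is the meta symbol sentencepiece uses to encode a
# preceding space; a word-initial piece starts with it.
WORD_BOUNDARY = "\u2581"

def spans_multiple_words(piece: str) -> bool:
    """Hypothetical helper: True if a piece crosses a word boundary,
    i.e. the boundary marker appears after the first character."""
    return WORD_BOUNDARY in piece[1:]

# Word-internal pieces for "an apple": each belongs to one word.
assert not any(spans_multiple_words(p) for p in ["▁an", "▁app", "le"])
# A piece merged across the space ("an▁app") spans two words.
assert spans_multiple_words("an▁app")
```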

I want to match each word with linguistic information, so I need to ensure that each final token comes from only one word of the text before tokenization.

taku910 commented 4 years ago

Please set --split_by_whitespace=true (the default is true), so sentences are always split on whitespace.
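For example, passing the flag at training time (the input file name and vocabulary size here are placeholders) keeps every learned piece within a single word:

```shell
# Train a model that never merges pieces across whitespace.
# --split_by_whitespace=true is the default; shown explicitly here.
spm_train --input=corpus.txt \
          --model_prefix=m \
          --vocab_size=8000 \
          --split_by_whitespace=true
```

With this setting, every piece in the resulting segmentation maps back to exactly one pre-tokenization word, as in the BPE-style example above.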