Closed carlosep93 closed 4 years ago
Could you elaborate on the request? What does "original bpe" mean in this context? It would be great if you could show an example segmentation.
Thanks for answering!
I meant that every subword comes from a single original word. For example, in bpe: an apple -> an app@@ le
But in sentencepiece I sometimes find that a final token spans two different words of the original text. For example: an apple -> an_apple -> [an_app, le]
I want to match each word with linguistic information, so I need to ensure that each final token comes from only one word of the text before tokenization.
Please use --split_by_whitespace=true (the default is true) so that sentences are always split by whitespace and no piece crosses a word boundary.
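For matching each word with its linguistic information, here is a minimal sketch of how the pieces could be aligned back to words. It assumes the pieces use SentencePiece's default "▁" (U+2581) word-start marker (written as "_" in the example above) and that no piece crosses a word boundary. The helper name `align_pieces_to_words` and the sample segmentation are hypothetical, not part of the library.

```python
WORD_START = "\u2581"  # "▁", the marker found at the start of word-initial pieces


def align_pieces_to_words(pieces):
    """Return, for each piece, the index of the whitespace-separated word it belongs to.

    Assumption: a word-initial piece starts with the marker, and the marker
    never appears elsewhere inside a piece. A marker in the middle of a piece
    means the piece spans two words, which is reported as an error.
    """
    word_ids = []
    current = -1
    for piece in pieces:
        if WORD_START in piece[1:]:
            raise ValueError(f"piece {piece!r} spans more than one word")
        # Start a new word on a marker (or on the very first piece).
        if piece.startswith(WORD_START) or current < 0:
            current += 1
        word_ids.append(current)
    return word_ids


# Hypothetical segmentation of "an apple"
pieces = ["\u2581an", "\u2581app", "le"]
print(list(zip(pieces, align_pieces_to_words(pieces))))
# → [('▁an', 0), ('▁app', 1), ('le', 1)]
```

A segmentation such as ["▁an▁app", "le"] would raise a ValueError here, which makes it easy to check that a trained model never merges across words.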
Hi,
I'm trying to use sentencepiece in a project, but I need each subword to belong to just one original token, as in bpe. Is this supported by the library?
Thanks in advance,
Carlos