Closed szha closed 6 years ago
Thank you for the feedback.
It would be possible to change the boundary marker as long as the semantics are kept. However, the '@@' marker used in subword-nmt has different semantics: it is an intra-word boundary marker. The _ in SentencePiece, on the other hand, is just an escape character for whitespace.
SentencePiece: Hello world => He llo _wor ld
subword_nmt: Hello world => He@@ llo wor@@ ld
The '@@' style assumes space-delimited words and cannot encode all the information needed to reproduce the input. For instance, consecutive spaces, as in "hello__world", cannot be represented in '@@' style.
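To make the difference concrete, here is a minimal sketch of how each style could be decoded back to text. It is not code from either library: `sp_decode` and `bpe_decode` are hypothetical helper names, the marker is written as `_` as in this thread (the SentencePiece library itself uses U+2581), and the '@@' tokens assume the subword-nmt convention of appending '@@' to every non-final piece of a word.

```python
MARKER = "_"  # whitespace escape as written in this thread (assumption)


def sp_decode(pieces, marker=MARKER):
    """Concatenate the pieces, then turn every marker back into a space."""
    return "".join(pieces).replace(marker, " ")


def bpe_decode(tokens, sep="@@"):
    """Join tokens with single spaces, then glue pieces that end in the separator."""
    return " ".join(tokens).replace(sep + " ", "")


# Both styles recover the single-space input.
print(sp_decode(["He", "llo", "_wor", "ld"]))     # Hello world
print(bpe_decode(["He@@", "llo", "wor@@", "ld"]))  # Hello world

# Two consecutive spaces survive in the marker style ...
print(repr(sp_decode(["He", "llo", "__wor", "ld"])))  # 'Hello  world'

# ... but '@@' tokens are always rejoined with exactly one space,
# so no token sequence can decode to a double space.
```

The marker style stores whitespace inside the pieces, so decoding is plain concatenation. The '@@' style stores only "this piece continues", so the spacing between words has to be assumed.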
The main advantage of SentencePiece is that it handles raw (non-segmented) text such as Chinese or Japanese, so the _ style is more natural there, as _ is just an escaped whitespace character.
Ignoring the "lossless" property of these formats, it is technically possible to support a subword-nmt style representation. Let me consider the feasibility of the '@@' style.
I just found that the '@@' style and the _ style are not always interconvertible.
A _ B => "A B"
In this case, the token "_" is used as an independent piece for whitespace, which cannot be represented in '@@' style. Whitespace is only implicitly defined in '@@' style.
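The conversion problem can be sketched as follows. This is an illustrative converter under the same assumptions as the thread's examples (marker written as `_`, '@@' appended to non-final pieces). `sp_to_bpe` is a hypothetical name, not an API of SentencePiece or subword-nmt.

```python
def sp_to_bpe(pieces, marker="_", sep="@@"):
    """Convert whitespace-marker pieces to '@@' continuation tokens.

    Raises ValueError for a piece that has no '@@' equivalent.
    """
    out = []
    for i, piece in enumerate(pieces):
        # Strip one leading marker: it only says "a new word starts here".
        body = piece[len(marker):] if piece.startswith(marker) else piece
        # A bare marker, or a marker left inside the body, is explicit
        # whitespace that '@@' style cannot express.
        if not body or marker in body:
            raise ValueError(f"piece {piece!r} has no '@@' equivalent")
        is_last = i + 1 == len(pieces)
        next_starts_word = is_last or pieces[i + 1].startswith(marker)
        out.append(body if next_starts_word else body + sep)
    return out


print(sp_to_bpe(["He", "llo", "_wor", "ld"]))  # ['He@@', 'llo', 'wor@@', 'ld']

try:
    sp_to_bpe(["A", "_", "B"])
except ValueError as err:
    print("not convertible:", err)
```

The first call succeeds because every marker is attached to a word. The second fails on the standalone whitespace piece, which is the case described above.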
In addition, the "_" style extends naturally to character-based segmentation by treating each character as one piece.
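As a small illustration of that point, the sketch below segments text into one-character pieces. `char_pieces` and `char_decode` are hypothetical names, and the marker is again written as `_` as in this thread. It assumes the input contains no literal marker character.

```python
def char_pieces(text, marker="_"):
    """Escape spaces with the marker, then emit one piece per character."""
    return list(text.replace(" ", marker))


def char_decode(pieces, marker="_"):
    """Concatenate the pieces and restore spaces from the marker."""
    return "".join(pieces).replace(marker, " ")


pieces = char_pieces("A B")
print(pieces)               # ['A', '_', 'B']
print(char_decode(pieces))  # A B
```

Whitespace becomes an ordinary piece here, so the round trip also preserves consecutive spaces.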
Please let me close this bug, because it would be hard to adopt the '@@' style at this moment. This would not be just a cosmetic change; it would require drastic changes in training.
I understand. Thanks for the explanation.
It would be great if the word boundary character in sentencepiece could be chosen by users. For example, '@@' is commonly seen in other libraries, so supporting it would make it easier to integrate sentencepiece with existing datasets and models.