Custom Word Boundary Sequence

szha commented 6 years ago

It would be great if the word boundary character in sentencepiece could be chosen by users. For example, '@@' is commonly seen in other libraries, so supporting it would make it easier to integrate sentencepiece with existing datasets and models.

taku910 commented 6 years ago

Thank you for the feedback.

It would be possible to change the boundary marker as long as its semantics are kept. However, the '@@' marker used in subword-nmt has different semantics: it is an intra-word boundary marker. In contrast, _ in sentencepiece is just an escape character for whitespace.

SentencePiece: Hello world => He llo _wor ld
subword-nmt:   Hello world => He@@ llo wor@@ ld

The '@@' style assumes space-delimited words and cannot encode all the information needed to reproduce the input. For instance, consecutive spaces, as in "hello__world", cannot be represented in '@@' style.
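The difference can be sketched in a few lines of plain Python. This is only an illustration, not the sentencepiece or subword-nmt API: the helper names `marker_decode` and `at_at_decode` are hypothetical, the whitespace marker is taken to be U+2581 (written as _ in this thread), and the '@@' side assumes subword-nmt's convention of attaching '@@' to the end of every non-final piece of a word.

```python
WS = "\u2581"  # whitespace marker; shown as "_" in this thread


def marker_decode(pieces):
    """Marker style: concatenate the pieces, then turn each marker back into a space."""
    return "".join(pieces).replace(WS, " ")


def at_at_decode(pieces, sep="@@"):
    """'@@' style: join pieces with single spaces, then glue pieces that end in the separator."""
    return " ".join(pieces).replace(sep + " ", "")


# Both styles recover a normally spaced sentence.
print(marker_decode(["He", "llo", WS + "wor", "ld"]))
print(at_at_decode(["He@@", "llo", "wor@@", "ld"]))

# Two consecutive spaces survive in marker style, because each space is a real symbol.
print(repr(marker_decode(["hello", WS, WS + "world"])))

# '@@' style starts from whitespace-split words, so the second space is already gone.
words = "hello  world".split()
print(repr(at_at_decode(words)))
```

The last two lines show the asymmetry: in the marker style the number of spaces is part of the piece sequence, while in the '@@' style it is discarded before segmentation even begins.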

The main advantage of SentencePiece is that it handles raw (non-segmented) text such as Chinese or Japanese, so the _ style is more natural, as _ is just an escaped whitespace character.

taku910 commented 6 years ago

Ignoring the "lossless" property of these formats, it is technically possible to support a subword-nmt style representation. Let me consider the feasibility of the '@@' style.

taku910 commented 6 years ago

I just found that the '@@' style and the _ style are not always convertible into each other.

A _ B => "A B"

In this case, the token "_" is used as an independent piece for whitespace, which cannot be represented in '@@' style, where whitespace is only implicitly defined.
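A minimal sketch of where such a conversion breaks down, under the same assumptions as before: `marker_to_at_at` is a hypothetical helper (it is not part of sentencepiece), the marker is U+2581, and '@@' is attached to the end of non-final pieces as in subword-nmt.

```python
WS = "\u2581"  # whitespace marker; shown as "_" in this thread


def marker_to_at_at(pieces, sep="@@"):
    """Convert marker-style pieces to '@@'-style pieces, or raise if that is impossible."""
    out = []
    for i, piece in enumerate(pieces):
        # Drop a leading marker; it only says "a space precedes this piece".
        body = piece[len(WS):] if piece.startswith(WS) else piece
        # A bare marker, or a marker inside a piece, has nothing to map to.
        if not body or WS in body:
            raise ValueError(f"piece {piece!r} has no '@@'-style equivalent")
        # A piece ends its word if it is the last one or the next piece starts a new word.
        ends_word = i + 1 == len(pieces) or pieces[i + 1].startswith(WS)
        out.append(body if ends_word else body + sep)
    return out


print(marker_to_at_at(["He", "llo", WS + "wor", "ld"]))

try:
    marker_to_at_at(["A", WS, "B"])  # the "A _ B" case above
except ValueError as err:
    print("not convertible:", err)
```

The first call succeeds because every marker is attached to the start of a word. The second fails on the standalone marker piece, since the '@@' style has no token that stands for whitespace itself.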

In addition, the "_" style extends naturally to character-based segmentation by treating each character as one piece.
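As a rough illustration of that point, assuming the marker is U+2581 and that the input contains no literal marker character, character-level pieces fall out of the same escaping step. The names `char_pieces` and `decode` are hypothetical and are not sentencepiece functions.

```python
WS = "\u2581"  # whitespace marker; shown as "_" in this thread


def char_pieces(text):
    """Escape spaces with the marker, then treat every character as one piece."""
    return list(text.replace(" ", WS))


def decode(pieces):
    """Concatenate the pieces and turn markers back into spaces."""
    return "".join(pieces).replace(WS, " ")


pieces = char_pieces("A B")
print(pieces)
print(repr(decode(char_pieces("hello  world"))))
```

Here the space becomes a piece of its own, exactly like the standalone "_" in the "A _ B" example, and decoding still reproduces the input.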

Please let me close this bug, because it would be hard to adopt the '@@' style at this moment. This would not be just a cosmetic change; it would require drastic changes to training.

szha commented 6 years ago

I understand. Thanks for the explanation.

google / sentencepiece

Custom Word Boundary Sequence #224