google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Suppression of isolated ▁'s #184

Closed feddyfedfed closed 6 years ago

feddyfedfed commented 6 years ago

To maximize likelihood, there are cases where a subword token that appears after a space (i.e., at the start of a "word") does not receive the special underscore, because the same character sequence also appears in the middle of words elsewhere.

Is it possible to suppress this behavior? Meaning, we don't want to have an isolated ▁ as part of the generated vocab list.
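
For concreteness, here is a minimal sketch of how this behavior shows up at encoding time, using the C++ API pattern from the SentencePiece README. The model path and the input text are placeholders, and the commented output is a hypothetical illustration of the reported issue, not output from a real model:

```cpp
#include <iostream>
#include <string>
#include <vector>

#include <sentencepiece_processor.h>

int main() {
  sentencepiece::SentencePieceProcessor processor;
  const auto status = processor.Load("/path/to/model.model");  // placeholder path
  if (!status.ok()) {
    std::cerr << status.ToString() << std::endl;
    return 1;
  }

  std::vector<std::string> pieces;
  processor.Encode("a contrived example", &pieces);

  // Hypothetical output when "example" was learned without the marker,
  // because it also occurs word-internally elsewhere in the corpus:
  //   ▁a ▁contrived ▁ example
  // i.e., the word boundary surfaces as an isolated "▁" piece.
  for (const std::string &piece : pieces) {
    std::cout << piece << " ";
  }
  std::cout << std::endl;
  return 0;
}
```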

taku910 commented 6 years ago

It is technically possible, especially in Latin-based languages, but I would like to know why this behavior is necessary.

Another concern is that in languages without explicit word boundaries (Chinese/Japanese), the "▁" symbol rarely appears, so it is natural to handle it as one independent symbol.

feddyfedfed commented 6 years ago

We are using the segmented text data for a speech recognition task, and in order to honor the language model probabilities, we need to incorporate the ▁ symbol in the vocabulary. We use silence as the symbol's pronunciation so that our search can still include it, but we would also like to experiment with not having it. In any case, if it's going to be too laborious, we can just manually edit our segmented texts. (But I realize that simply editing the segmented texts would result in a change to the vocabulary...)

On a more general note, what is the canonical way of telling SentencePiece NOT to segment between certain character combinations? For example in Japanese, a contrived example for glides: キャラクター may be segmented into キ ャラクター (again, this is just a contrived example, but it happens with other words). For a speech recognition task this is problematic, since we have models for the complete phoneme "ky".

taku910 commented 6 years ago

Thank you for the explanation.

There is a code block that filters out invalid pieces here: https://github.com/google/sentencepiece/blob/master/src/trainer_interface.cc#L107

You can add the following if-then block:

```cpp
if (*it == kWSChar && sentencepiece.size() == 1) { return false; }
```

Then, the isolated "▁" will not be handled as a piece. Note that in this case, if the input sentence contains "foo▁bar" and "b" is OOV, the output may include an independent "▁", which is handled as an unknown symbol.
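
For orientation, here is a simplified sketch of where that check would sit inside IsValidSentencePiece in src/trainer_interface.cc; this is paraphrased, not the verbatim source, and the surrounding validity checks are elided:

```cpp
// Simplified sketch of TrainerInterface::IsValidSentencePiece()
// (paraphrased, not the verbatim source).
bool TrainerInterface::IsValidSentencePiece(
    const string_util::UnicodeText &sentencepiece) const {
  // ... existing length/emptiness checks ...
  for (auto it = sentencepiece.begin(); it != sentencepiece.end(); ++it) {
    // Proposed addition: reject a candidate piece that consists of
    // nothing but the whitespace marker kWSChar ("▁").
    if (*it == kWSChar && sentencepiece.size() == 1) {
      return false;
    }
    // ... existing per-character validity checks ...
  }
  return true;
}
```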

We will add a flag to enable this behavior.

feddyfedfed commented 6 years ago

Again, many thanks, Taku-san.

taku910 commented 6 years ago

I found that the hack I showed doesn't work, and the fix for your proposal is a little tricky, especially for BPE segmentation.

BPE iteratively concatenates the two most frequent adjacent symbols to make a new symbol, which means that both symbols before merging must themselves be valid tokens. So suppressing only "▁" won't work in BPE segmentation.
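
To see why, here is a self-contained toy merge step (an illustrative sketch, not SentencePiece's implementation; "_" stands in for the "▁" marker): the most frequent adjacent pair is ("_", "f"), and merging it into "_f" is only possible because the bare "_" exists as a symbol of its own.

```cpp
#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy illustration of one BPE merge step. "_" stands in for the
// word-boundary marker "▁". Not SentencePiece's implementation.
int main() {
  // Words pre-split into single-character symbols, as BPE starts out.
  std::vector<std::vector<std::string>> words = {
      {"_", "f", "o", "o"}, {"_", "f", "a", "r"}, {"_", "f", "i", "t"}};

  // Count adjacent symbol pairs across all words.
  std::map<std::pair<std::string, std::string>, int> pair_counts;
  for (const auto &word : words)
    for (size_t i = 0; i + 1 < word.size(); ++i)
      ++pair_counts[{word[i], word[i + 1]}];

  // The most frequent pair is ("_", "f") with count 3; merging it yields
  // the new symbol "_f". Both halves, including the bare "_", must be
  // valid symbols for this merge to exist.
  const auto best = std::max_element(
      pair_counts.begin(), pair_counts.end(),
      [](const auto &a, const auto &b) { return a.second < b.second; });
  std::cout << "merge: \"" << best->first.first << "\" + \""
            << best->first.second << "\" -> \""
            << best->first.first + best->first.second << "\"\n";
  return 0;
}
```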

In unigram-based segmentation it is possible, but much of the vocab-filtering code is shared by BPE and unigram.

At this moment, we would prefer not to implement this feature, given the concern that the code would become too complicated. Thank you for your understanding.

feddyfedfed commented 6 years ago

No problem. At the moment, our workaround of assigning silence models to the special underscore seems to work fine. :) Thank you for considering it, though.