google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Ability to avoid rare segmentation causing UNKs #87

Closed ozancaglayan closed 6 years ago

ozancaglayan commented 6 years ago

When training a joint SPM model on two or more languages, is there a way to alleviate the problem where a token from language1 is segmented into subunits that are only seen in language2, causing UNKs at test time?

In subword-nmt, there is a vocabulary threshold for this: tokens are segmented further until every subunit is seen at least that many times in the relevant language.
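For reference, the subword-nmt workflow being described looks roughly like this (a sketch; the file names, merge count, and threshold value are placeholders, and the flag spellings assume the standard subword-nmt CLI):

# learn a joint BPE model plus one vocabulary per language (placeholder file names)
% subword-nmt learn-joint-bpe-and-vocab --input train.L1 train.L2 -s 32000 -o bpe.codes --write-vocabulary vocab.L1 vocab.L2
# apply BPE to L1 text, splitting any subunit seen fewer than 50 times in vocab.L1
% subword-nmt apply-bpe -c bpe.codes --vocabulary vocab.L1 --vocabulary-threshold 50 < test.L1 > test.L1.bpe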

taku910 commented 6 years ago

Thank you for the suggestion. I will implement this feature.

taku910 commented 6 years ago

Hi,

I've just implemented the vocabulary restriction feature. The usage is basically the same as in subword-nmt.

% spm_encode --generate_vocabulary --model=spm.model < input.L1 > vocab.L1
% spm_encode --generate_vocabulary --model=spm.model < input.L2 > vocab.L2
% spm_encode --vocabulary=vocab.L1 --vocabulary_threshold=50 --model=spm.model < input > input.seg
% spm_encode --vocabulary=vocab.L2 --vocabulary_threshold=50 --model=spm.model < input > input.seg

I will update the document soon.
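For completeness, the joint model referenced above would be trained on both languages beforehand; a minimal sketch with placeholder file names and an arbitrary vocabulary size:

# train a single joint model on the concatenation of both languages
% spm_train --input=input.L1,input.L2 --model_prefix=spm --vocab_size=8000

The resulting spm.model is then what the vocabulary-generation and restricted-encoding commands above operate on.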

ozancaglayan commented 6 years ago

That was quick, thanks!

I have one side question: spm_train normally generates a vocabulary file as well. Where is that file used in the pipeline? Once we encode our training set with an SPM model, we generally let the NMT toolkit create its own vocabulary file, right? So until the above modification, was the .vocab file there solely for convenience or statistics?

taku910 commented 6 years ago

SentencePiece (spm_(encode|decode)) doesn't use the .vocab file, as the same information is stored in the .model file. The .vocab file is emitted just for reference. The NMT toolkit can load it to manage its vocabulary, but spm_encode can also generate an id sequence directly, which bypasses the vocabulary management in the NMT toolkit.
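For example, the direct id output mentioned here can be requested via --output_format (a sketch with placeholder file names):

# emit vocabulary ids instead of subword pieces
% spm_encode --model=spm.model --output_format=id < input > input.id
# convert the ids back to raw text
% spm_decode --model=spm.model --input_format=id < input.id > input.txt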

One note is that the .vocab file doesn't contain all the unique tokens in the training corpus. spm_train first builds the valid character set so that the character-level coverage reaches 99.95% (this can be configured with the --character_coverage parameter, e.g. --character_coverage=1.0). This workaround was introduced to handle CJK languages, which have a much larger character set and, depending on the corpus, may contain many noisy symbols. The rare characters in the corpus are treated as UNK and are not included in the .vocab file.
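For instance, to keep every character seen in the training data (avoiding character-level UNKs at the cost of a noisier symbol set), the coverage can be raised at training time; a sketch with placeholder file names:

# default is --character_coverage=0.9995; 1.0 keeps all characters from the corpus
% spm_train --input=input.L1,input.L2 --model_prefix=spm --vocab_size=8000 --character_coverage=1.0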

taku910 commented 6 years ago

Updated the document. https://github.com/google/sentencepiece/blob/master/README.md#vocabulary-restriction

Let me close this issue. Thank you!

hoangcuong2011 commented 3 years ago

@taku910:

Hi, in subword-nmt there are two types of tokens: normal ones (e.g. turn) and ones that form part of a longer token (e.g. turnaround). So we have to handle not only normal tokens such as turn, but also turn@@, since the latter is used to segment a word like turnaround (e.g. turn@@ around) when turnaround itself does not appear in the merge operations.
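As an illustration of the @@ convention described here (a hypothetical segmentation of a word not covered by the merge operations):

turnaround  ->  turn@@ around

Both turn and turn@@ then occur as distinct tokens in the encoded training data.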

As a result, the true vocabulary collected from the encoded training dataset (which the NMT system has to handle) is larger than the number of merge operations in subword-nmt.

I noticed this does not seem to be the case with sentencepiece. Specifically, when I count the unique ids in the encoded training dataset, the count matches the number of merge operations. So I am curious how you handle this (the @@ part), if you can share?

Thanks.