j0ma / morph-seg

morphological / word segmentation experiments

todo: lmvr output lexicon size #6

Open j0ma opened 4 years ago

j0ma commented 4 years ago

LMVR modifies FlatCat and allows for an output lexicon size to be set.

Since we used 3 different settings for BPE (2500, 5000, 7500), it could be worthwhile to investigate corresponding settings for LMVR as well. It is not obvious whether the output vocabulary size matters in the same way when BPE is not used.

j0ma commented 4 years ago

Based on the observations below, it seems that BPE=5000 in Flores yields an output vocabulary of size 5000. It would therefore make sense to start the LMVR experiments with an output lexicon size of 5000 as well.
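One quick way to sanity-check the "BPE=5000 gives a 5000-token output vocabulary" claim is to count the unique tokens in the segmented training corpus. A minimal sketch (the inlined sample lines stand in for a real BPE/LMVR-segmented output file):

```python
# Count the output vocabulary of a whitespace-tokenized, segmented corpus.
# In practice, read the segmented training file instead of this toy sample.
segmented_lines = [
    "lo@@ w lo@@ w er",
    "ne@@ w est wid@@ est",
]

vocab = set()
for line in segmented_lines:
    vocab.update(line.split())

print(len(vocab))  # number of distinct output tokens -> 6 for this sample
```

Running the same count over the BPE=5000 output would show whether the vocabulary actually lands near 5000 (it can fall slightly short if some merges never fire in the final corpus).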

subword-nmt

```
└─[$] subword-nmt learn-bpe --help
usage: subword-nmt learn-bpe [-h] [--input PATH] [--output PATH]
                             [--symbols SYMBOLS] [--min-frequency FREQ]
                             [--dict-input] [--total-symbols] [--verbose]

learn BPE-based word segmentation

[...]

  --symbols SYMBOLS, -s SYMBOLS
                        Create this many new symbols (each representing a
                        character n-gram) (default: 10000)
  --total-symbols, -t   subtract number of characters from the symbols to be
                        generated (so that '--symbols' becomes an estimate for
                        the total number of symbols needed to encode text).
```
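The `--symbols` count is a number of merge operations, and each merge introduces at most one new symbol on top of the initial character inventory; that inventory is what `--total-symbols` subtracts. A toy BPE learner (a stdlib-only sketch, not the subword-nmt implementation) makes the "one merge, one new symbol" relationship concrete:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges over a word-frequency dict.

    Each word is a tuple of symbols (plus an end-of-word marker); every
    merge fuses the most frequent adjacent pair into one new symbol.
    """
    vocab = {tuple(w) + ("</w>",): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# Toy corpus; each merge learned below adds one symbol to the inventory.
words = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(words, 10)
print(len(merges))  # at most 10 new symbols created
```

So `-s 5000` creates (at most) 5000 new symbols, and the final symbol inventory is roughly 5000 plus the number of distinct characters, unless `--total-symbols` is passed.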

sentencepiece

From the documentation:

> What is SentencePiece?

[...]

> The number of unique tokens is predetermined

> Neural Machine Translation models typically operate with a fixed vocabulary. Unlike most unsupervised word segmentation algorithms, which assume an infinite vocabulary, SentencePiece trains the segmentation model such that the final vocabulary size is fixed, e.g., 8k, 16k, or 32k.

> Note that SentencePiece specifies the final vocabulary size for training, which is different from subword-nmt that uses the number of merge operations. The number of merge operations is a BPE-specific parameter and not applicable to other segmentation algorithms, including unigram, word and character.
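To target a given final vocabulary with subword-nmt, one would roughly subtract the character inventory from the desired size, which is what `--total-symbols` automates. A back-of-the-envelope sketch (the 120-character inventory is an assumed example value):

```python
def merges_for_target_vocab(target_vocab_size, num_chars):
    """Approximate the number of BPE merge operations needed so that the
    final symbol inventory (characters + merged symbols) is about
    `target_vocab_size`. Each merge adds at most one new symbol."""
    return max(target_vocab_size - num_chars, 0)

# e.g. a target vocabulary of 5000 over a corpus with 120 distinct characters
print(merges_for_target_vocab(5000, 120))  # 4880
```

This is only an upper-bound estimate: merges that produce symbols never used in the final segmentation can leave the realized vocabulary slightly smaller.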