j0ma / morph-seg

morphological / word segmentation experiments

todo: lmvr output lexicon size #6

Open j0ma opened 4 years ago

j0ma commented 4 years ago

LMVR modifies FlatCat and allows for an output lexicon size to be set.

Since we used 3 different settings for BPE (2500, 5000, 7500), it could be worthwhile to investigate corresponding settings for LMVR as well. It is not obvious whether the output vocabulary size matters in the same way when BPE is not used.

j0ma commented 4 years ago

Based on the observations below, it seems that BPE=5000 in Flores yields an output vocabulary of size 5000. It would therefore make sense to start the LMVR experiments with an output lexicon size of 5000 as well.
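One quick way to sanity-check the "BPE=5000 gives a 5000-token output vocabulary" claim is to count the unique tokens in the segmented training corpus. A minimal sketch (the inlined sample lines stand in for a real BPE/LMVR-segmented output file):

```python
# Count the output vocabulary of a whitespace-tokenized, segmented corpus.
# In practice, read the segmented training file instead of this toy sample.
segmented_lines = [
    "lo@@ w lo@@ w er",
    "ne@@ w est wid@@ est",
]

vocab = set()
for line in segmented_lines:
    vocab.update(line.split())

print(len(vocab))  # number of distinct output tokens -> 6 for this sample
```

Running the same count over the BPE=5000 output would show whether the vocabulary actually lands near 5000 (it can fall slightly short if some merges never fire in the final corpus).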

subword-nmt

```
└─[$] subword-nmt learn-bpe --help
usage: subword-nmt learn-bpe [-h] [--input PATH] [--output PATH]
                             [--symbols SYMBOLS] [--min-frequency FREQ]
                             [--dict-input] [--total-symbols] [--verbose]

learn BPE-based word segmentation

[...]

  --symbols SYMBOLS, -s SYMBOLS
                        Create this many new symbols (each representing a
                        character n-gram) (default: 10000)
  --total-symbols, -t   subtract number of characters from the symbols to be
                        generated (so that '--symbols' becomes an estimate for
                        the total number of symbols needed to encode text).
```
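The `--symbols` count is a number of merge operations, and each merge introduces at most one new symbol on top of the initial character inventory; that inventory is what `--total-symbols` subtracts. A toy BPE learner (a stdlib-only sketch, not the subword-nmt implementation) makes the "one merge, one new symbol" relationship concrete:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges over a word-frequency dict.

    Each word is a tuple of symbols (plus an end-of-word marker); every
    merge fuses the most frequent adjacent pair into one new symbol.
    """
    vocab = {tuple(w) + ("</w>",): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# Toy corpus; each merge learned below adds one symbol to the inventory.
words = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(words, 10)
print(len(merges))  # at most 10 new symbols created
```

So `-s 5000` creates (at most) 5000 new symbols, and the final symbol inventory is roughly 5000 plus the number of distinct characters, unless `--total-symbols` is passed.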

sentencepiece

From the documentation:

> What is SentencePiece?

[...]

> The number of unique tokens is predetermined

> Neural Machine Translation models typically operate with a fixed vocabulary. Unlike most unsupervised word segmentation algorithms, which assume an infinite vocabulary, SentencePiece trains the segmentation model such that the final vocabulary size is fixed, e.g., 8k, 16k, or 32k.

> Note that SentencePiece specifies the final vocabulary size for training, which is different from subword-nmt that uses the number of merge operations. The number of merge operations is a BPE-specific parameter and not applicable to other segmentation algorithms, including unigram, word and character.
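To target a given final vocabulary with subword-nmt, one would roughly subtract the character inventory from the desired size, which is what `--total-symbols` automates. A back-of-the-envelope sketch (the 120-character inventory is an assumed example value):

```python
def merges_for_target_vocab(target_vocab_size, num_chars):
    """Approximate the number of BPE merge operations needed so that the
    final symbol inventory (characters + merged symbols) is about
    `target_vocab_size`. Each merge adds at most one new symbol."""
    return max(target_vocab_size - num_chars, 0)

# e.g. a target vocabulary of 5000 over a corpus with 120 distinct characters
print(merges_for_target_vocab(5000, 120))  # 4880
```

This is only an upper-bound estimate: merges that produce symbols never used in the final segmentation can leave the realized vocabulary slightly smaller.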