j0ma opened 4 years ago
Based on the observations below, it seems like BPE=5000 in Flores means there will be an output vocabulary of size 5000. Therefore, it would make sense to start the LMVR experiments with an output lexicon size of 5000 as well.
```
└─[$] subword-nmt learn-bpe --help                                    [12:52:32]
usage: subword-nmt learn-bpe [-h] [--input PATH] [--output PATH]
                             [--symbols SYMBOLS] [--min-frequency FREQ]
                             [--dict-input] [--total-symbols] [--verbose]

learn BPE-based word segmentation

[...]

  --symbols SYMBOLS, -s SYMBOLS
                        Create this many new symbols (each representing a
                        character n-gram) (default: 10000))
  --total-symbols, -t   subtract number of characters from the symbols to be
                        generated (so that '--symbols' becomes an estimate for
                        the total number of symbols needed to encode text).
```

So `symbols` refers to characters + learned BPE chunks if `--total-symbols` is not specified. In the `BPE` class there is a `merges` argument.
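The relationship between merge count and inventory size is easy to see in a toy re-implementation of the merge loop (a sketch in the style of the original BPE algorithm, not the actual subword-nmt code): each merge operation mints exactly one new symbol on top of the base characters, which is what `--total-symbols` compensates for.

```python
import collections
import re

def get_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), w): f for w, f in vocab.items()}

# Toy corpus: word -> frequency, symbols separated by spaces.
vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
base_chars = {s for w in vocab for s in w.split()}

num_merges = 3  # analogous to subword-nmt -s without --total-symbols
merges = []
for _ in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    merges.append(best)

# Each merge adds one new symbol, so the full inventory is
# base characters plus one symbol per merge.
inventory = base_chars | {"".join(m) for m in merges}
print(len(base_chars), len(merges), len(inventory))  # prints: 11 3 14
```

With 11 base characters and 3 merges the inventory is 14 symbols, so specifying `-s 5000` without `--total-symbols` yields a vocabulary of 5000 merges *plus* the character set.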
From the documentation:

> What is SentencePiece?
>
> [...]
>
> The number of unique tokens is predetermined
>
> Neural Machine Translation models typically operate with a fixed vocabulary. Unlike most unsupervised word segmentation algorithms, which assume an infinite vocabulary, SentencePiece trains the segmentation model such that the final vocabulary size is fixed, e.g., 8k, 16k, or 32k.
>
> Note that SentencePiece specifies the final vocabulary size for training, which is different from subword-nmt that uses the number of merge operations. The number of merge operations is a BPE-specific parameter and not applicable to other segmentation algorithms, including unigram, word and character.
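Concretely, the arithmetic for matching the two tools looks like this (the character count is a made-up placeholder; the real number depends on the training corpus):

```python
# Hypothetical character inventory size, for illustration only.
num_chars = 120

target_vocab = 5000

# SentencePiece fixes the final vocabulary directly, e.g.
#   spm_train ... --vocab_size=5000
sp_vocab = target_vocab

# subword-nmt without --total-symbols: -s counts merge operations,
# so the final inventory is roughly characters + merges.
vocab_without_total = num_chars + target_vocab   # 5120, not 5000

# subword-nmt with --total-symbols: the character count is subtracted,
# making -s an estimate of the total inventory instead.
merges_with_total = target_vocab - num_chars     # 4880 merges
vocab_with_total = num_chars + merges_with_total # back to 5000

print(vocab_without_total, merges_with_total, vocab_with_total)
```

So to reproduce a SentencePiece-style vocabulary of 5000 with subword-nmt, either pass `--total-symbols` or subtract the character count from `-s` manually.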
LMVR modifies FlatCat and allows an output lexicon size to be set.

Since we used three different settings for BPE (2500, 5000, 7500), it could be worthwhile to investigate the corresponding settings for LMVR as well. It is not obvious whether the vocabulary size will matter when BPE is not used.
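One way to pick comparable LMVR lexicon sizes is to decide what the three BPE settings actually denote. Depending on whether they were final vocabulary sizes (as concluded above for Flores) or raw merge counts, the matching lexicon sizes differ slightly (the character count below is a made-up placeholder):

```python
# Hypothetical character inventory size; substitute the corpus's real count.
num_chars = 120
bpe_settings = [2500, 5000, 7500]

# If the settings already fix the final vocabulary (SentencePiece-style),
# the LMVR lexicon sizes can match them directly.
as_vocab_sizes = bpe_settings

# If they were merge counts (subword-nmt -s without --total-symbols),
# the implied inventory is larger by the character count.
as_merge_counts = [num_chars + m for m in bpe_settings]

print(as_vocab_sizes)   # [2500, 5000, 7500]
print(as_merge_counts)  # [2620, 5120, 7620]
```

Either way, sweeping the same three settings for LMVR would make the comparison with the BPE runs direct.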