tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

*feature request* subwords with a vocab of less than 5k? #729

Open sebastian-nehrdich opened 6 years ago

sebastian-nehrdich commented 6 years ago

Hello, I am currently trying to get a transformer going for segmentation of scriptio continua languages. I noticed that decreasing the vocab_size improved the performance of the transformer in this scenario. However, when I try to set the subword vocab size to less than 5k, it generates ~5k subwords anyway, no matter what parameters I give. So even with 2**10 it generates a ~5k vocab instead of the expected 1k. Is this the intended behaviour?
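For reference, this is roughly how I request the target size, assuming a custom problem subclassing `Text2TextProblem` (the class name `SmallVocabSegmentation` is just a placeholder, and `approx_vocab_size` is, as the name says, only approximate):

```python
from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry


@registry.register_problem
class SmallVocabSegmentation(text_problems.Text2TextProblem):
  """Hypothetical problem requesting a ~1k subword vocab."""

  @property
  def vocab_type(self):
    return text_problems.VocabType.SUBWORD

  @property
  def approx_vocab_size(self):
    # Requested target; the generated vocab lands around 5k anyway,
    # which is the behaviour reported in this issue.
    return 2**10
```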

martinpopel commented 6 years ago

You should try a character-based model for this (see e.g. TranslateEndeWmtCharacters). The subword vocabulary must contain all characters from the training data, and the subword algorithm performs merges in "batches" (merging one pair at a time would be too slow), so a small target size is only approximate and can be overshot.
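Switching a custom problem to character level is a one-property change; a minimal sketch (the class name is hypothetical):

```python
from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry


@registry.register_problem
class SegmentationCharacters(text_problems.Text2TextProblem):
  """Hypothetical character-level variant; no subword vocab is built."""

  @property
  def vocab_type(self):
    # Uses t2t's byte-level text encoder, so every symbol in the
    # training data is covered without any merge step.
    return text_problems.VocabType.CHARACTER
```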

sebastian-nehrdich commented 6 years ago

Thank you for your reply! I have already trained a character-based model; currently I am just running some tests to see how the performance goes with different vocab sizes such as 4k, 2k, or 1k. Maybe I will try SentencePiece if it is not possible to get a smaller vocab with the t2t-internal subwords, because SentencePiece seems to be capable of producing a vocab of 1k or 2k.
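For that experiment, a minimal SentencePiece training sketch would look something like this (the corpus file and model prefix are placeholder names):

```python
import sentencepiece as spm

# Train a unigram model with an exact 1k vocab on a plain-text corpus;
# "corpus.txt" and the "seg1k" prefix are placeholders.
spm.SentencePieceTrainer.Train(
    "--input=corpus.txt --model_prefix=seg1k "
    "--vocab_size=1000 --model_type=unigram"
)

# Load the trained model and segment a sample sentence into pieces.
sp = spm.SentencePieceProcessor()
sp.Load("seg1k.model")
print(sp.EncodeAsPieces("an example sentence"))
```

Unlike the t2t subword builder, SentencePiece treats `vocab_size` as a hard constraint rather than an approximate target, which is why it can hit 1k or 2k exactly.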