JamesDConley closed this issue 3 years ago.
I am also very interested in this question. Is it possible to train a model on a dataset that contains multiple languages (shuffled together) instead of just one? Does the parameter --accept-language="de,en,fr,zh,..." help?
Thank you very much.
It is theoretically possible to combine multiple model files into one model file when model_type=unigram (the default). The model file is stored as a protobuf, so you might want to simply merge the 'pieces' fields (a rough sketch is shown below). Please see this page for how to manually edit the model file: https://github.com/google/sentencepiece/blob/master/python/add_new_vocab.ipynb
It is possible to train a single model from a dataset containing various languages. By the way, --accept_language is not used in the training or testing phases. These languages are kept only as references in the model file for debugging.
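For concreteness, here is a rough sketch of what merging the 'pieces' fields could look like in Python, following the approach in the linked notebook. The file names en.model, zh.model and merged.model are placeholders, and depending on your sentencepiece version you may need to generate sentencepiece_model_pb2 from sentencepiece_model.proto yourself:

```python
# Rough sketch: merge the 'pieces' of two unigram SentencePiece models.
# File names are placeholders; both models must be model_type=unigram.
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

base = sp_pb2.ModelProto()
with open("en.model", "rb") as f:
    base.ParseFromString(f.read())

other = sp_pb2.ModelProto()
with open("zh.model", "rb") as f:
    other.ParseFromString(f.read())

seen = {p.piece for p in base.pieces}
for p in other.pieces:
    if p.piece in seen:
        continue  # skip pieces (including <unk>, <s>, </s>) already in the base model
    new_piece = base.pieces.add()
    new_piece.piece = p.piece
    new_piece.score = p.score  # score is the log-probability from the source model
    new_piece.type = p.type
    seen.add(p.piece)

with open("merged.model", "wb") as f:
    f.write(base.SerializeToString())
```

Note that the copied scores come from two independently trained distributions, which is what the renormalization discussion further down is about.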
What about a BPE model? How can we combine two BPE models?
same question.
- It is theoretically possible to combine multiple model files into one model file when model_type=unigram (the default). The model file is stored as a protobuf. You might want to simply merge the 'pieces' fields. Please see this page to manually edit the model file: https://github.com/google/sentencepiece/blob/master/python/add_new_vocab.ipynb
This seems like a nice solution. How do you deal with renormalizing the probability distribution, though? The XLM-V paper and another nice one from EMNLP 2020 on which it is based say variants of, "[we] create the final multilingual vocabulary by taking the union of the vocabularies for each cluster." [the clusters are SPM vocabularies trained separately], but it is not clear how this union is done.
First, for "renormalizing the probability distribution": the authors of the XLM-V paper average the probabilities of any shared vocabulary items.
Secondly, for the EMNLP 2020 paper: the XLM-V paper actually explains the approach much better than the original paper, imo.
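As a hedged illustration of that averaging (assuming unigram models, whose piece scores are log-probabilities), the shared-piece case in the merge sketch above could be handled like this; averaged_score is just a hypothetical helper name:

```python
# Sketch: average the probabilities of a piece that appears in both models.
# Scores in a unigram model are log-probabilities, so average in probability space.
import math

def averaged_score(score_a: float, score_b: float) -> float:
    """Log of the mean of the two probabilities implied by the log-prob scores."""
    return math.log((math.exp(score_a) + math.exp(score_b)) / 2.0)

# In the merge loop, for a piece present in both models one might set:
#   p.score = averaged_score(p.score, other_scores[p.piece])
```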
- It is theoretically possible to combine multiple model files into one model file when model_type=unigram (the default). The model file is stored as a protobuf. You might want to simply merge the 'pieces' fields. Please see this page to manually edit the model file: https://github.com/google/sentencepiece/blob/master/python/add_new_vocab.ipynb
- It is possible to train a single model from a dataset containing various languages. By the way, --accept_language is not used in the training or testing phases. These languages are kept only as references in the model file for debugging.
Hi, I trained three SentencePiece models separately: one from a word-frequency file (for English) and two from plain text (for Chinese), all using the default model type (unigram). I did manage to merge the SP models following the ipynb example you provided, but the behaviour is quite unexpected: English sentences are segmented into smaller units than before. For example:
echo "It has been three years since I worked here." | spm_encode --model=en.model
▁It ▁has ▁been ▁three ▁years ▁since ▁ I ▁worked ▁here .
echo "It has been three years since I worked here." | spm_encode --model=enzh.model
▁I t ▁has ▁be en ▁three ▁ ye a rs ▁s in ce ▁I ▁worked ▁here .
I would appreciate your assistance in identifying what I might have done wrong. Thank you!
I face this problem as well.
I have vocab.model and vocab.vocab files for a number of separate languages. Is it possible to combine these into one vocab model that I could then apply to the whole set?