google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Combine vocabularies from various languages #588

Closed JamesDConley closed 3 years ago

JamesDConley commented 3 years ago

I have .model and .vocab files for a number of separate languages. Is it possible to combine these into one vocab model that I could then apply to the whole set?

mh1337 commented 3 years ago

I am also very interested in this question. Is it possible to train a model with a dataset that contains various languages (shuffled) instead of just one? Does the parameter --accept_language="de,en,fr,zh,..." help?

Thank you very much.

taku910 commented 3 years ago
  1. It is theoretically possible to combine multiple model files into one model file when model_type=unigram (the default). The model file is stored as a protobuf, so you might want to simply merge the 'pieces' fields; a sketch of this follows the list below. Please see this page for how to manually edit the model file: https://github.com/google/sentencepiece/blob/master/python/add_new_vocab.ipynb

  2. It is possible to train a single model from a dataset containing various languages (also sketched below). By the way, --accept_language is not used in the training or testing phase; these languages are only kept as references in the model file for debugging.
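
For point 1, a minimal sketch of that merge, assuming a recent pip release where the proto module ships as sentencepiece.sentencepiece_model_pb2; the file names en.model, zh.model and merged.model are placeholders. Note that each piece keeps the score it had in whichever model it came from, so scores estimated by different models end up mixed in one table.

import sentencepiece.sentencepiece_model_pb2 as sp_pb2

def load_model(path):
    # Each .model file is a serialized ModelProto.
    m = sp_pb2.ModelProto()
    with open(path, "rb") as f:
        m.ParseFromString(f.read())
    return m

base = load_model("en.model")    # the merged model inherits en.model's settings and special tokens
other = load_model("zh.model")

seen = {p.piece for p in base.pieces}
for p in other.pieces:
    if p.piece not in seen:      # skip pieces (and control symbols) that are already present
        base.pieces.append(p)    # each piece keeps the score from its source model
        seen.add(p.piece)

with open("merged.model", "wb") as f:
    f.write(base.SerializeToString())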
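
For point 2, no special handling is needed to train on a mixed corpus; here is a sketch with the Python trainer, where multilingual.txt, the vocab size and the language list are placeholder choices.

import sentencepiece as spm

# Train one model on shuffled sentences from all languages at once.
spm.SentencePieceTrainer.train(
    input="multilingual.txt",       # concatenated multilingual corpus (placeholder name)
    model_prefix="multi",
    vocab_size=64000,
    model_type="unigram",           # the default
    accept_language="de,en,fr,zh",  # stored only as metadata, as noted above
)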

Sar-Dar commented 2 years ago

What about the BPE model type? How can we combine two BPE models?

ftgreat commented 1 year ago

  What about the BPE model type? How can we combine two BPE models?

Same question.

kellymarchisio commented 1 year ago
  1. It is theoretically possible to combine multiple model files into one model file when model_type=unigram (the default). The model file is stored as a protobuf, so you might want to simply merge the 'pieces' fields. Please see this page for how to manually edit the model file: https://github.com/google/sentencepiece/blob/master/python/add_new_vocab.ipynb

This seems like a nice solution. How do you deal with renormalizing the probability distribution, though? The XLM-V paper and another nice one from EMNLP 2020 on which it is based say variants of, "[we] create the final multilingual vocabulary by taking the union of the vocabularies for each cluster." [the clusters are SPM vocabularies trained separately], but it is not clear how this union is done.

chris-ha458 commented 1 year ago

First, for "renormalizing the probability distribution": the authors of the XLM-V paper average the scores of any vocabulary items shared between models.

Second, regarding the EMNLP 2020 paper: the XLM-V paper actually explains the procedure much better than the original paper, imo.
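
To make that concrete, below is a rough sketch of the "union of the vocabularies" merge with score averaging for pieces that appear in more than one model. It only illustrates the idea described above (it is not the XLM-V authors' code), and the cluster*.model paths are placeholders.

import sentencepiece.sentencepiece_model_pb2 as sp_pb2

def load_model(path):
    m = sp_pb2.ModelProto()
    with open(path, "rb") as f:
        m.ParseFromString(f.read())
    return m

# Placeholder paths for the separately trained per-cluster models.
models = [load_model(p) for p in ("cluster0.model", "cluster1.model", "cluster2.model")]

# Collect every piece together with the score it received in each model it appears in.
scores, types = {}, {}
for m in models:
    for p in m.pieces:
        scores.setdefault(p.piece, []).append(p.score)
        types.setdefault(p.piece, p.type)

merged = sp_pb2.ModelProto()
merged.CopyFrom(models[0])   # inherit normalizer/trainer settings from the first model
del merged.pieces[:]

# Insertion order keeps the first model's control symbols (<unk>, <s>, </s>) at their original ids.
for piece, piece_scores in scores.items():
    new_piece = merged.pieces.add()
    new_piece.piece = piece
    new_piece.score = sum(piece_scores) / len(piece_scores)   # average the scores of shared pieces
    new_piece.type = types[piece]

with open("union.model", "wb") as f:
    f.write(merged.SerializeToString())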

TingxunShi commented 7 months ago
  1. It is theoretically possible to combine multiple model files into one model file when model_type=unigram (the default). The model file is stored as a protobuf, so you might want to simply merge the 'pieces' fields. Please see this page for how to manually edit the model file: https://github.com/google/sentencepiece/blob/master/python/add_new_vocab.ipynb
  2. It is possible to train a single model from a dataset containing various languages. By the way, --accept_language is not used in the training or testing phase; these languages are only kept as references in the model file for debugging.

Hi, I trained three SentencePiece models separately, one from a word-frequency file (for English) and two from plain text (for Chinese), all with the default model type (unigram). I did manage to merge the SP models following the ipynb example you provided, but the behaviour is quite unexpected: English sentences are segmented into smaller units. For example:

echo "It has been three years since I worked here." | spm_encode --model=en.model
▁It ▁has ▁been ▁three ▁years ▁since ▁ I ▁worked ▁here .
echo "It has been three years since I worked here." | spm_encode --model=enzh.model
▁I t ▁has ▁be en ▁three ▁ ye a rs ▁s in ce ▁I ▁worked ▁here .

I would appreciate your assistance in identifying what I might have done wrong. Thank you!
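
For reference, the unigram segmenter picks the segmentation with the highest total piece score, so it is the relative scores after the merge that decide these splits. Below is a small sketch to compare the scores the affected pieces carry in en.model versus the merged enzh.model (the piece list just mirrors the example above).

import sentencepiece.sentencepiece_model_pb2 as sp_pb2

def piece_scores(path):
    m = sp_pb2.ModelProto()
    with open(path, "rb") as f:
        m.ParseFromString(f.read())
    return {p.piece: p.score for p in m.pieces}

en = piece_scores("en.model")
merged = piece_scores("enzh.model")

# Pieces that en.model used but the merged model split apart.
for piece in ["▁It", "▁been", "▁years", "▁since"]:
    print(piece, "en:", en.get(piece), "merged:", merged.get(piece))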

thananchaiktw commented 5 months ago
  I did manage to merge the SP models following the ipynb example you provided, but the behaviour is quite unexpected: English sentences are segmented into smaller units.

I am facing the same problem as well.