I even tried to install the Python wrapper as mentioned, after building the SentencePiece C++ library:
To build and install the Python wrapper from source, please install SentencePiece C++ and try the following commands:
% python setup.py build
% sudo python setup.py install
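(As a quick sanity check that the freshly built wrapper is actually the one being imported, something like the following should help; `__version__` and `__file__` are standard module attributes:)
import sentencepiece as spm
print(spm.__version__)  # version of the wrapper that actually gets imported
print(spm.__file__)     # install path, shows whether this is the new source build or an older install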
Then I followed the same example as in the README:
import sentencepiece as spm3
# train sentencepiece model from `botchan.txt` and makes `m.model` and `m.vocab`
# `m.vocab` is just a reference. not used in the segmentation.
spm3.SentencePieceTrainer.train('--input=10_fairseq/zwj/parallel-05-08-2020.tu.tok.cl6.si-en-ta.si,10_fairseq/zwj/parallel-05-08-2020.tu.tok.cl6.si-en-ta.en --model_prefix=si --vocab_size=80')
# makes segmenter instance and loads the model file (m.model)
sp = spm3.SentencePieceProcessor()
sp.load('si.model')
# encode: text => id
print(sp.encode_as_pieces('ගරු සජිත් ප්රේමදාස '))
print(sp.encode_as_ids('ගරු සජිත් ප්රේමදාස '))
This ended up with the following AttributeError:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-23-fa51b060cb4c> in <module>()
3 # train sentencepiece model from `botchan.txt` and makes `m.model` and `m.vocab`
4 # `m.vocab` is just a reference. not used in the segmentation.
----> 5 spm3.SentencePieceTrainer.train('--input=10_fairseq/zwj/parallel-05-08-2020.tu.tok.cl6.si-en-ta.si,10_fairseq/zwj/parallel-05-08-2020.tu.tok.cl6.si-en-ta.en --model_prefix=si --vocab_size=80')
6
7 # makes segmenter instance and loads the model file (m.model)
AttributeError: module 'sentencepiece' has no attribute 'SentencePieceTrainer'
It seems the changes were not picked up correctly, which results in this error?
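(One thing that could cause this is an older pip-installed sentencepiece shadowing the source build; something like the following, run before reinstalling, should rule that out:)
% pip show sentencepiece       # check whether an older wheel is still installed
% pip uninstall sentencepiece  # remove it so the source build is the one that gets imported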
Merged the new normalization rule in v0.19.6.
@taku910 First of all, thanks for your support on this. Even though the changes in https://github.com/google/sentencepiece/pull/630 were merged to keep the zero-width joiner as it is instead of replacing it with whitespace, the issue mentioned in #629 is still there: the zero-width joiner is getting replaced by whitespace.
I followed the same procedure as mentioned to build and install the SentencePiece command-line tools from the C++ source, and then trained my SPM model as given below.
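(Roughly the build steps from the README, assuming a standard cmake build:)
% git clone https://github.com/google/sentencepiece.git
% cd sentencepiece
% mkdir build && cd build
% cmake ..
% make -j $(nproc)
% sudo make install
% sudo ldconfig -v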
logs:
Then, to train the SPM model:
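(The command-line equivalent of the Python call above would be roughly the following; the input files and vocab size are simply repeated from that call:)
% spm_train --input=10_fairseq/zwj/parallel-05-08-2020.tu.tok.cl6.si-en-ta.si,10_fairseq/zwj/parallel-05-08-2020.tu.tok.cl6.si-en-ta.en --model_prefix=si --vocab_size=80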
I am getting the results below for the input "ගරු සජිත් ප්රේමදාස මහතා":
▁ගරු ▁සජිත් ▁ප් ▁රේමදාස ▁මහතා
Ideally it should be: ▁ගරු ▁සජිත් ▁ප් රේමදාස ▁මහතා
Could you please let me know where/how this went wrong?
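(A quick way to verify whether the ZWJ actually survives normalization; the joiner is written as an explicit \u200d escape below so it cannot be silently dropped when copying the text:)
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
sp.load('si.model')
# build the test sentence with an explicit zero-width joiner (U+200D)
text = 'ගරු සජිත් ප්' + '\u200d' + 'රේමදාස මහතා'
for piece in sp.encode_as_pieces(text):
    # print each piece with its code points to see whether U+200D is still there
    print(piece, ['U+%04X' % ord(c) for c in piece])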
Additional: please refer to my Colab notebook here: https://colab.research.google.com/drive/1MkTFcfcTUTVm3rOdUKoVtvCkTwXr7oWT?usp=sharing