google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

How to create new model file with restricted vocabulary? #522

Open sshleifer opened 4 years ago

sshleifer commented 4 years ago

Similar to #474, I want to restrict my vocabulary, and then save a new model file that uses the restricted vocabulary.

I tried to do this by saving a vocabulary, modifying it, and then figuring out how to save the restricted model, but I found that even without any modification, running spm_export_vocab followed by spm_encode --vocabulary produces different results.

For example,

echo "Șeful ONU declară că nu există soluții militare în Siria" | spm_encode --model enro_trimmed/sentence.bpe.model
=> ▁Ș e ful ▁ONU ▁de cla ră ▁că ▁nu ▁există ▁solu ții ▁militare ▁în ▁Siria
spm_export_vocab --model enro_trimmed/sentence.bpe.model  --output=sp_vocab.txt
 echo "Șeful ONU declară că nu există soluții militare în Siria" | spm_encode --model enro_trimmed/sentence.bpe.model --vocabulary sp_vocab.txt
=> ▁ Ș e f u l ▁ O N U ▁ d e c l a r ă ▁ c ă ▁ n u ▁ e x i s t ă ▁ s o l u ț i i ▁ m i l i t a r e ▁ î n ▁ S i r i a

Is this expected behavior? My end goal is that, in Python, spm.encode_as_ids only produces ids smaller than the length of the restricted vocab, so if there is a more direct way to achieve that objective I would love to know it!

Thanks!
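For reference, the README's vocabulary-restriction workflow builds the vocabulary file with spm_encode --generate_vocabulary (which emits piece-frequency pairs) rather than spm_export_vocab (which emits piece-score pairs), so the behavior above may come down to a file-format mismatch. A sketch of that workflow, with train.ro standing in as a hypothetical corpus:

spm_encode --model enro_trimmed/sentence.bpe.model --generate_vocabulary < train.ro > sp_vocab.txt
spm_encode --model enro_trimmed/sentence.bpe.model --vocabulary sp_vocab.txt --vocabulary_threshold 50 < train.ro

Pieces whose frequency falls below the threshold are then avoided during encoding in favor of smaller, in-vocabulary pieces.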

sshleifer commented 4 years ago

Trying to work around this, I made an ordered list of the pieces I want to keep, like:

[ ...
  score: 0.0, type: UNKNOWN,
  piece: "<s>", score: 0.0, type: CONTROL,
  piece: "</s>", score: 0.0, type: CONTROL,
  piece: ",", score: -3.4635426998138428,
  piece: ".", score: -3.625642776489258,
  ...
]
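(As an aside, a dump like this can be reproduced by loading the model proto and printing its pieces; a minimal sketch, using the same sentencepiece_model_pb2 module that appears later in this thread:)

from sentencepiece import sentencepiece_model_pb2  # bundled with recent pip releases; older setups generate it with protoc

m = sentencepiece_model_pb2.ModelProto()
with open("enro_trimmed/sentence.bpe.model", "rb") as f:
    m.ParseFromString(f.read())
for p in m.pieces[:5]:
    print(p)  # each entry prints its piece, score and type fields in protobuf text format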

Then I try to make a new model with those pieces:


sp_new = sentencepiece_model_pb2.ModelProto()
sp_new.pieces = new_pieces

I get

AttributeError: Assignment not allowed to repeated field "pieces" in protocol message object.

Is this second approach the right way to do this? Surely somebody besides me must have tried to restrict a sentencepiece model from using certain pieces before?
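The AttributeError is standard protobuf behavior: repeated fields cannot be assigned, only mutated in place via add(), extend(), insert(), or del. A minimal sketch of the intended restriction, where keep is a hypothetical set of piece strings to retain:

import copy
from sentencepiece import sentencepiece_model_pb2

m = sentencepiece_model_pb2.ModelProto()
with open("enro_trimmed/sentence.bpe.model", "rb") as f:
    m.ParseFromString(f.read())

keep = {"<unk>", "<s>", "</s>", ",", "."}  # hypothetical subset; retain the unk/control pieces too
kept = [copy.deepcopy(p) for p in m.pieces if p.piece in keep]
del m.pieces[:]        # clear the repeated field in place
m.pieces.extend(kept)  # extend() is allowed; plain assignment is not

with open("restricted.model", "wb") as f:
    f.write(m.SerializeToString())

Note that this renumbers every surviving piece, so any ids cached downstream have to be rebuilt.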

theoqian commented 4 years ago

I have the same need as you: saving a new model with a restricted vocabulary. But it seems that sentencepiece doesn't provide such an API in Python. SetVocabulary doesn't change the model. Looking forward to new APIs for saving a new model and changing the actual vocabulary of a model.

sshleifer commented 4 years ago

~~Does SetVocabulary do anything? Do you have an example of how to use it?~~ SetVocabulary example: https://github.com/google/sentencepiece/issues/250
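For reference, usage from that issue is roughly as follows (a sketch, assuming the Python wrapper's SetVocabulary accepts a list of piece strings; it only restricts which pieces the encoder may emit, and leaves the model file and the id space untouched, which is why it doesn't help with saving a restricted model):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("enro_trimmed/sentence.bpe.model")

# first column of the exported vocab file is the piece string
allowed = [line.split("\t")[0] for line in open("sp_vocab.txt", encoding="utf-8")]
sp.SetVocabulary(allowed)  # segmentation-time restriction only; ids are unchanged
print(sp.EncodeAsPieces("Șeful ONU declară că nu există soluții militare în Siria"))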

gmryu commented 2 years ago

Not sure if this is still needed.

I managed to create a new piece via copy.deepcopy(m.pieces[0]), and using it I can create a new spm model. I used it like this:

import copy

def new_piece_by_deepcopy(original_piece, token: str, score: float, piece_type: int):
    '''
    Args:
        original_piece: (SentencePiece) the target of deepcopy
        token: (str) the piece string
        score: (float) priority of encoding to this token (see spm.vocab).
        piece_type: (int) 1: normal, 2: <unk>, 3: control, 4: user defined, 5: unused.

    Return:
        a SentencePiece with the given token, score and piece_type
    '''
    new_p = copy.deepcopy(original_piece)  # not a good way, but it does work.
    new_p.piece = token
    new_p.score = score
    new_p.type = piece_type  # the proto field is named "type", not "piece_type"
    return new_p

from sentencepiece import sentencepiece_model_pb2  # generated protobuf module for sentencepiece_model.proto

serializedStr = open(spm_path, "rb").read()
m = sentencepiece_model_pb2.ModelProto()
m.ParseFromString(serializedStr)

m.pieces.insert(0, new_piece_by_deepcopy(m.pieces[0], "<s>", 0.0, 3))
m.pieces.insert(2, new_piece_by_deepcopy(m.pieces[0], "</s>", 0.0, 3))
# these bos/eos positions are meant to match a fairseq dict.

with open(new_spmPath + ".model", "wb") as f:
    f.write(m.SerializeToString())

In this case, the final spm model gets <s> and </s>, the vocabulary is increased by 2, and it still tokenizes my sentences correctly.
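A quick way to sanity-check the result (a sketch continuing from the variables above):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load(new_spmPath + ".model")
print(sp.GetPieceSize())                 # original piece count + 2
print(sp.IdToPiece(0), sp.IdToPiece(2))  # "<s>" and "</s>"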