google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Is it possible to add normalization rules into a trained sentence piece model? #1053

Closed: lost-libra closed this issue 2 months ago

lost-libra commented 2 months ago

According to the official docs, I can add custom normalization rules during training.

However, how can I add custom normalization rules to an already trained sp model?

If I retrain the sp model, every model that relies on it has to be retrained as well. So I would be very grateful if someone could help me figure out how to add custom normalization rules to a trained sp model.

Thanks in advance.
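
For reference, the training-time route mentioned above takes the rules as a TSV file. The sketch below writes such a file from plain Python string pairs. It assumes the format described in the SentencePiece normalization doc: each line holds a source and a target sequence separated by a tab, and each sequence is a list of hex Unicode code points separated by spaces. The helper names `to_hex_codepoints` and `write_rules_tsv`, the output file name, and the sample rule are illustrative and not part of the library.

```python
from pathlib import Path
from typing import Mapping


def to_hex_codepoints(text: str) -> str:
    """Render a string as space-separated hex code points, e.g. 'Ab' -> '41 62'."""
    return " ".join(f"{ord(ch):X}" for ch in text)


def write_rules_tsv(rules: Mapping[str, str], path: str) -> str:
    """Write one 'source<TAB>target' line per rule and return the file content."""
    lines = [
        f"{to_hex_codepoints(src)}\t{to_hex_codepoints(dst)}"
        for src, dst in rules.items()
    ]
    content = "\n".join(lines) + "\n"
    Path(path).write_text(content, encoding="utf-8")
    return content


# Hypothetical rule: map the ligature U+FB01 to the two letters "fi".
tsv = write_rules_tsv({"\ufb01": "fi"}, "normalization_rule.tsv")
print(tsv, end="")
```

The resulting file would then be passed to training through the `--normalization_rule_tsv` flag, which is the step that produces the new model whose rules get transferred later in this thread.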

taku910 commented 2 months ago

At your own risk, it is possible to take only the normalization rules from the new model and transfer them to the old model.

The model file is a serialized protobuf, and ModelProto is the model definition. Just replacing the normalizer_spec field may work.

https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto#L321

See the protobuf documentation for instructions on how to edit protobufs: https://protobuf.dev/

lost-libra commented 2 months ago

Thank you for your kind help; it works for me. For anyone else who needs this, here is my approach.


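```python
# The pb2 module ships with recent sentencepiece pip packages (protobuf must be installed).
# If you compiled sentencepiece_model.proto yourself, use `import sentencepiece_model_pb2 as model`.
from sentencepiece import sentencepiece_model_pb2 as model

# Replace the normalizer charsmap of the 1st spm with the one from the 2nd spm.
fp_m1 = "your-1st-spm-file-path"  # trained model to keep (vocabulary stays as-is)
fp_m2 = "your-2nd-spm-file-path"  # model trained with the custom normalization rules

# Load both model protobufs.
m1, m2 = model.ModelProto(), model.ModelProto()
with open(fp_m1, "rb") as f:
    m1.ParseFromString(f.read())
with open(fp_m2, "rb") as f:
    m2.ParseFromString(f.read())

# Copy only the compiled rules; the other normalizer_spec fields of m1 are left untouched.
m1.normalizer_spec.precompiled_charsmap = m2.normalizer_spec.precompiled_charsmap

# Save the patched model under a new path.
with open("your-final-spm-file-path", "wb") as f:
    f.write(m1.SerializeToString())
```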
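
Since the edit is "at your own risk", a quick sanity check after patching may be worthwhile: the vocabulary should be identical to the original model, and only the normalization charsmap should differ. The sketch below is one way to check that. `compare_models` is a hypothetical helper, not a SentencePiece API; it only reads the `pieces` (with `piece` and `score`) and `normalizer_spec.precompiled_charsmap` fields, so it accepts parsed ModelProto objects. The demo uses `SimpleNamespace` stand-ins so it runs without any model files.

```python
from types import SimpleNamespace


def compare_models(before, after):
    """Report whether the vocabulary is unchanged and the charsmap was replaced.

    Both arguments are ModelProto-like objects (hypothetical helper).
    """
    vocab_before = [(p.piece, p.score) for p in before.pieces]
    vocab_after = [(p.piece, p.score) for p in after.pieces]
    return {
        "vocab_unchanged": vocab_before == vocab_after,
        "charsmap_changed": (
            before.normalizer_spec.precompiled_charsmap
            != after.normalizer_spec.precompiled_charsmap
        ),
    }


# Stand-in objects for illustration only; real use would pass parsed ModelProto instances.
pieces = [SimpleNamespace(piece="\u2581a", score=-1.0)]
old = SimpleNamespace(
    pieces=pieces,
    normalizer_spec=SimpleNamespace(precompiled_charsmap=b"old-rules"),
)
new = SimpleNamespace(
    pieces=pieces,
    normalizer_spec=SimpleNamespace(precompiled_charsmap=b"new-rules"),
)
report = compare_models(old, new)
print(report)
```

In practice, `before` would be the original model file parsed as in the script above and `after` the patched file. It is also worth encoding a few sample sentences with both files to confirm the new rules take effect.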