At your own risk, it is possible to take out only the normalization rules from the new model and transfer them to the old model.
The model file is a serialized protobuf, and ModelProto is the model definition, so just replacing the normalizer_spec field may work.
https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto#L321
See the protobuf documentation for instructions on how to edit protobufs: https://protobuf.dev/
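For reference, a rough sketch of that replacement using the Python protobuf bindings; sentencepiece_model_pb2 is assumed to be the module generated from sentencepiece_model.proto with protoc, and the file names old.model, new.model, and old_patched.model are placeholders:

```python
import sentencepiece_model_pb2 as model

# parse both serialized models
old_m, new_m = model.ModelProto(), model.ModelProto()
with open("old.model", "rb") as f:
    old_m.ParseFromString(f.read())
with open("new.model", "rb") as f:
    new_m.ParseFromString(f.read())

# copy the entire normalizer_spec of the new model into the old one
old_m.normalizer_spec.CopyFrom(new_m.normalizer_spec)

# write the patched old model back out
with open("old_patched.model", "wb") as f:
    f.write(old_m.SerializeToString())
```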
Thank you for your kind help; it works for me. As a reference for others who have the same need, I leave my approach here.
```python
import sentencepiece_model_pb2 as model

# replace the normalizer spec of your 1st spm with that of your 2nd spm
# load both model protobufs
fp_m1 = "your-1st-spm-file-path"
fp_m2 = "your-2nd-spm-file-path"
m1, m2 = model.ModelProto(), model.ModelProto()
with open(fp_m1, "rb") as f:
    m1.ParseFromString(f.read())
with open(fp_m2, "rb") as f:
    m2.ParseFromString(f.read())

# replace the precompiled normalization rules
m1.normalizer_spec.precompiled_charsmap = m2.normalizer_spec.precompiled_charsmap

# save the patched model
with open("your-final-spm-file-path", "wb") as f:
    f.write(m1.SerializeToString())
```
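A quick way to sanity-check the patched model is to load it and tokenize a sample string; this assumes a recent sentencepiece Python package that accepts the model_file constructor argument, and the path and sample text are placeholders:

```python
import sentencepiece as spm

# load the patched model and run a sample string through it
# to check that the new normalization rules take effect
sp = spm.SentencePieceProcessor(model_file="your-final-spm-file-path")
print(sp.encode("Ｆｕｌｌ－ｗｉｄｔｈ　ｔｅｘｔ", out_type=str))
```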
According to the official docs, I can add custom normalization rules during training (a rough sketch of that is below).
However, how can I add custom normalization rules to an already-trained SentencePiece model?
If I retrain the SentencePiece model, every model that relies on it would have to be retrained as well, so I would be very grateful if someone could help me figure out how to add custom normalization rules to a trained SentencePiece model.
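For reference, the training-time route I mean looks roughly like this; the corpus, model prefix, vocab size, and TSV file name are placeholders:

```python
import sentencepiece as spm

# train with a custom normalization rule TSV; each line of the TSV maps
# source Unicode codepoints to target codepoints, tab-separated
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="custom_norm",
    vocab_size=8000,
    normalization_rule_tsv="custom_rules.tsv",
)
```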
Thanks in advance.